Criteria & Standards
CCGD-ESCC integrates results from a GWAS of 2,022 ESCC cases and 2,039 controls, a survival GWAS of 1,006 ESCC patients, an eQTL analysis of 94 ESCC patients, and SNVs/indels in protein-coding regions identified by WES or WGS together with their associations with survival in 675 ESCC patients. The enrollment criteria for study subjects have been described in previous publications (Wu C, et al. 2011, 2012, 2013; Chang J, et al. 2017). Informed consent was obtained from all participants, and this study was approved by the Institutional Review Board of the Chinese Academy of Medical Sciences, Cancer Hospital.
Germline genotype data of 2,022 cases and 2,039 controls were obtained using the Affymetrix GeneChip Human Mapping 6.0 set (Affymetrix). To increase power and coverage for association analyses, we used MACH software to impute untyped SNPs across the whole genome, with the 1000 Genomes Phase 3 ASN (Asian) panel as the reference. We then filtered out SNPs with low imputation quality (rsq < 0.3), low minor allele frequency (MAF < 0.01) or divergence from Hardy-Weinberg equilibrium (P < 1.0 × 10⁻⁶). Consequently, 8,252,518 SNPs and 8,279,620 SNPs were used in the final GWAS and survival GWAS analyses, respectively.
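The three quality-control filters above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`id`, `rsq`, `counts`) and the function names are hypothetical, genotypes are treated as hard calls rather than imputed dosages, and Hardy-Weinberg equilibrium is tested with a 1-d.f. chi-square test, which may differ from the test used in the original analysis.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) test for Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(snp, min_rsq=0.3, min_maf=0.01, min_hwe_p=1.0e-6):
    """Keep a SNP only if it passes the rsq, MAF and HWE thresholds."""
    n_aa, n_ab, n_bb = snp["counts"]
    freq_a = (2 * n_aa + n_ab) / (2 * (n_aa + n_ab + n_bb))
    maf = min(freq_a, 1.0 - freq_a)
    return (snp["rsq"] >= min_rsq
            and maf >= min_maf
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p)

# Hypothetical records: counts are (AA, AB, BB) genotype counts.
snps = [
    {"id": "rsA", "rsq": 0.95, "counts": (500, 400, 100)},  # passes all filters
    {"id": "rsB", "rsq": 0.20, "counts": (500, 400, 100)},  # low imputation quality
    {"id": "rsC", "rsq": 0.90, "counts": (995, 5, 0)},      # MAF below 0.01
    {"id": "rsD", "rsq": 0.90, "counts": (500, 0, 500)},    # strong HWE departure
]
kept = [s["id"] for s in snps if passes_qc(s)]
```

In practice such filters are usually applied by the genotyping or imputation toolchain itself; the sketch only makes the thresholds explicit.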
High-coverage whole-genome sequencing of DNA samples from 94 ESCC patients was performed on the Illumina HiSeq X Ten System (Chang J, et al. 2017). The sequence reads were aligned to the reference genome (hg19) using bwa mem (v0.7.4) with default parameters. Duplicates were removed using Picard. Somatic mutations were called by Strelka. Final somatic calls were filtered to require a Q score > 30 and were further annotated by snpEff and snpSift. The criteria for SNV calling in the other WES/WGS datasets have been described in previous publications (Song Y, et al. 2014; Gao YB, et al. 2014; Zhang L, et al. 2015; Sawada G, et al. 2016; Qin HD, et al. 2016; Cancer Genome Atlas Research. 2017).
We obtained whole-genome sequencing and RNA sequencing data from 94 ESCC samples to perform eQTL analysis. Germline variants were called using Freebayes (v0.9.21-19-gc003c1e). SNPs with low minor allele frequency (MAF < 0.05) or divergence from Hardy-Weinberg equilibrium (P < 1.0 × 10⁻⁶) were then filtered out. After quality control, a total of 6,092,313 SNPs were used in cis-eQTL analysis.
We performed RNA sequencing of both tumor tissues and paired distant normal tissues from 94 ESCC patients. Illumina TruSeq mRNA libraries were prepared according to the manufacturer’s instructions. Total RNA from each sample was extracted and sequenced on an Illumina HiSeq 2000, with a total of 10 Gb of data. RNA reads were aligned to the UCSC human genome release hg19 (Genome Reference Consortium GRCh37) using MapSplice. Gene expression was quantified for the transcript models corresponding to the TCGA GAF2.1 and normalized to a fixed upper quartile of total reads within the sample. Expression values were log2-transformed [log2(RSEM+1)] for further analysis. Gene counts were quantified for 20,531 genes. Genes mapping to sex chromosomes and genes with absent calls in over 90% of the samples were excluded from further analysis, leaving 18,085 genes for cis-eQTL analysis. The RNA sequencing data have not been published yet.
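The normalization and gene-filtering steps described above can be illustrated with a minimal Python sketch. Several details are assumptions rather than facts from the text: the fixed upper-quartile target of 1000, computing the quartile over non-zero values only, treating an expression value of zero as an "absent call", and the chromosome labels `chrX`/`chrY`. All function names are hypothetical.

```python
import math
import statistics

def upper_quartile_normalize(values, target=1000.0):
    """Scale one sample so that its upper quartile equals `target` (assumed value)."""
    nonzero = [v for v in values if v > 0]   # assumption: quartile over expressed genes
    q75 = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return [v * target / q75 for v in values]

def log2_transform(values):
    """log2(x + 1), as in log2(RSEM+1)."""
    return [math.log2(v + 1.0) for v in values]

def keep_gene(chrom, values_across_samples, max_absent_fraction=0.9):
    """Drop sex-chromosome genes and genes absent in over 90% of samples."""
    if chrom in {"chrX", "chrY"}:
        return False
    absent = sum(1 for v in values_across_samples if v == 0)  # assumption: 0 = absent
    return absent / len(values_across_samples) <= max_absent_fraction

# Hypothetical expression values for six genes in one sample.
sample = [0, 100, 200, 300, 400, 500]
normalized = upper_quartile_normalize(sample)
logged = log2_transform(normalized)
```

The order matters: the upper-quartile scaling is applied to the raw estimates within each sample first, and the log2 transform is applied afterwards.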
For each of the selected 8,252,518 SNPs, we carried out association analysis using an additive model in a logistic regression (1-degree-of-freedom) analysis, with age, sex, smoking status and drinking status as covariates. The odds ratios (ORs) are reported for the minor allele of each SNP.
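To make the additive model concrete, the sketch below fits a logistic regression by Newton-Raphson using only the Python standard library, with the genotype coded as the number of minor alleles (0, 1 or 2) so that exp(beta) is the per-allele OR. It is an illustration, not the software used for the GWAS: the function names and toy data are hypothetical, and only one covariate (sex) is included, whereas the analysis also adjusted for age, smoking status and drinking status.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_logistic(rows, labels, iterations=25):
    """Newton-Raphson fit; each row is [1, minor-allele count, covariates...]."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y in zip(rows, labels):
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        step = solve_linear(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Hypothetical toy data: (minor-allele count, cases, controls), repeated for sex 0 and 1.
# Case/control odds are 0.5, 1 and 2, so the per-allele OR is 2 by construction.
rows, labels = [], []
for sex in (0, 1):
    for genotype, n_cases, n_controls in [(0, 20, 40), (1, 20, 20), (2, 10, 5)]:
        rows += [[1.0, float(genotype), float(sex)]] * (n_cases + n_controls)
        labels += [1] * n_cases + [0] * n_controls

beta = fit_logistic(rows, labels)
odds_ratio = math.exp(beta[1])   # per-minor-allele OR, close to 2.0 by construction
```

A 1-degree-of-freedom Wald or likelihood-ratio test on the genotype coefficient would then give the association P value; that step is omitted here for brevity.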
Overall survival times for the 1,006 ESCC patients were calculated from the date of diagnosis to the date of last follow-up or death. For each of the selected 8,279,620 SNPs, P values were calculated with log-rank tests under an additive or dominant model.
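As an illustration of the survival comparison, here is a minimal two-group log-rank test in plain Python, applied under a dominant model (carriers of at least one minor allele versus non-carriers). The function names and toy data are hypothetical, and the additive model, which compares three genotype groups, is not covered by this sketch.

```python
import math

def dominant_group(minor_allele_count):
    """Dominant coding: 1 if the patient carries at least one minor allele."""
    return 1 if minor_allele_count >= 1 else 0

def logrank_test(times, events, groups):
    """Two-group log-rank test; events are 1 (death) or 0 (censored), groups are 0/1."""
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n = len(at_risk)                      # patients still at risk at time t
        n1 = sum(at_risk)                     # of which in group 1
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1.0 - n1 / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square with 1 d.f.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical toy data: carriers die earlier than non-carriers; no censoring.
genotypes = [0, 0, 0, 0, 1, 1, 2, 2]
times = [5, 6, 7, 8, 1, 2, 3, 4]
events = [1] * 8
groups = [dominant_group(g) for g in genotypes]
chi2, p_value = logrank_test(times, events, groups)
```

With real data, censored patients (alive at last follow-up) carry `events = 0` and contribute to the at-risk counts until their last follow-up time.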
We performed cis-eQTL analysis for 6,092,313 SNPs and 18,085 genes separately in tumor tissues and paired distant normal tissues from 94 ESCC patients. Associations between genotype and gene expression in cis were evaluated by testing all SNPs within 100 kb upstream or downstream of the transcription start site (TSS) of a given gene on the same chromosome. The R package MatrixEQTL was used to test a total of 7,676,942 gene-cis-SNP pairs, and significant associations were defined as those with a false discovery rate (FDR) < 0.05.
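The two bookkeeping steps in this paragraph, selecting gene-cis-SNP pairs within the 100 kb window and controlling the FDR, can be sketched as follows. This is not MatrixEQTL itself: the function names, identifiers and coordinates are hypothetical, and the FDR is assumed to be the Benjamini-Hochberg procedure.

```python
def cis_pairs(snps, genes, window=100_000):
    """Return (snp_id, gene_id) pairs with the SNP within `window` bp of the gene's TSS."""
    pairs = []
    for gene_id, (gene_chrom, tss) in genes.items():
        for snp_id, (snp_chrom, pos) in snps.items():
            if snp_chrom == gene_chrom and abs(pos - tss) <= window:
                pairs.append((snp_id, gene_id))
    return pairs

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values (FDR), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical coordinates: only rs1 lies within 100 kb of the TSS on the same chromosome.
snps = {"rs1": ("chr1", 1_050_000), "rs2": ("chr1", 1_200_000), "rs3": ("chr2", 1_000_000)}
genes = {"GENE_A": ("chr1", 1_000_000)}
pairs = cis_pairs(snps, genes)

# Hypothetical P values for four tested pairs.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q < 0.05 for q in qvalues]
```

The nested loop is quadratic and only suitable for illustration; at the scale of millions of SNPs, positions would be sorted or indexed per chromosome before matching.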
We combined our WGS data with WES data from published studies, including 90 ESCC samples from the TCGA database, which increased the sample size to 675 patients with survival information for analyzing the associations between SNVs/indels in protein-coding regions and patient survival. The log-rank test was used to examine differences in survival, with P < 0.05 as the threshold of significance.