The Chinese Cancer Genomic Database-Esophageal Squamous Cell Carcinoma (CCGD-ESCC) is a comprehensive database for searching the associations of single nucleotide polymorphisms (SNPs) with the risk of ESCC (genome-wide association studies, GWASs), patients’ survival (Survival GWAS) or gene expression (expression quantitative trait loci, eQTLs). All the information is provided by the laboratory of Dr. Dongxin Lin and Dr. Chen Wu at the National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences. CCGD-ESCC also includes the associations between somatic mutations (single nucleotide variants and indels, SNVs/indels) and survival time in 675 patients, a combined sample from 7 published ESCC studies that used whole-exome sequencing (WES) or whole-genome sequencing (WGS).
ESCC remains the predominant type of esophageal carcinoma worldwide, especially in economically developing countries. Approximately half of the world’s 500,000 new ESCC cases each year occur in China, where ESCC ranks as the fourth most common cause of cancer-related death and kills ~250,000 people per year. ESCC is hard to detect at an early stage, and most patients already have advanced disease when diagnosed. As a result, the long-term outcome of this malignancy is dismal, with 5-year survival rates of around 30%.
The etiology of ESCC remains unclear, and epidemiological studies suggest that tobacco smoking, heavy alcohol drinking, micronutrient deficiency and dietary carcinogen exposure may cause this malignancy. However, only a portion of exposed individuals develop ESCC, indicating that an individual’s genetic makeup may also play an important role in esophageal carcinogenesis. GWAS is a powerful method for interrogating genome-wide variants associated with complex human diseases, and many genetic susceptibility loci for various types of human cancer have been identified with this approach.
Although surgery, chemotherapy and radiotherapy are frequently used to treat ESCC, the long-term outcome of this cancer is still dismal, with 5-year survival rates of around 30%. Some demographic and clinicopathological characteristics of patients, such as age, sex, history of alcohol consumption and tobacco smoking, tumor stage and lymph node metastasis, have been observed to affect the length of survival in ESCC to some extent. Nevertheless, these clinical and lifestyle features only partially explain the great heterogeneity in survival times among affected individuals. Studies have suggested that germline genetic variability can provide important prognostic information for those with cancer.
eQTLs are regions of the genome containing SNPs that influence the expression level of one or more genes. They help researchers understand why risk variants confer susceptibility to certain diseases. The mapping and positional cloning of an eQTL may reveal an expression regulator such as a transcription factor or a small regulatory RNA. In addition, studies have shown that genetic variants reproducibly associated with complex diseases or phenotypes are significantly enriched for eQTLs. eQTL analysis is therefore a useful way to uncover the gene regulatory networks behind phenotypes, providing important insights into the underlying genetic mechanisms of complex traits.
There are currently no specific molecule-targeting agents for ESCC treatment. Several WES or WGS studies of ESCC in Chinese and Japanese populations have been published recently. These studies reported an extremely high frequency of TP53 mutations and low-prevalence but statistically significant SNVs in several other genes, including CDKN2A, NOTCH1, RB1 and PIK3CA. We have comprehensively characterized the genomic landscape of ESCC, which may identify potential targets to guide the development of precision treatment and prevention of this malignancy. However, the associations between somatic mutations (SNVs/indels) and patients’ survival have not yet been systematically explored.
CCGD-ESCC integrates the results of a GWAS of 2,022 ESCC cases and 2,039 controls, a survival GWAS of 1,006 ESCC patients, an eQTL analysis of 94 ESCC patients, and SNVs/indels in protein-coding regions from WES or WGS together with their associations with survival in 675 ESCC patients. The enrollment criteria for study subjects have been described in previous publications (Wu C, et al. 2011, 2012, 2013; Chang J, et al. 2017). Informed consent was obtained from all participants, and this study was approved by the Institutional Review Board of the Chinese Academy of Medical Sciences, Cancer Hospital.
Germline genotype data of 2,022 cases and 2,039 controls were obtained using the Affymetrix GeneChip Human Mapping 6.0 set (Affymetrix). To increase power and coverage for association analyses, we used MACH software to impute untyped SNPs across the whole genome with the 1000 Genomes Phase 3 ASN (Asian) panel as the reference. We then filtered out SNPs with low imputation quality (rsq < 0.3), low minor allele frequency (MAF < 0.01) or divergence from Hardy-Weinberg equilibrium (P < 1.0 × 10⁻⁶). Consequently, 8,252,518 SNPs and 8,279,620 SNPs were used in the final GWAS and survival GWAS analyses, respectively.
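The three SNP filters above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the database: the function names are hypothetical, genotypes are treated as hard counts rather than imputed dosages, and the Hardy-Weinberg P value is taken from a 1-degree-of-freedom chi-square goodness-of-fit test, which is an assumption because the source does not state which HWE test was applied.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    Assumption: a chi-square goodness-of-fit test; the source does not
    specify which HWE test was used.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:               # monomorphic SNP: test is undefined
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(rsq, n_aa, n_ab, n_bb, min_rsq=0.3, min_maf=0.01, min_hwe_p=1e-6):
    """Keep a SNP only if it passes all three thresholds quoted in the text."""
    n = n_aa + n_ab + n_bb
    freq_a = (2 * n_aa + n_ab) / (2 * n)
    maf = min(freq_a, 1.0 - freq_a)
    return (rsq >= min_rsq
            and maf >= min_maf
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p)
```

For example, `passes_qc(0.9, 250, 500, 250)` keeps a well-imputed common SNP in perfect equilibrium, whereas a SNP with no heterozygotes at all, such as `passes_qc(0.9, 500, 0, 500)`, is rejected by the HWE filter.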
High-coverage whole-genome sequencing of DNA samples from 94 ESCC patients was performed on the Illumina HiSeq X Ten System (Chang J, et al. 2017). The sequence reads were aligned to the reference genome (hg19) using bwa mem (v0.7.4) with default parameters. Duplicates were removed using Picard. Somatic mutations were called by Strelka. Final somatic calls were filtered to require a Q score of >30 and were further annotated by snpEff and snpSift. The criteria for SNV calling in the other WES/WGS studies have been described in previous publications (Song Y, et al. 2014; Gao YB, et al. 2014; Zhang L, et al. 2015; Sawada G, et al. 2016; Qin HD, et al. 2016; Cancer Genome Atlas Research Network, 2017).
We obtained whole-genome sequencing and RNA sequencing data from 94 ESCC samples to perform eQTL analysis. Germline variants were called using Freebayes (v0.9.21-19-gc003c1e). SNPs were then filtered out for low minor allele frequency (MAF < 0.05) or divergence from Hardy-Weinberg equilibrium (P < 1.0 × 10⁻⁶). After quality control, a total of 6,092,313 SNPs were used in cis-eQTL analysis.
We performed RNA sequencing of both tumor tissues and paired distant normal tissues from 94 ESCC patients. Illumina TruSeq mRNA libraries were prepared according to the manufacturer’s instructions. Total RNA was extracted from each sample and sequenced on the Illumina HiSeq2000 with a total of 10 Gb of data. RNA reads were aligned to the UCSC human genome release version hg19 (Genome Reference Consortium GRCh37) using MapSplice. Gene expression was quantified for the transcript models corresponding to the TCGA GAF2.1 and normalized to a fixed upper quartile of total reads within the sample. Expression values were log2-transformed [log2(RSEM+1)] for further analysis. Gene counts were quantified for 20,531 genes. Of these, genes mapping to sex chromosomes and genes with an absent call in more than 90% of the samples were excluded from further analysis, leaving 18,085 genes for cis-eQTL analysis.
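The log2 transformation and gene filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, an expression value of exactly 0 is treated as an "absent call" (the source does not define the term), and sex chromosomes are recognized by the labels shown in the code.

```python
import math

SEX_CHROMOSOMES = {"chrX", "chrY", "X", "Y"}  # assumed chromosome labels

def log2_transform(rsem_values):
    """Apply log2(RSEM + 1) to one gene's normalized expression values."""
    return [math.log2(v + 1.0) for v in rsem_values]

def keep_gene(chrom, rsem_values, max_absent_fraction=0.9):
    """Drop sex-chromosome genes and genes absent in more than 90% of samples.

    Assumption: a value of exactly 0 counts as an absent call.
    """
    if chrom in SEX_CHROMOSOMES:
        return False
    absent = sum(1 for v in rsem_values if v == 0)
    return absent / len(rsem_values) <= max_absent_fraction
```

A gene that is absent in exactly 90% of samples is retained here, because the text excludes only genes absent in "over 90%" of samples.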
For each of the selected 8,252,518 SNPs, we carried out association analysis using an additive model in a logistic regression (1-degree-of-freedom) analysis with age, sex, smoking status and drinking status as covariates. The odds ratios (ORs) are reported for the minor allele of each SNP.
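To make the additive model concrete, the following Python sketch fits a logistic regression of case status on the minor-allele count (0, 1 or 2) by Newton-Raphson and returns the per-allele OR with a Wald P value. It is a simplified illustration, not the analysis code behind the database: the function name is hypothetical and, for brevity, the covariates (age, sex, smoking and drinking status) are omitted, so the estimate here is unadjusted.

```python
import math

def additive_logistic_or(minor_allele_counts, case_status, n_iter=25):
    """Fit logit P(case) = b0 + b1 * allele_count; return (OR, SE of log OR, P).

    Simplification: no covariates are included in this sketch.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0                  # score (gradient of log-likelihood)
        h00 = h01 = h11 = 0.0          # Fisher information matrix entries
        for x, y in zip(minor_allele_counts, case_status):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information * delta = score)
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    se = math.sqrt(h00 / det)          # standard error of b1
    z = b1 / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald test
    return math.exp(b1), se, p_value
```

Because the genotype enters as a single numeric term, the model spends one degree of freedom on the SNP, and `exp(b1)` is the OR per additional copy of the minor allele.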
Overall survival times for the 1,006 ESCC patients were calculated from the date of diagnosis to the date of last follow-up or death. For each of the selected 8,279,620 SNPs, P values were assessed using an additive or dominant model in log-rank tests.
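As an illustration of the log-rank comparison under a dominant model, the Python sketch below groups patients into carriers and non-carriers of the minor allele and computes a two-group log-rank statistic. The function names are hypothetical, and the additive (trend) version of the test is not shown; this is a sketch of the standard test rather than the code used for the database.

```python
import math

def dominant_groups(minor_allele_counts):
    """Dominant coding: 1 for carriers of at least one minor allele, else 0."""
    return [1 if g > 0 else 0 for g in minor_allele_counts]

def logrank_test(times, events, groups):
    """Two-group log-rank test; events: 1 = death, 0 = censored.

    Returns (chi-square statistic with 1 df, P value).
    """
    data = sorted(zip(times, events, groups))
    n = len(data)                      # patients at risk, both groups
    n1 = sum(groups)                   # patients at risk in group 1
    o_minus_e = 0.0
    variance = 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        d = d1 = removed = removed1 = 0
        while i < len(data) and data[i][0] == t:   # handle tied times
            _, event, group = data[i]
            d += event
            d1 += event * group
            removed += 1
            removed1 += group
            i += 1
        if d > 0:
            o_minus_e += d1 - d * n1 / n           # observed minus expected
            if n > 1:
                variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        n -= removed
        n1 -= removed1
    if variance == 0:
        return 0.0, 1.0
    chi2 = o_minus_e ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

In use, `logrank_test(times, events, dominant_groups(genotypes))` would be called once per SNP, with `times` measured from diagnosis to death or last follow-up as described above.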
We performed cis-eQTL analysis for 6,092,313 SNPs and 18,085 genes separately in tumor tissues and paired distant normal tissues from 94 ESCC patients. The associations between genotype and gene expression in cis were evaluated by testing all SNPs on the same chromosome within 100 kb upstream or downstream of the transcription start site (TSS) of a given gene. The R package MatrixEQTL was used to test a total of 7,676,942 gene-cis-SNP pairs, and significant associations were defined as those with a false discovery rate (FDR) < 0.05.
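The two selection rules in this step, the 100 kb cis window and the FDR threshold, can be sketched in Python as follows. This does not reproduce MatrixEQTL itself: the function names are hypothetical, and the FDR is computed with the Benjamini-Hochberg procedure, which is assumed here to match how the reported FDR values were obtained.

```python
def in_cis_window(snp_chrom, snp_pos, gene_chrom, tss, window=100_000):
    """True if the SNP lies on the gene's chromosome within 100 kb of its TSS."""
    return snp_chrom == gene_chrom and abs(snp_pos - tss) <= window

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Assumption: BH is the FDR procedure behind the reported values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_pairs(pairs, pvalues, fdr_threshold=0.05):
    """Keep the gene-SNP pairs whose adjusted P value is below the threshold."""
    adjusted = benjamini_hochberg(pvalues)
    return [pair for pair, q in zip(pairs, adjusted) if q < fdr_threshold]
```

Only pairs that pass `in_cis_window` would be tested in the first place, so the multiple-testing correction is applied over the cis pairs rather than over every possible gene-SNP combination.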
We combined our WGS data with WES data obtained from published studies, including 90 ESCC samples from the TCGA database, to increase the sample size to 675 patients with survival information. These data were used to analyze the associations between SNVs/indels in protein-coding regions and patients’ survival. The log-rank test was used to examine differences in survival, with P < 0.05 as the threshold of significance.
R software (v3.2.4) was used for general statistical analysis.