Dear fellow cancer researchers,
In the last few years I have been working as a bioinformatician on a PhD project aimed at improving retinoblastoma treatment and diagnosis through genomic characterization of human and murine tumor samples. Although retinoblastoma was my main project, I have been involved in various other cancer genomics experiments, advising on study design and helping with data processing, analysis and interpretation. Currently, I am writing the last parts of my thesis, including a paragraph with some general recommendations for cancer genomics experiments. Because I think these recommendations might serve the Biostars community as well, I would like to share them with you here.
The views expressed in this blog post are based on my personal experience and opinion; therefore, references to scientific articles are omitted.
Multidisciplinary: Cancer genomics should be regarded as a multidisciplinary research field. It requires the active participation of at least a pathologist, a clinician who treats the patient group, a biologist, a statistician and a bioinformatician, each of whom should have at least four years of relevant working experience.
Study design: After the research aims have been defined by the multidisciplinary cancer genomics team, discussion of the study design prior to the execution of any experiment should be a top priority. What sample preparation is required? Which genomics technique is most applicable? How much sequencing data is required? What are the endpoints of quality control? How many samples are required for the experiment to have sufficient power? What are the relevant comparisons? Are technical replicates required? Is it possible to perform paired experiments? Which negative and positive controls are essential for interpreting the results?
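To make the power question concrete, below is a minimal sketch of an a priori sample-size calculation, assuming a two-group comparison of a single continuous readout and using the statsmodels library; the effect size and error rates are placeholders, not recommendations.

```python
# Minimal sketch of an a priori sample-size calculation for a two-group
# comparison of a continuous readout (e.g., expression of one gene).
# The effect size, alpha and power below are placeholders.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.8,  # standardized difference (Cohen's d) you expect to detect
    alpha=0.05,       # type I error rate, before multiple-testing correction
    power=0.8,        # desired probability of detecting the effect
)
print(f"Required samples per group: {n_per_group:.1f}")
```

Keep in mind that genomics experiments test thousands of features simultaneously, so the alpha level must be corrected for multiple testing, which increases the required sample size accordingly.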
Pilot study: In some cases, it can be hard to answer some of the questions related to the study design. For example, the number of sequencing reads required for accurate detection of transcript structures in complex cancer samples depends on many factors, which cannot all be predicted beforehand. Therefore, it can be very useful to perform a series of pilot experiments. For example, instead of sequencing an NGS library at the full estimated depth right away, the library can be sequenced in series, gradually increasing the number of sequencing reads until a plateau in quality is reached. This not only optimizes the cost-quality balance, it also ensures high-quality data acquisition.
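As an illustration of the plateau idea, the sketch below builds a saturation (rarefaction) curve on simulated data: reads are subsampled at increasing depths and the number of distinct transcripts detected is tracked. The abundance model and all numbers are made up for illustration; real input would be per-read transcript assignments from an aligned library.

```python
# Illustrative saturation (rarefaction) curve on simulated data.
import random

random.seed(7)
# Simulated library: each read is assigned to one of 20,000 transcripts,
# with a skewed abundance distribution (a few transcripts dominate).
transcripts = list(range(20_000))
weights = [1.0 / (i + 1) for i in range(20_000)]
reads = random.choices(transcripts, weights=weights, k=1_000_000)

for depth in (10_000, 50_000, 100_000, 500_000, 1_000_000):
    detected = len(set(reads[:depth]))  # distinct transcripts seen at this depth
    print(f"{depth:>9} reads -> {detected:>6} transcripts detected")
# When extra reads stop adding new transcripts, the curve has plateaued and
# further sequencing mainly adds cost, not information.
```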
Between-tumor heterogeneity: Cancer is a heterogeneous disease, where tumors differ not only between cancer types but also within cancer types. Studies that aim to identify common events must take into account that, in an unselected heterogeneous cohort, a large sample size will be required to identify commonalities. Alternatively, selecting samples so as to decrease the diversity of the sample population can drastically increase the chances of finding common traits. Conversely, studies that aim to relate genotypes to phenotypes might benefit from large population diversity and should therefore sample randomly or perform stratified subsampling to ensure sample diversity and increase the study power.
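As a minimal sketch of stratified subsampling, the snippet below draws an equal number of samples per stratum with pandas; the 'subtype' column and all counts are hypothetical.

```python
# Sketch of stratified subsampling: draw an equal number of samples per
# stratum (a hypothetical 'subtype' column) so the selected cohort preserves
# population diversity instead of being dominated by the most common subtype.
import pandas as pd

cohort = pd.DataFrame({
    "sample_id": [f"S{i:03d}" for i in range(300)],
    "subtype": ["A"] * 200 + ["B"] * 70 + ["C"] * 30,  # imbalanced strata
})

# 20 samples per subtype; plain random sampling would mostly pick subtype A.
selected = cohort.groupby("subtype").sample(n=20, random_state=1)
print(selected["subtype"].value_counts())
```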
Within-tumor heterogeneity: In addition to between-tumor variability, significant diversity between tumor cells within the same tumor has been described. Although there is increasing appreciation of within-tumor heterogeneity, many cancer genomics studies still depend on a single sample per tumor, or even per patient. A single sample of a polyclonal tumor in a patient with multiple lesions might therefore be an inadequate basis for either fundamental or clinical conclusions.
Non-tumor cell contamination: In my opinion, one of the most common pitfalls of cancer genomics studies is that non-tumor cell contamination is not reported, or not even assessed. For example, it was recently reported that for whole genome sequencing from 14,000 patients by the International Cancer Genome Consortium, data about non-tumor cell contamination was not available in 92% of the cases. Particularly for DNA analysis, contamination with diploid wild-type non-tumor DNA confounds the identification of tumor variants and should be accounted for. Depending on the research questions, non-tumor cells should either be removed from the sample prior to profiling or be accounted for during the analysis. Of note, there are numerous bioinformatic tools that claim to accurately determine tumor cellularity using copy number and allele frequency data. However, in my belief, they often depend on assumptions about ploidy and heterogeneity that cannot always be guaranteed. The best way to determine non-tumor cell contamination is to evaluate the variant allele frequency of the tumor-initiating mutation, which should be present in all tumor cells and absent from all non-tumor cells. Admittedly, this method also depends on the assumption that tumor evolution is hierarchical, with all tumor cells descended from a single initiating tumor cell and daughter cells having inherited and maintained the initiating mutation.
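To make this concrete, here is a minimal sketch of the VAF-based purity estimate, assuming a clonal mutation present in every tumor cell, a known local tumor copy number and mutation multiplicity, and a diploid, unmutated normal component; the function name and defaults are my own illustration, not a published tool.

```python
# Sketch of estimating tumor purity from the variant allele frequency (VAF)
# of a clonal (tumor-initiating) mutation.

def purity_from_vaf(vaf: float, tumor_cn: int = 2, multiplicity: int = 1) -> float:
    """Fraction of tumor cells implied by the observed VAF.

    Solves VAF = p * m / (p * CNt + 2 * (1 - p)) for purity p, where
    m = mutated copies per tumor cell and CNt = local tumor copy number.
    Normal cells are assumed diploid and unmutated at this locus.
    """
    return 2 * vaf / (multiplicity + vaf * (2 - tumor_cn))

# A heterozygous truncal mutation in a diploid region observed at VAF 0.35
# implies roughly 70% tumor cells (purity = 2 * VAF in this simple case).
print(f"Estimated purity: {purity_from_vaf(0.35):.2f}")
```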
Sample identification: Whatever can go wrong, will go wrong. During the process of collecting tumor samples for genomic profiling, it is not unlikely that samples get swapped accidentally. Therefore, it can be very useful to be able to validate the identity of the individual samples. For example, if independent genetic data is available for the included samples, it can be used to verify sample identities. At the very least, in DNA or RNA studies, copy number or gene expression of the sex chromosomes, respectively, should be correlated with the phenotypic sex. If tumor-normal DNA or RNA profiling is performed, hierarchical clustering of single nucleotide polymorphism (SNP) genotypes can be used to validate that the appropriate sample pairings were used.
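As a sketch of such a pairing check, the snippet below hierarchically clusters SNP genotypes with scipy; correctly paired tumor and normal samples from the same patient should merge first, so a tumor clustering with the wrong normal flags a swap. All genotypes, labels and error rates are simulated.

```python
# Sketch of a sample-identity check: hierarchically cluster SNP genotypes
# (coded 0/1/2 alternate-allele copies) and verify tumor-normal pairings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_snps = 500
patients = ["P1", "P2", "P3"]

samples, labels = [], []
for p in patients:
    germline = rng.integers(0, 3, n_snps)      # patient-specific genotypes
    for kind in ("tumor", "normal"):
        noise = rng.random(n_snps) < 0.02      # ~2% genotyping errors
        geno = np.where(noise, rng.integers(0, 3, n_snps), germline)
        samples.append(geno)
        labels.append(f"{p}_{kind}")

# Samples from the same patient share germline genotypes and merge first.
Z = linkage(np.array(samples), method="average", metric="hamming")
clusters = fcluster(Z, t=len(patients), criterion="maxclust")
for label, c in zip(labels, clusters):
    print(f"{label}: cluster {c}")
```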
Get the best out of the data: Although a genomics approach might be designed to collect a particular kind of data, it can often be used for other purposes as well. For example, SNP arrays were initially used only for SNP genotyping but are now also widely used for copy number determination. Similarly, exome sequencing is primarily performed for SNV/INDEL detection but can also be used for copy number, loss of heterozygosity or even virus quantification analysis. Another example is RNA sequencing, which is mostly used for determining transcript abundance or structure, but can also be used for SNV/INDEL analysis.
Data sharing: There is increasing awareness that sharing high-dimensional genomics data is essential for the cancer genomics field to make sustainable translational contributions. This way, safe long-term data storage is ensured, and independent researchers can reproduce and thereby validate analyses, perform powerful meta-analyses, or use publicly available data to help interpret new data.
Phenotype data: Well-documented, complete and validated phenotype data is absolutely essential for the translational interpretation of genomic data. Considerable time and effort should be devoted to collecting high-quality phenotype data. Most importantly, as with genomics data, sharing phenotype data, with due consideration of patient consent, can greatly enhance the force of cancer genomics in the war on cancer.
As a final remark, I would like to stress that although cancer genomics studies have yielded many insights into the molecular pathways that drive cancer, cancer genomics researchers should temper their optimism about its contribution to the improvement of health care. Surely, next-generation sequencing revolutionized genetics, leading to an exponential increase in sequencing throughput. However, although numerous clinical trials with targeted therapies based on cancer genomics are currently in progress, the revolutionary impact of cancer genomics on long-term survival rates remains to be demonstrated in the coming years. In combination with targeted drug delivery, immune therapy and combinatorial medicine, identification of the tumor-specific molecular essentials will hopefully open up new avenues for the safe and effective control of cancer.
This is a very nice and compact summary, Irsan. I will link to this post in my teaching materials.
Indeed a nice summary. Just one comment regarding your recommendations about determining normal contamination: most tools utilize information from multiple sources (germline SNPs, coverage, somatic mutations, prior knowledge) and are fairly robust against violations of assumptions*. For about 90% of samples passing basic QC metrics, you will get highly concordant (within 5-10%) and correct purity estimates when you run multiple state-of-the-art tools. This is not surprising, since we try to measure somatic events, and finding the fraction of data without these events shouldn't be too hard.
It is true that ploidy inference is more difficult, especially in poly-genomic or low-purity samples. In some cases, for example in tumors with a low SNV mutation rate where pretty much all copy number losses are sub-clonal, this can also lead to wrong purity estimates, but this is rare for most cancer types.
*I wouldn't call them assumptions. Rather, they are different strategies for penalizing complex models (high ploidy and/or heterogeneity) over simpler ones in noisy data. Finding robust and dynamic (purity/noisiness) strategies is IMHO one of the few remaining challenges.
Excellent post! Thanks for sharing!