Question

Finding significant overlap of mutations in a specific cancer tumors and its reprogrammed clones across all somatic mutation data from TCGA across all tumor types

10.0 years ago

ivivek_ngs ★ 5.2k

I would like some advice on an analysis I am planning for my exome data. I have sequenced tumors from two patients with a specific cancer, along with peripheral blood from each patient as a matched control. I have also sequenced tumor iPSCs (we reprogrammed the tumor lines to iPSCs and then sequenced them). We do not have control exome data from normal iPSCs, since we did not reprogram normal fibroblasts to generate them. The somatic variants for the iPSCs are therefore called from the normal peripheral blood / tumor-derived iPSC pair. So for each patient I have four exome-sequenced samples: 1 normal, 1 tumor and 2 iPSC lines.

My idea is to find the mutational landscape that is conserved from the tumor to its reprogrammed clones. We are not considering dosage effects or the passage number at which reprogramming was done, so mutations that confer a selective advantage during reprogramming may well come to dominate an iPSC clone. We know that the tumor is polyclonal and each iPSC line is a single clone, so the iPSC should carry the mutations that are present at the highest frequency among the tumor clones (leaving aside selective advantage and mutations acquired during reprogramming). Still, I expect some tumor mutations to pass to the iPSC and show up there at elevated frequency. I used established variant callers to extract somatic variants from my samples and then checked to what extent these variants are conserved in the tumor iPSCs. The overlap was not very convincing, roughly 44%. Now I want to check these variants against all somatic mutations that I can obtain from TCGA for all tumor types.

I have not worked much with TCGA MAF files, but after reading posts and websites I gather that there is no comprehensive mutation file cataloguing somatic mutations for all cancer types; they are available per cancer type. Since my samples do not come from a large cohort, I would like to see whether the somatic variants I extracted are significantly observed as cancer-related mutations across cancer types, i.e. that I did not obtain them by chance. This would reassure me that the mutational burden of the iPSC, even if not an exact mimic of its tumor, is still relevant and tumorigenic. It would give me a first-hand validation of my variants.

My question is: how do I obtain a mutation file containing somatic mutations across most cancer types, with genomic loci, gene names and read statistics, against which I can interrogate my variant data? Can this be achieved? Or should I take the MAF files for each tumor type separately and interrogate my somatic variants against each? I would appreciate input from people here, and any other ideas as well. Which data should I be consulting? I am fairly sure it should be the MAFs, but I am a bit lost in the TCGA consortium data. Any leads?

Thanks and Regards

VD

sequencing SNP tcga maf • 6.5k views

updated 2.8 years ago by Ram 44k • written 10.0 years ago by ivivek_ngs ★ 5.2k

Ram · Accepted Answer · 2014-11-16

10.0 years ago

Cyriac Kandoth 6.1k

Mutations shared/unique between tumor and matched iPSC: The low number of shared mutations is possibly due to undetected sub-clonal mutations in your tumor that gain a higher cell fraction in your iPSC. If the tumor was sequenced deeply enough, you can use sensitive variant callers like MuTect and LoFreq to tease out variants at low cancer cell fractions (CCF). Note that LoFreq is not a somatic caller, but you can subtract out any calls seen at a sufficiently high variant allele fraction (VAF) in the matched normal. Also run VarScan and Strelka for indels, and for some high-VAF point mutations that MuTect might miss.
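To illustrate the subtraction step for a non-somatic caller, here is a minimal sketch. It assumes you have already parsed the tumor calls and the normal-sample VAFs into plain Python structures; the function name, the variant-key layout and the 0.02 cutoff are all hypothetical choices for illustration, not part of LoFreq or any other tool.

```python
from typing import Dict, Iterable, List, Tuple

# A variant key: (chromosome, 1-based position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def subtract_germline(
    tumor_calls: Iterable[Variant],
    normal_vaf: Dict[Variant, float],
    max_normal_vaf: float = 0.02,  # hypothetical cutoff, tune it to your data
) -> List[Variant]:
    """Keep tumor calls whose VAF in the matched normal is at or below the cutoff."""
    kept = []
    for variant in tumor_calls:
        # A variant never observed in the normal is treated as VAF 0.0
        if normal_vaf.get(variant, 0.0) <= max_normal_vaf:
            kept.append(variant)
    return kept


# Made-up example data
tumor = [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "A")]
normal = {("chr1", 100, "A", "T"): 0.48, ("chr3", 300, "C", "A"): 0.01}
somatic = subtract_germline(tumor, normal)
print(somatic)
```

The first call is dropped because it looks heterozygous in the normal; the other two are kept as candidate somatic variants.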

Matching somatic calls against TCGA: I'd recommend using the merged MAFs and cancer driver gene lists from the Pan-Cancer project. Fetch pancan12_cleaned_filtered.maf from here. However, it is rare for cancer driving mutations to have the exact same genomic locus between samples. It is better to find the genes that each of your mutations alter (try vcf2maf), and then match those to the cancer driver genes predicted by the TCGA. Lists of cancer drivers can be found in tables/supplements from these publications. Mismatched gene names (MLL is now KMT2A) or mismatched amino-acid loci (depending on gene isoform) can cause issues, so try this.
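The gene-level matching described above could be sketched roughly as follows. This assumes a tab-delimited MAF with the standard `Hugo_Symbol`, `Chromosome` and `Start_Position` columns; the driver set, the alias map (beyond MLL/KMT2A) and the example rows are placeholders made up for illustration, so substitute the lists from the publications.

```python
import csv
from typing import Dict, List, Set, Tuple

# Placeholder inputs for illustration only; use the published driver lists.
ALIASES = {"MLL": "KMT2A"}  # old symbol -> current symbol
DRIVER_GENES = {"TP53", "KMT2A", "PIK3CA"}


def driver_hits(
    maf_text: str, drivers: Set[str], aliases: Dict[str, str]
) -> List[Tuple[str, str, int]]:
    """Return (gene, chromosome, start) for MAF rows that fall in a driver gene."""
    # MAF files may begin with comment lines such as "#version 2.4"
    lines = [ln for ln in maf_text.splitlines() if ln and not ln.startswith("#")]
    hits = []
    for row in csv.DictReader(lines, delimiter="\t"):
        # Normalize outdated gene symbols before comparing
        symbol = aliases.get(row["Hugo_Symbol"], row["Hugo_Symbol"])
        if symbol in drivers:
            hits.append((symbol, row["Chromosome"], int(row["Start_Position"])))
    return hits


# Made-up rows; the coordinates are not real mutation calls.
example_maf = (
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\n"
    "MLL\t11\t1000\n"
    "TTN\t2\t2000\n"
    "TP53\t17\t3000\n"
)
hits = driver_hits(example_maf, DRIVER_GENES, ALIASES)
print(hits)
```

Matching on the gene rather than the exact locus is the point here: the MLL row is reported under its current symbol even though no coordinate is compared.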

updated 2.8 years ago by Ram 44k • written 10.0 years ago by Cyriac Kandoth 6.1k

@Cyriac Kandoth

Thanks a lot for the detailed note on the analysis. I have already used VarScan, MuTect and GATK. GATK did not work out the way I expected, since there is a lot of heterogeneity in my tumor sample. I believe the sub-clonal mutations in my tumor samples add a lot of noise, and that these are the mutations gaining a higher cell fraction in the iPSCs, but is there any way to extract them? MuTect found many more mutations than VarScan, but the level of sharing between tumor and iPSCs does not change much. My data is not that deep by the standards of recent exome experiments: the normal and tumor samples are sequenced at 70X and the iPSCs at 35X. That is not very deep, but we expected this coverage to be enough to extract the mutations and the shared mutational events. Now that it looks like sub-clonal mutations are taking precedence, I would like to see whether I can get deeper sequencing done on my samples. I also have just 2 tumors with their matched normals and 2 iPSCs per tumor, which is not a big cohort, and the two tumors are of different grades. I will try the callers you advised, see the effect, and also do the matching against TCGA. Thank you for the suggestion.

Regards,
VD

updated 2.8 years ago by Ram 44k • written 10.0 years ago by ivivek_ngs ★ 5.2k

MuTect/VarScan skip calling mutations with insufficient supporting reads. So for all somatic mutations detected in the iPSC, try using tools like samtools mpileup or bam-readcount to find at least a few reads that support the same variant in the tumor sample. Even one or two reads supporting the same variant can be sufficient evidence, if it is safe to rule it out as germline or a recurrent artifact. You can try fpfilter, a script that runs bam-readcount to collect evidence for or against a given list of variants.
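As a rough sketch of that read-support check for SNVs, assuming bam-readcount's tab-delimited per-position output (chromosome, position, reference base, depth, then one colon-separated block per base starting with `base:count`): the helper names and the `min_alt_reads` threshold below are hypothetical, and the example line is made up and shortened.

```python
from typing import Tuple


def alt_support(readcount_line: str, alt: str) -> Tuple[int, int]:
    """Return (reads supporting `alt`, total depth) from one bam-readcount line."""
    fields = readcount_line.rstrip("\n").split("\t")
    depth = int(fields[3])  # columns: chrom, position, reference base, depth
    for per_base in fields[4:]:
        parts = per_base.split(":")  # base, count, then further metrics
        if parts[0] == alt:
            return int(parts[1]), depth
    return 0, depth


def has_tumor_evidence(readcount_line: str, alt: str, min_alt_reads: int = 2) -> bool:
    """True if the tumor has at least `min_alt_reads` reads carrying `alt`."""
    alt_reads, _ = alt_support(readcount_line, alt)
    return alt_reads >= min_alt_reads


# Shortened, made-up line: real output carries more colon-separated metrics per base.
line = "chr1\t100\tA\t70\t=:0:0.00\tA:68:60.00\tC:0:0.00\tG:0:0.00\tT:2:60.00\tN:0:0.00"
print(alt_support(line, "T"))
print(has_tumor_evidence(line, "T"))
```

You would run this on the tumor BAM's read counts at each position called somatic in the iPSC, and still need to rule out germline variants and recurrent artifacts separately.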

written 10.0 years ago by Cyriac Kandoth 6.1k

@Cyriac Kandoth

I have a problem with the fpfilter output file. I can filter it down to a tab-delimited list of high-confidence SNVs, but I cannot convert it with vcf-annotate. Do I have to provide description and annotation text to get the desired VCF from the fpfilter output? I expected the fpfilter output to convert directly. I am sorry to ask in this thread, but this is the immediate downstream analysis of the same samples, and I could not find any help elsewhere. The command I am using and the error I get are below. Thanks.

cat S_313_T_soma_snvs.fpfilter | /scratch/GT/softwares/vcftools_0.1.12b/bin/vcf-annotate -f FILTER=PASS > S_313_T_soma_snvs.fpfilter.vcf

Error:

perl: warning: Falling back to the standard locale ("C").
Use of qw(...) as parentheses is deprecated at /usr/share/perl5/Vcf.pm line 1622.
Use of uninitialized value $key in exists at /scratch/GT/softwares/vcftools_0.1.12b/bin/vcf-annotate line 259.
Use of uninitialized value $key in exists at /scratch/GT/softwares/vcftools_0.1.12b/bin/vcf-annotate line 259.
Use of uninitialized value $key in concatenation (.) or string at /scratch/GT/softwares/vcftools_0.1.12b/bin/vcf-annotate line 259.
The filter [] not recognised.
 at /scratch/GT/softwares/vcftools_0.1.12b/bin/vcf-annotate line 42
    main::error('The filter [] not recognised.\x{a}') called at /scratch/GT/softwares/vcftools_0.1.12b/bin/vcf-annotate line 259
    main::parse_filters('HASH(0x20ced10)', 'FILTER=PASS') called at /scratch/GT/softwares/vcftools_0.1.12b/bin/vcf-annotate line 120
    main::parse_params() called at /scratch/GT/softwares/vcftools_0.1.12b/bin/vcf-annotate line 32

updated 2.8 years ago by Ram 44k • written 10.0 years ago by ivivek_ngs ★ 5.2k