Using THETA2 from CNVkit: what kind of VCF do you need?
7.6 years ago
gil.hornung ▴ 100

Hi,

I want to use the "export theta" functionality of CNVkit to estimate tumor purity with the THETA2 program.

Part of the input for THETA2 are files with SNP counts for the Tumor and Normal samples formatted like:

#Chrm   Pos     Ref_Allele      Mut_Allele
10      104427  74      1
10      111955  54      0
10      135656  0       94

To the best of my understanding, these should be the germline variants in the Tumor and Normal samples, because they are used to estimate the B-allele frequency (BAF).
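To make the link between these counts and the BAF concrete, here is a minimal Python sketch. It assumes the third and fourth columns are read counts supporting the reference and alternate alleles, and it takes BAF as alt / (ref + alt). The function names are made up for illustration and are not part of CNVkit or THETA2.

```python
import io


def read_snp_counts(handle):
    """Yield (chrom, pos, ref_count, alt_count) from a SNP count table.

    Assumes whitespace-separated columns and a header line starting with '#'.
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, pos, ref, alt = line.split()[:4]
        yield chrom, int(pos), int(ref), int(alt)


def b_allele_freq(ref_count, alt_count):
    """Fraction of reads supporting the alternate allele, or None at zero depth."""
    total = ref_count + alt_count
    return alt_count / total if total else None


# The three rows shown in the question.
example = """#Chrm   Pos     Ref_Allele      Mut_Allele
10      104427  74      1
10      111955  54      0
10      135656  0       94
"""

bafs = [
    (chrom, pos, b_allele_freq(ref, alt))
    for chrom, pos, ref, alt in read_snp_counts(io.StringIO(example))
]
for chrom, pos, baf in bafs:
    print(chrom, pos, baf)
```

All three example rows have a BAF close to 0 or 1, so they look homozygous. Sites that are heterozygous in the normal sample are the ones where a shift in the tumor BAF carries information about allelic imbalance.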

According to the CNVkit manual, cnvkit export theta accepts a VCF file:

cnvkit.py export theta Sample_T.cns reference.cnn -v Sample_Paired.vcf

However, it is unclear what kind of VCF this is. If the germline mutations are important, then the VCF output of programs such as MuTect2 is not appropriate, because they are geared towards somatic mutations and discard the germline mutations. Should I use the output of HaplotypeCaller? But then how is Sample_Paired.vcf organised? And furthermore, should I filter the VCF to include only PASS mutations?

Am I missing out on something?

Thank you,

Gil

cnvkit theta2 tumor purity • 3.8k views

Have you seen the details mentioned here:

http://cnvkit.readthedocs.io/en/latest/fileformats.html

And here,

https://github.com/samtools/hts-specs

Let us know what issues you faced if you have seen these pages and if they did not work.

7.5 years ago
Eric T. ★ 2.8k

CNVkit's VCF processing works best with GATK HaplotypeCaller or FreeBayes on the tumor-normal pair, with both samples shown and somatic variant records marked with SOMATIC in the INFO column.
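As a rough way to check whether a VCF has this shape, the sketch below reads the sample names from the #CHROM header line and counts records that carry a SOMATIC flag in INFO. It is plain text parsing written for illustration, not CNVkit's own code, and the sample names and records in the example are invented.

```python
def summarize_paired_vcf(lines):
    """Return (sample_names, n_somatic, n_other) for an iterable of VCF lines."""
    samples, n_somatic, n_other = [], 0, 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            continue
        # INFO is column 8; entries are KEY or KEY=VALUE separated by ';'
        info_keys = {item.split("=", 1)[0] for item in fields[7].split(";")}
        if "SOMATIC" in info_keys:
            n_somatic += 1
        else:
            n_other += 1
    return samples, n_somatic, n_other


# Invented two-sample example: one germline record, one somatic record.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample_N\tSample_T",
    "10\t104427\t.\tA\tG\t60\tPASS\tDP=80\tGT:AD:DP\t0/1:20,18:38\t0/1:30,12:42",
    "10\t111955\t.\tC\tT\t45\tPASS\tDP=70;SOMATIC\tGT:AD:DP\t0/0:35,0:35\t0/1:25,10:35",
]

samples, n_somatic, n_other = summarize_paired_vcf(example_vcf)
print(samples, n_somatic, n_other)
```

If the sample list does not contain both the tumor and the normal, the VCF is probably a single-sample call set rather than a paired one.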


Thank you Eric and sridhar56. If possible, can you provide a small VCF that follows the proper specs as an example? It would make things much clearer.

Gil


Yes, here's an example VCF included in CNVkit's test suite: https://raw.githubusercontent.com/etal/cnvkit/master/test/formats/na12878_na12882_mix.vcf


Just as a reference, here is the GATK command I used to extract high-quality heterozygous SNPs from HaplotypeCaller output:

java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-T SelectVariants \
-V haplotype_caller.vcf \
-o het.vcf \
--excludeFiltered \
--selectTypeToInclude SNP \
--restrictAllelesTo BIALLELIC \
-select '(vc.getGenotype("normal_name").isHet())&&(vc.getGenotype("normal_name").getAD().1>={params.min_normal_ALT})&&(vc.getGenotype("tumor_name").getDP()>{params.min_tumor_DP})'
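
For readers without GATK at hand, the same selection logic can be sketched in plain Python. This is an illustrative approximation of the filters in the command above (unfiltered, biallelic SNP, heterozygous in the normal, minimum normal ALT depth, minimum tumor depth). It is not GATK's or CNVkit's implementation, and the thresholds used in the example are placeholders chosen for the demo.

```python
BASES = set("ACGT")


def parse_sample(format_keys, sample_field):
    """Map FORMAT keys (e.g. GT:AD:DP) to one sample's values."""
    return dict(zip(format_keys.split(":"), sample_field.split(":")))


def is_het(gt):
    """True for a called diploid genotype with two different alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def keep_record(fields, samples, normal, tumor, min_normal_alt, min_tumor_dp):
    """Apply the filters to one tab-split VCF record."""
    ref, alt, flt = fields[3], fields[4], fields[6]
    if flt not in ("PASS", "."):
        return False  # like --excludeFiltered
    if ref not in BASES or alt not in BASES:
        return False  # keep biallelic SNPs only
    try:
        norm = parse_sample(fields[8], fields[9 + samples.index(normal)])
        tum = parse_sample(fields[8], fields[9 + samples.index(tumor)])
        normal_alt_depth = int(norm["AD"].split(",")[1])
        tumor_depth = int(tum["DP"])
    except (KeyError, IndexError, ValueError):
        return False  # missing or malformed GT/AD/DP values
    return (
        is_het(norm.get("GT", "."))
        and normal_alt_depth >= min_normal_alt
        and tumor_depth > min_tumor_dp
    )


# Invented record using the sample names from the command above.
samples = ["normal_name", "tumor_name"]
record = "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/1:30,25:55\t0/1:40,10:50"
print(keep_record(record.split("\t"), samples, "normal_name", "tumor_name", 5, 20))
```

The comparisons mirror the JEXL expression: ALT depth in the normal uses >= and total depth in the tumor uses a strict >.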

Yes, this is similar to what CNVkit does internally when you give it a VCF.


Does this command line:

cnvkit.py export theta Sample_T.cns reference.cnn -v Sample_Paired.vcf

now support the VCF output from Mutect2? An unfiltered output VCF file from Mutect2 should also contain germline variants (as indicated by "germline_risk" in the FILTER column).

Maybe I'm wrong?


You'd think so, but Mutect also tends to filter the germline variants even when you tell it not to, leaving relatively few SNPs that CNVkit can use for the BAF calculation. It's better to use HaplotypeCaller to get comprehensive germline SNP calls.
