Dear Eric, dear all, from the CNVkit documentation:
Typically you would use a properly formatted VCF from joint tumor-normal SNV calling, e.g. the output of MuTect, VarDict, or FreeBayes, having already flagged somatic mutations so they can be skipped in this analysis. If you have no matched normal sample for a given tumor, you can use 1000 Genomes common SNP sites to extract the likely germline SNVs from a tumor-only VCF, and use just those sites with THetA2 (or another tool like PyClone or BubbleTree).
I am currently trying to do what is described above. I have built a .cnn reference from unrelated but age-matched WES samples obtained with the same hybridization-based capture method. I have filtered my VCFs for common dbSNP SNPs with an allele frequency of more than 10% (i.e. very common variants).
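The filtering step can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the population frequency sits in an INFO key such as ExAC_ALL (as in the annotated record shown further down), and the helper names `info_to_dict` and `is_common_snp` as well as the 0.10 cutoff are hypothetical, not part of CNVkit or THetA2.

```python
def info_to_dict(info):
    """Parse a VCF INFO string into a dict; flag-only keys map to True."""
    out = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def is_common_snp(vcf_line, key="ExAC_ALL", min_af=0.10):
    """Return True for a single-base substitution whose population
    frequency stored under `key` is greater than `min_af`."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, info = fields[3], fields[4], fields[7]
    if len(ref) != 1 or len(alt) != 1:
        return False  # skip indels and multi-allelic sites
    value = info_to_dict(info).get(key)
    if not isinstance(value, str):
        return False  # key missing, or present only as a flag
    try:
        return float(value) > min_af
    except ValueError:
        return False  # non-numeric frequency such as "."


# Shortened version of the record from my VCF (assumed tab-separated).
record = ("chr1\t762273\t.\tG\tA\t.\tREJECT\t"
          "DP=367;AF=1;ExAC_ALL=0.8060;ALLELE_END;rs_ids=rs3115849\t"
          "GT:DP\t.:367")
print(is_common_snp(record))
```

Filtering on a population frequency only selects sites that are commonly polymorphic; it does not by itself tell whether a given sample is heterozygous at that site.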
My .cns file looks like this:
chromosome start end gene log2 depth probes weight
chr1 12403 2990008 DDX11L1,WASH7P,FAM138F,MIR4251 -2.49508 14.0935 1681 555.806
chr1 2992142 6337653 PRDM16,MIR4251,ACOT7 -4.62846 7.05581 1631 483.703
My VCF file, without the header, looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Tumorsample1
chr1 762273 . G A . REJECT DP=367;AF=1;DP4=0,0,119,248;SB=0;ANNOVAR_DATE=2018-04-16;ExAC_ALL=0.8060;ExAC_AFR=0.4415;ExAC_AMR=0.8116;ExAC_EAS=0.9174;ExAC_FIN=0.9;ExAC_NFE=0.8384;ExAC_OTH=0.8896;ExAC_SAS=0.8184;Func.refGene=ncRNA_exonic;Gene.refGene=LINC00115;GeneDetail.refGene=.;ExonicFunc.refGene=.;AAChange.refGene=.;cosmic87_coding=.;ALLELE_END;rs_ids=rs3115849 GT:DP:AF:SB:DP4 .:367:1.0:0:0,0,119,248
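Since the record above has no called genotype (GT is "."), one way to check whether a site looks heterozygous is to derive the allele counts from DP4. A minimal sketch, assuming DP4 is ordered ref-forward, ref-reverse, alt-forward, alt-reverse; the names `dp4_counts` and `looks_heterozygous` and the 0.2 to 0.8 window are hypothetical choices for illustration, not a description of what CNVkit does internally.

```python
def dp4_counts(info):
    """Return (ref_count, alt_count) summed over both strands from DP4."""
    for item in info.split(";"):
        if item.startswith("DP4="):
            ref_fwd, ref_rev, alt_fwd, alt_rev = (
                int(x) for x in item[len("DP4="):].split(",")
            )
            return ref_fwd + ref_rev, alt_fwd + alt_rev
    raise ValueError("no DP4 key in INFO")


def looks_heterozygous(ref_count, alt_count, low=0.2, high=0.8):
    """Crude check: alt allele fraction inside an assumed [low, high] window."""
    total = ref_count + alt_count
    if total == 0:
        return False
    return low <= alt_count / total <= high


ref_n, alt_n = dp4_counts("DP=367;AF=1;DP4=0,0,119,248;SB=0")
print(ref_n, alt_n, looks_heterozygous(ref_n, alt_n))  # 0 367 False
```

For the record shown, this gives 0 reference reads and 367 alternate reads, so the site looks homozygous for the alternate allele rather than heterozygous.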
When I run export theta tumorsample1.cns -r ref.cnn -v sample1.vcf, I get the following output:
Wrote sample1theta
Selected test sample sample1
Loaded 44443 records; skipped: 0 somatic, 1648 depth
Kept 44443 heterozygous of 44443 VCF records
/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:868:
FutureWarning: Passing list-likes to .loc or [] with any missing
label will raise KeyError in the future, you can use .reindex() as an
alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_lowerdim(tup)
Wrote sample1.tumor.snp_formatted.txt
Wrote sample1.normal.snp_formatted.txt
Unfortunately, I get an empty normal SNP file (is this expected, or is it an error?):
#Chrm Pos Ref_Allele Mut_Allele
The tumor SNP file has zero in the Mut_Allele column:
#Chrm Pos Ref_Allele Mut_Allele
chr1 762272 367 0
chr1 808921 153 0
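To make the problem concrete, the B-allele fraction implied by each row of the SNP file can be computed as Mut_Allele / (Ref_Allele + Mut_Allele). A small sketch, assuming the whitespace-separated four-column layout shown above; `read_snp_counts` and `baf` are hypothetical helper names, not THetA2 or CNVkit functions.

```python
def read_snp_counts(text):
    """Parse '#Chrm Pos Ref_Allele Mut_Allele' rows into tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        chrom, pos, ref_n, mut_n = line.split()
        rows.append((chrom, int(pos), int(ref_n), int(mut_n)))
    return rows


def baf(ref_n, mut_n):
    """B-allele fraction; None when the site has no coverage at all."""
    total = ref_n + mut_n
    return mut_n / total if total else None


tumor_text = """#Chrm Pos Ref_Allele Mut_Allele
chr1 762272 367 0
chr1 808921 153 0
"""
for chrom, pos, ref_n, mut_n in read_snp_counts(tumor_text):
    print(chrom, pos, baf(ref_n, mut_n))
```

With the two rows shown, both fractions come out as 0.0, although the first site has 367 alternate-supporting reads according to DP4 in the VCF, so the alternate counts do not seem to reach this file.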
The interval file:
#ID chrm start end tumorCount normalCount
start_1_12403:end_1_2990008 1 12403 2990008 298177 3286017
start_1_2992142:end_1_6337653 1 2992142 6337653 65940 1608940
THetA stops prematurely because normalMutCount[i] + normalRefCount[i] is less than min.
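My understanding of that condition, written out as a sketch: a site is only usable if the normal sample's reference plus mutant read count reaches some minimum. The function name `sites_with_enough_normal_coverage` and the threshold value used below are assumptions for illustration; I do not know the value THetA actually uses.

```python
def sites_with_enough_normal_coverage(normal_ref, normal_mut, min_cov):
    """Return indices i where normal_ref[i] + normal_mut[i] >= min_cov."""
    if len(normal_ref) != len(normal_mut):
        raise ValueError("count lists must have the same length")
    return [
        i
        for i, (ref_n, mut_n) in enumerate(zip(normal_ref, normal_mut))
        if ref_n + mut_n >= min_cov
    ]


# An empty normal SNP file yields empty count lists, so no site passes.
print(sites_with_enough_normal_coverage([], [], min_cov=10))  # []
```

So with an empty normal SNP file there is nothing left for the BAF step to work with, whatever the threshold is.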
The common SNPs were extracted, but how would you get the BAF of the normal sample?
Best regards and thanks in advance