CNVkit - Format of VCF file
8.7 years ago
fongchunchan ▴ 10

I am at the step of deriving absolute integer copy numbers for each segment, and the documentation states that one can pass in a VCF file of SNPs called in the tumour sample:

cnvkit.py call Sample.cns -y -v Sample.vcf -o Sample.call.cns

This should extract b-allele frequencies and allow for the calculation of major and minor copy number. I am having trouble finding the exact VCF format that CNVkit needs in order for this to work.

I've called SNPs using bcftools, in the tumor and normal separately, and then intersected the positions found in both (using bcftools isec). I then passed in a VCF file of this format:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SampleA
1       926351  .       C       T       11.1    .       DP=2;VDB=0.0106;AF1=1;AC1=2;DP4=0,0,1,1;MQ=60;FQ=-33    GT:PL:GQ        1/1:42,6,0:9
1       1474167 .       A       G       8.65    .       DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-30       GT:PL:GQ        1/1:38,3,0:5

When running I get an error like this:

Skipping 1:926351 C; unsure how to get alternative allele count: CallData(GT=1/1, PL=[42, 6, 0], GQ=9)
Skipping 1:1474167 A; unsure how to get alternative allele count: CallData(GT=1/1, PL=[38, 3, 0], GQ=5)

It seems CNVkit doesn't know how to extract the relevant pieces of information from the VCF file. Does CNVkit accept VCF output from a separate SNP calling tool?

Thanks,

cnvkit • 3.2k views
8.7 years ago
Eric T. ★ 2.8k

Here's the relevant code in CNVkit. The parser checks for FORMAT fields "AD" (the most commonly seen one), "CLCAD2" (a vendor-specific code), or "AO" (I don't remember which caller emits this).

The problem CNVkit has with your VCF file is that the per-sample allele counts are not stored in the sample column: its FORMAT is GT:PL:GQ, which contains none of the fields above. Is the alt allele count stored in the INFO column instead, e.g. the "AC1" field? Did bcftools put it there? If this is a standard thing that other users are likely to have, then I can add a check in CNVkit to extract this field if it's there. Otherwise, could you try copying the relevant INFO field into the sample column as an "AD" or "AO" field?
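To make the failure concrete, here is a minimal sketch of a FORMAT-field lookup of the kind described in this answer. It is not CNVkit's actual code: the function name `alt_allele_count` is hypothetical, and it assumes that "AD" and "CLCAD2" hold comma-separated ref,alt read counts while "AO" holds the alt count only.

```python
def alt_allele_count(fmt, sample):
    """Return the alt read count from one sample column, or None if absent.

    fmt    -- the FORMAT column, e.g. "GT:AD"
    sample -- the matching sample column, e.g. "0/1:12,9"
    Assumption: AD and CLCAD2 are "ref,alt[,alt2...]"; AO is "alt[,alt2...]".
    """
    data = dict(zip(fmt.split(":"), sample.split(":")))
    for key in ("AD", "CLCAD2"):
        if key in data:
            counts = data[key].split(",")
            # Second element is the count for the first ALT allele.
            if len(counts) >= 2 and counts[1].isdigit():
                return int(counts[1])
    if "AO" in data:
        first = data["AO"].split(",")[0]
        if first.isdigit():
            return int(first)
    return None  # nothing usable: the caller would skip this record
```

With the records from the question, `alt_allele_count("GT:PL:GQ", "1/1:42,6,0:9")` finds none of the three keys and returns `None`, which corresponds to the "unsure how to get alternative allele count" message.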


Thanks for the reply.

Based on the VCF header produced by bcftools:

##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">

So it appears that the allele counts are in the INFO DP4 field, comma-separated. This is the direct output of bcftools. I'll try another germline variant caller that writes the allele depths to AD.
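As an alternative to switching callers, the DP4 counts could be copied into the sample column as an AD field. The following is only a sketch under stated assumptions: the VCF has a single sample column, the site is biallelic, and summing the forward and reverse DP4 counts is an acceptable stand-in for allelic depth (DP4 counts high-quality bases only). The names `dp4_to_ad`, `convert`, and `AD_HEADER` are made up for this example.

```python
# Header line declaring the new FORMAT field (Number=. for broad compatibility).
AD_HEADER = ('##FORMAT=<ID=AD,Number=.,Type=Integer,'
             'Description="Allelic depths (ref,alt) derived from INFO/DP4">')


def dp4_to_ad(line):
    """Append AD=ref,alt (summed from INFO/DP4) to a single-sample VCF record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10 or "," in fields[4]:
        return "\t".join(fields)  # not single-sample or not biallelic: leave as-is
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    if "DP4" not in info:
        return "\t".join(fields)
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in info["DP4"].split(","))
    fields[8] += ":AD"
    fields[9] += ":{},{}".format(ref_fwd + ref_rev, alt_fwd + alt_rev)
    return "\t".join(fields)


def convert(lines):
    """Rewrite an iterable of VCF lines, adding the AD header and AD values."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#CHROM"):
            out.append(AD_HEADER)  # FORMAT declaration goes before the column header
            out.append(line)
        elif line.startswith("#"):
            out.append(line)
        else:
            out.append(dp4_to_ad(line))
    return out
```

For the first record in the question (DP4=0,0,1,1), the FORMAT column becomes GT:PL:GQ:AD and the sample column becomes 1/1:42,6,0:9:0,2. Whether CNVkit then handles such low-depth sites sensibly is a separate question.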
