Hello, I have a VCF file where copy number variations are listed in this format:
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of this structural variant">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=TCN,Number=1,Type=Integer,Description="Total copy number">
##FORMAT=<ID=MCN,Number=1,Type=Integer,Description="Minor allele copy number">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOUR
1 564620 . A <CNV> . . SVTYPE=CNV;END=232864203 GT:TCN:MCN ./.:2:1 ./.:2:1
1 232864349 . G <CNV> . . SVTYPE=CNV;END=232917630 GT:TCN:MCN ./.:2:1 ./.:3:1
1 232917822 . A <CNV> . . SVTYPE=CNV;END=249198692 GT:TCN:MCN ./.:2:1 ./.:2:1
(I included only relevant fields)
I need to convert this format to the PCAWG-11 Calibration format, which looks like this:
chromosome start end copy_number minor_cn major_cn cellular_prevalence
1 640305 239120876 2 1 1 0.94
2 59261869 91121847 0 0 0 0.88
I was thinking about writing a converter myself, but I seem to be missing some information (I have little to no bioinformatics experience). In particular:
- where do I find the start value in the VCF file? Is it the pos column?
- where do I find the major_cn value in the VCF file? From what I see, only the minor_cn information is obtainable
- how can I calculate the cellular_prevalence field? If I'm right, one should be able to calculate it somehow
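To make the question concrete, here is a rough sketch of the converter I have in mind. It is only a guess at the mapping, not a verified one, and it assumes that start is the POS column, end is the END key in INFO, copy_number is TCN, and major_cn can be obtained as TCN minus MCN (total = major + minor). I don't know how to derive cellular_prevalence from these fields, so the sketch just writes a placeholder. The function name vcf_cnv_to_pcawg and its parameters are my own invention.

```python
def vcf_cnv_to_pcawg(lines, sample="TUMOUR", prevalence="NA"):
    """Yield (chromosome, start, end, copy_number, minor_cn, major_cn,
    cellular_prevalence) tuples for one sample column of a CNV VCF.

    Assumptions (not confirmed): start = POS, major_cn = TCN - MCN,
    and cellular_prevalence is unknown, so a placeholder is emitted.
    """
    sample_idx = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        # Real VCF is tab-separated; split() also handles the
        # space-separated excerpt shown in this post.
        fields = line.split()
        if line.startswith("#CHROM"):
            sample_idx = fields.index(sample)  # locate the sample column
            continue
        if sample_idx is None:
            raise ValueError("data line seen before the #CHROM header")

        chrom, pos, info, fmt = fields[0], int(fields[1]), fields[7], fields[8]
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if info_map.get("SVTYPE") != "CNV":
            continue  # only CNV records are converted

        # Pair FORMAT keys (GT:TCN:MCN) with the sample's values.
        values = dict(zip(fmt.split(":"), fields[sample_idx].split(":")))
        tcn, mcn = int(values["TCN"]), int(values["MCN"])

        # POS is used directly as start; for symbolic alleles such as <CNV>
        # it may denote the base before the event, so POS + 1 could be the
        # right choice depending on the convention the target format expects.
        yield (chrom, pos, int(info_map["END"]), tcn, mcn, tcn - mcn, prevalence)
```

Under these assumptions, the second record above would become 1 232864349 232917630 3 1 2 NA for the TUMOUR column.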
Also, it would be great if you could point me to an existing converter, to spare me the pain of coding it from scratch. I tried googling for converters but didn't find anything useful.
Thank you for your replies.