Hello all,
First post so apologies for any etiquette breaches.
I have a large cohort of over 1000 patients who have had WGS done and then Canvas/Manta structural variant caller applied to their data. I have then merged these into a single VCF file.
In the ID (3rd column) I have the types of call made by one of the two alogrithms e.g. MantaDEL, MantaINV, CanvasGAIN etc etc. After each piece of text there is a string of number representing the graph plot associated with how that call was generated e.g. MantaDel0:1:3:00 which is unique to that call.
I am trying to create individual VCFs of each unique type of call. However, if I run the following code to see what the unique SV types are:
bcftools query -f '%ID/n' SV.vcf.gz
I unsurprisingly get a massive list of each "ID" with the unique number string after it when all I want is the first part, as in to know what are the types of SV call so I can then use bcftools to create individual file types for each.
I hope this makes sense and apologies if this has been covered elsewhere.
Many thanks
just curious: how did you merge those SV calls ?
I used bcftools to merge the files.
it won't work. The very "same" CNVs can be localized at the same place but with a few difference of number of bases.
eg:
I was hoping once I had created the filed to merge calls based on similar distances. I was going to use Plink for the CNV analysis and potentially a few other tools. What would your approach be?
All the best
Omid
an example is needed (input / output )
Input would be a standard vcf file with the ID column being made up of calls similar to this:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682 . T MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
2 14477084 . C MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
3 9425916 . C MantaINS:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
And the output in this case would be three VCFs for "MantaINS" "MantaBND" and "MantaDEL".
I was planning on generating a list of all the IDs and then using Grep e.g. grep "MantaINS*" but was wondering if it could be done in one line from BCFtools.
Many thanks