Filtering a multi-person structural variant VCF file by type of SV
1
1
Entering edit mode
5.2 years ago
tacrolimus ▴ 140

Hello all,

First post so apologies for any etiquette breaches.

I have a large cohort of over 1000 patients who have had WGS done and then Canvas/Manta structural variant caller applied to their data. I have then merged these into a single VCF file.

In the ID (3rd column) I have the types of call made by one of the two alogrithms e.g. MantaDEL, MantaINV, CanvasGAIN etc etc. After each piece of text there is a string of number representing the graph plot associated with how that call was generated e.g. MantaDel0:1:3:00 which is unique to that call.

I am trying to create individual VCFs of each unique type of call. However, if I run the following code to see what the unique SV types are:

bcftools query -f '%ID/n' SV.vcf.gz

I unsurprisingly get a massive list of each "ID" with the unique number string after it when all I want is the first part, as in to know what are the types of SV call so I can then use bcftools to create individual file types for each.

I hope this makes sense and apologies if this has been covered elsewhere.

Many thanks

bcftools vcf SV • 2.1k views
ADD COMMENT
0
Entering edit mode

just curious: how did you merge those SV calls ?

ADD REPLY
0
Entering edit mode

I used bcftools to merge the files.

ADD REPLY
0
Entering edit mode

it won't work. The very "same" CNVs can be localized at the same place but with a few difference of number of bases.

eg:

chr1 100 . DEL . . END=1000
chr1 101 . DEL . . END=999
ADD REPLY
0
Entering edit mode

I was hoping once I had created the filed to merge calls based on similar distances. I was going to use Plink for the CNV analysis and potentially a few other tools. What would your approach be?

All the best

Omid

ADD REPLY
0
Entering edit mode

when all I want is the first part, as in to know what are the types of SV call

an example is needed (input / output )

ADD REPLY
0
Entering edit mode

Input would be a standard vcf file with the ID column being made up of calls similar to this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9

2 321682 . T MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12

2 14477084 . C MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12

3 9425916 . C MantaINS:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15

And the output in this case would be three VCFs for "MantaINS" "MantaBND" and "MantaDEL".

I was planning on generating a list of all the IDs and then using Grep e.g. grep "MantaINS*" but was wondering if it could be done in one line from BCFtools.

Many thanks

ADD REPLY
4
Entering edit mode
5.2 years ago
for T in  MantaDEL MantaINS MantaBND
do
   bcftools view input.vcf  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $1,$2,$3;}'  > ${T}.txt
done
ADD COMMENT
1
Entering edit mode

Dear Pierre,

Many thanks for this, it worked delightfully well (I changed to "print $0" as I wanted the whole line). I have upvoted and accepted your answer (is that the right thing to do?, let me know if I should celebrate your helpfulness any further)

All the best

Omid

ADD REPLY

Login before adding your answer.

Traffic: 1662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6