Dear Biostars community,
Pretty similar question to my last one (C: Filtering a multi-person structural variant VCF file by type of SV) but with a twist!
I have 100 VCF files each a "structural" VCF file which is a per-person VCF file with made up of calls made with Manta and Canvas. The files look like this:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682 . T MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
2 14477084 . C MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
3 9425916 . C MantaINS:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
I want subdivide the files by type of structural variant which are found in the ID column and are made up of: - MantaINS - MantaINV - MantaDEL - MantaBND - MantaDUP - Canvas:REF - Canvas: GAIN - Canvas:LOSS
I.e. for each input vcf come out with 8 subdivided files and in total have 800 output files.
I have run:
for T in MantaDEL MantaINS MantaBND
do
bcftools view -S paths_to_files.txt | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $1,$2,$3;}' > ${T}.txt
done
Which I am sure is wrong as I need to run things in parallel I think or do some scripting.
Your help would be greatly appreciated.
Many thanks
Make a list file with Manta codes:
Then do,
There are thousands more alternative for doing this!
Thanks Reza! when using grep my "your VCF.file" is in fact a text file with the file paths to each file: looking like this (just s sample):
Will grep work in this case?
Many thanks
Omid
I see! I was under impression that you have a multi-call vcf file containing all variants across your samples. In that case you need to generate a multi-call vcf. You can use either vcftools or GATK but I recommend GATK: https://gatkforums.broadinstitute.org/gatk/discussion/53/combining-variants-from-different-files-into-one
The syntax for your purpose would be along these lines:
Thanks again Reza for taking the time to reply. I actually have a multi-callVCF file already but was more interested in looking at SV variation between my cohort so thought generating individual patient VCF files by type of SV would be informative.
Many thanks
Omid
So
grep
should work absolutely fine with yourvcf
. Lets assume your multi-call VCF file is:output.vcf
should do the trick!
Also have a look at SnpEff "Primary variant annotation and quality control": http://snpeff.sourceforge.net/protocol.html