Question

Filtering multiple single person structural variant VCF files by type of SV

0

Entering edit mode

5.2 years ago

tacrolimus ▴ 140

Dear Biostars community,

Pretty similar question to my last one (C: Filtering a multi-person structural variant VCF file by type of SV) but with a twist!

I have 100 VCF files each a "structural" VCF file which is a per-person VCF file with made up of calls made with Manta and Canvas. The files look like this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9

2 321682 . T MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12

2 14477084 . C MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12

3 9425916 . C MantaINS:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15

I want subdivide the files by type of structural variant which are found in the ID column and are made up of: - MantaINS - MantaINV - MantaDEL - MantaBND - MantaDUP - Canvas:REF - Canvas: GAIN - Canvas:LOSS

I.e. for each input vcf come out with 8 subdivided files and in total have 800 output files.

I have run:

for T in  MantaDEL MantaINS MantaBND
do
   bcftools view -S paths_to_files.txt  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $1,$2,$3;}'  > ${T}.txt
done

Which I am sure is wrong as I need to run things in parallel I think or do some scripting.

Your help would be greatly appreciated.

Many thanks

structural variation bcftools vcf unix linux • 2.8k views

ADD COMMENT • link 5.2 years ago by tacrolimus ▴ 140

1

Entering edit mode

Make a list file with Manta codes:

echo "MantaINS" >> SV.list
echo "MantaINV" >> SV.list
echo "MantaDEL" >> SV.list
echo "MantaBND" >> SV.list
echo "MantaDUP" >> SV.list
echo "Canvas:REF" >> SV.list
...

Then do,

 cat SV.list | while read i do;
         grep $i your VCF.file > ${i}.sub.vcf
done

There are thousands more alternative for doing this!

ADD REPLY • link 5.2 years ago by reza.jabal ▴ 580

0

Entering edit mode

Thanks Reza! when using grep my "your VCF.file" is in fact a text file with the file paths to each file: looking like this (just s sample):

genomes/by_date/2015-09-03/batch1/patient30/patient30.SV.vcf.gz
genomes/by_date/2016-03-05/batch1/patient4/patient4.SV.vcf.gz
genomes/by_date/2018-10-14/batch1/patient16/patient16.SV.vcf.gz
genomes/by_date/2018-012-28/batch1/patient100/patient100.SV.vcf.gz

genomes/by_date/2018-03-14/batch1/patient1/patient1.SV.vcf.gz

Will grep work in this case?

Many thanks

Omid

ADD REPLY • link 5.1 years ago by tacrolimus ▴ 140

1

Entering edit mode

I see! I was under impression that you have a multi-call vcf file containing all variants across your samples. In that case you need to generate a multi-call vcf. You can use either vcftools or GATK but I recommend GATK: https://gatkforums.broadinstitute.org/gatk/discussion/53/combining-variants-from-different-files-into-one

The syntax for your purpose would be along these lines:

 java -jar GenomeAnalysisTK.jar \
   -T CombineVariants \
   -R reference.fasta \
   --variant genomes/by_date/2015-09-03/batch1/patient30/patient30.SV.vcf.gz \
   --variant genomes/by_date/2016-03-05/batch1/patient4/patient4.SV.vcf.gz \
   --variant genomes/by_date/2018-10-14/batch1/patient16/patient16.SV.vcf.gz \
   --variant genomes/by_date/2018-012-28/batch1/patient100/patient100.SV.vcf.gz \
   --variant genomes/by_date/2018-03-14/batch1/patient1/patient1.SV.vcf.gz \
   .
   .
   -o output.vcf \
   -genotypeMergeOptions UNIQUIFY

ADD REPLY • link 5.1 years ago by reza.jabal ▴ 580

0

Entering edit mode

Thanks again Reza for taking the time to reply. I actually have a multi-callVCF file already but was more interested in looking at SV variation between my cohort so thought generating individual patient VCF files by type of SV would be informative.

Many thanks

Omid

ADD REPLY • link 5.1 years ago by tacrolimus ▴ 140

1

Entering edit mode

So grep should work absolutely fine with your vcf. Lets assume your multi-call VCF file is: output.vcf

cat SV.list | while read i do;
         grep $i output.vcf > ${i}.sub.vcf
done

should do the trick!

Also have a look at SnpEff "Primary variant annotation and quality control": http://snpeff.sourceforge.net/protocol.html

ADD REPLY • link 5.1 years ago by reza.jabal ▴ 580