Filtering Gatk Indel Output Length 20+
2
0
Entering edit mode
11.2 years ago
rob234king ▴ 610

I presume most simple way is to write a perl script but thought I would check not missing a filter option rather than go down this route and future use. I've got GATK SNP and INDEL output. Filtered Indels to another vcf but want to filter only Indel insertions of 20 or more to another vcf. Is there an option for this in vcftools or other GATK post filtering software i.e. snpsift? I can't seem to see it.

gatk filter vcftools • 7.7k views
ADD COMMENT
1
Entering edit mode
11.2 years ago
William ★ 5.3k

You can just stick to GATK. With the select variants tool you can create filter expressions to filter on any of the VCF record attributes.

java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fa --variant INDEL.vcf -o INDEL_maxLenght20.vcf -select 'vc.getIndelLengths().0 > -20 && vc.getIndelLengths().0 < 20'
ADD COMMENT
0
Entering edit mode

getIndelLengths().0 returns the first element of the list isn't it ? what happens if there is more than one ALT allele ?

ADD REPLY
0
Entering edit mode

I would be curious about this too if someone has the answer

ADD REPLY
0
Entering edit mode

It works, thanks but I want a min length of 20 not max (could use --maxIndelSize for this too). I'll look at the doc and post what I use for 20 and above indels filtering. changing the "<" around doesn't produce any results even when I go down to 5 even though I know there are indels of 5+ and above.

ADD REPLY
0
Entering edit mode

Got it for insertions only that are larger than 19 and less than 1000 as this: -select 'vc.getIndelLengths().0 > 19 && vc.getIndelLengths().0 < 1000'

ADD REPLY
0
Entering edit mode
11.2 years ago

I wrote a tool filtering a vcf using a javascript expression: https://github.com/lindenb/jvarkit#-filtering-vcf-with-javascript-rhino-

example: filter.js

function accept()
    {
    var lengths=variant.indelLengths;

    if(lengths==null) return false;
    for(var i=0;i< lengths.size();i++)
        {
        var L=lengths.get(i);
        if(L>20 || L<-20) return true;
        }
    return false;    
    }

accept();

. Usage:

$ curl -s "http://homes.gersteinlab.org/people/lh372/VAT/1000genomes_pilot_indels.sample.vcf" | java -jar dist/vcffilterjs.jar SF=filter.js

output:

(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CEU:NA06985 CEU:NA06986 CEU:NA0699
1 802991 . C CAGCCCCCTGTGTACCCCAGGAGGCCAGA 100 PASS AC=4;AF=0.0462118;AN=120;CA=
1 811467 . T TGCCCTTTGGCAGAGCAGGTGTGCTGTGCTG 100 PASS AC=83;AF=0.640232;AN=120;C
1 831920 . A AAACTCAGCTGCCTCTCCCCTTC 100 PASS AC=57;AF=0.375763;AN=120;CA=0;DP=6
1 833078 . C CCTGCCCGGTCCTTCTGACCAGCCGAGAGAGTA 100 PASS AC=48;AF=0.319518;AN=120
1 947830 rs6143083;-/AGTCTGACCTGTGGTCTGACTGT,venter T TTGTAGTCTGACCTGTGGTCTGAC 1
1 987299 rs34515215;-/TCCCTCCCTCCCTCCCTTGTCCCCGT,mills CTCCCTCCCTTGTCCCCGTTCCCTC
1 1034351 . ACCACAGCCAAAAGGTGGGAGCAAGTGTCCAC A 100 PASS AC=79;AF=0.0501418;AN=23
1 1049375 . T TACACACCTGAGCACACACACCTGTGC 100 PASS AC=35;AF=0.0808078;AN=240;CA=
1 1150963 . CGCCACAGACACGGGCCACACACTCCACAT C 100 PASS AC=31;AF=0.111384;AN=358;C
1 1675781 . TCCCTGGGACCGAAGTCGCCCCA T 70.99 PASS AC=95;AF=0.270579;AN=358;CA=0;D
1 1767123 . TACACACACACACACACACACACACACACAC T 100 PASS AC=64;AF=0.205885;AN=358;
1 1859140 mills ACACCAGGTCCACCTCTGGACACAGGTCCACC A 100 PASS AC=124;AF=0.180806;A
1 1870233 . T TCCCTCCCTTCTTTCCTTCCCTTTCCCTC 100 PASS AC=54;AF=0.408049;AN=120;CA
1 1870305 . C CCTTTCCCTCCCTTACTCCTTCCTTCCTT 100 PASS AC=71;AF=0.121733;AN=358;CA
1 1967855 . ACCTTGGCGGGGTCAGGAGCTGCGGCTCCCAGGTGCCCCAG A 62.82 PASS AC=3;AF=0.030
1 1992350 rs58051546;-/TGACCTGGGCGTGTGACCTGGGCGTG,venter C CACCTGGGCGTGTGACCTGGG
1 2246469 . T TTCCCTCCCTTCCTTCCTTCCCTCCC 71.98 PASS AC=18;AF=0.0710328;AN=358;CA
(...)
ADD COMMENT
0
Entering edit mode

Thanks ill check it out in detail Tomorrow. I saw your tool for extracting 0 coverage regions from a genome. I was late night musing on how could use this. trying to think of how to mask regions of high coverage i.e. missplaced reads so second time round for mapping they are mapped to more likely locations? Or could you use this to focus on remapping to these regions single end by stripping reads from these high coverage areas. you use this to attempt to map to by stripping regions with massive coverage which is incorrect mapping.

ADD REPLY
0
Entering edit mode

Hi! I just was looking for this, and since is the top result in google search I found it useful to answer with the current GATK version (4.2.6.1).

What worked for me is this commands:

gatk SelectVariants \
     -V input.vcf \
     --select-type-to-include INDEL \
     --min-indel-size 20 \
     -O output.vcf

Those, and more, arguments can be found in: https://gatk.broadinstitute.org/hc/en-us/articles/5358856605339-SelectVariants#--variant

I used the docker image for GATK. I think it works fine and saves a lot of time, this is the tutorial I followed: https://gatk.broadinstitute.org/hc/en-us/articles/360035889991--How-to-Run-GATK-in-a-Docker-container#article-comments

Quick disclaimer: I'm new to bioinformatics so maybe there is some other way to do this filter using other tools, but I found GATK to be the most easy and straightforward.

Hope it helps!

ADD REPLY

Login before adding your answer.

Traffic: 1822 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6