Filtering long indels from VCF
2
0
Entering edit mode
3.2 years ago

Hi,

to create a multi-sample VCF in a large cohort of WES samples of very different quality I have to select only high-quality variants genotyped in as many samples as possible.

I figured out that

  1. long indels have low quality
  2. only substitutions do not provide enough variants for my analysis.

I know how to filter out indels using bcftools - is there a command that may filter out long indels only, but remain 1-2bp inserts/deletions? I feel some AWK command should be very fast, but I don't know how to count number of chars in columns ALT/REF of the VCF and how to print only variants where both ALT/REF variants are shorter than 3 symbols.

Appreciate any help, quick googling did not solve the problem.

UPD: My ugly solution based on Ram's comment:

zcat final_all_merged.vcf.gz | grep "#" > only_short_indels.vcf
zcat final_all_merged.vcf.gz | awk 'length($5) + length($(4)) < 4' >> only_short_indels.vcf
gzip only_short_indels.vcf

I believe Pierre's solution will also work, just too lazy to install additional toolkit on cluster...

UPD1: one liner

zcat final_all_merged.vcf.gz | awk '($1 ~ /^#/ || length($5) + length($(4)) < 4)' | gzip > only_short_indels.vcf.gz
VCF • 3.1k views
ADD COMMENT
1
Entering edit mode

Google is your friend here. One of the top results when I searched for nchar awk is this link: https://stackoverflow.com/questions/16613854/remove-lines-based-on-number-of-characters which shows a function called length, which I think you should be able to use like so: length($5)-length($4) > 3 (and add in something there to get the absolute value of the difference).

ADD REPLY
0
Entering edit mode

shame on me, should work indeed!

ADD REPLY
2
Entering edit mode
15 months ago
bari.ballew ▴ 470

A bit late to the party, but I have a pretty succinct way to do this using bcftools. If you want only short indels, but no snps, try this:

bcftools view -i '(ILEN >= -5 && ILEN <= 5)' -Oz -o only_short_indels.vcf.gz final_all_merged.vcf.gz

And if you want short indels and snps, this:

bcftools view -i '(ILEN >= -5 && ILEN <= 5) || TYPE!="INDEL"' -Oz -o short_indels_and_snps.vcf.gz final_all_merged.vcf.gz

ILEN will count up the length of the indel for you: positive numbers are for insertions, and negative are deletions. The commands above will keep indels from +5 to -5 bases in length. TYPE is a builtin that will determine whether a variant is an indel, snp, etc. on-the-fly, so you can use the OR logic to include variants that aren't indels (i.e. are snps) using the second command. Note that otherwise, ILEN explicitly excludes anything that's not an indel. Happy hunting!

ADD COMMENT
1
Entering edit mode
3.2 years ago

how to print only variants where both ALT/REF variants are shorter than 3 symbols.

using http://lindenb.github.io/jvarkit/VcfFilterJdk.html

java -jar dist/vcffilterjdk.jar -e 'return variant.getAlleles().stream().filter(A->!A.isSymbolic()).allMatch(A->A.length()<4&& A.length()>0);'
ADD COMMENT

Login before adding your answer.

Traffic: 1491 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6