Question

Filtering long indels from VCF

0

Entering edit mode

3.6 years ago

German.M.Demidov ★ 3.0k

Hi,

to create a multi-sample VCF in a large cohort of WES samples of very different quality I have to select only high-quality variants genotyped in as many samples as possible.

I figured out that

long indels have low quality
only substitutions do not provide enough variants for my analysis.

I know how to filter out indels using bcftools - is there a command that may filter out long indels only, but remain 1-2bp inserts/deletions? I feel some AWK command should be very fast, but I don't know how to count number of chars in columns ALT/REF of the VCF and how to print only variants where both ALT/REF variants are shorter than 3 symbols.

Appreciate any help, quick googling did not solve the problem.

UPD: My ugly solution based on Ram's comment:

zcat final_all_merged.vcf.gz | grep "#" > only_short_indels.vcf
zcat final_all_merged.vcf.gz | awk 'length($5) + length($(4)) < 4' >> only_short_indels.vcf
gzip only_short_indels.vcf

I believe Pierre's solution will also work, just too lazy to install additional toolkit on cluster...

UPD1: one liner

zcat final_all_merged.vcf.gz | awk '($1 ~ /^#/ || length($5) + length($(4)) < 4)' | gzip > only_short_indels.vcf.gz

VCF • 3.8k views

ADD COMMENT • link updated 20 months ago by bari.ballew ▴ 480 • written 3.6 years ago by German.M.Demidov ★ 3.0k

1

Entering edit mode

Google is your friend here. One of the top results when I searched for nchar awk is this link: https://stackoverflow.com/questions/16613854/remove-lines-based-on-number-of-characters which shows a function called length, which I think you should be able to use like so: length($5)-length($4) > 3 (and add in something there to get the absolute value of the difference).

ADD REPLY • link 3.6 years ago by Ram 45k

0

Entering edit mode

shame on me, should work indeed!

ADD REPLY • link 3.6 years ago by German.M.Demidov ★ 3.0k

score 2 · Answer 1 · 2023-08-10

A bit late to the party, but I have a pretty succinct way to do this using bcftools. If you want only short indels, but no snps, try this:

bcftools view -i '(ILEN >= -5 && ILEN <= 5)' -Oz -o only_short_indels.vcf.gz final_all_merged.vcf.gz

And if you want short indels and snps, this:

bcftools view -i '(ILEN >= -5 && ILEN <= 5) || TYPE!="INDEL"' -Oz -o short_indels_and_snps.vcf.gz final_all_merged.vcf.gz

ILEN will count up the length of the indel for you: positive numbers are for insertions, and negative are deletions. The commands above will keep indels from +5 to -5 bases in length. TYPE is a builtin that will determine whether a variant is an indel, snp, etc. on-the-fly, so you can use the OR logic to include variants that aren't indels (i.e. are snps) using the second command. Note that otherwise, ILEN explicitly excludes anything that's not an indel. Happy hunting!

score 1 · Answer 2 · 2021-09-13

1

Entering edit mode

3.6 years ago

Pierre Lindenbaum 166k

how to print only variants where both ALT/REF variants are shorter than 3 symbols.

using http://lindenb.github.io/jvarkit/VcfFilterJdk.html

java -jar dist/vcffilterjdk.jar -e 'return variant.getAlleles().stream().filter(A->!A.isSymbolic()).allMatch(A->A.length()<4&& A.length()>0);'

ADD COMMENT • link 3.6 years ago by Pierre Lindenbaum 166k