Hi,
To create a multi-sample VCF for a large cohort of WES samples of very different quality, I have to select only high-quality variants genotyped in as many samples as possible.
I figured out that
- long indels have low quality
- substitutions alone do not provide enough variants for my analysis.
I know how to filter out indels using bcftools. Is there a command that filters out only long indels but keeps 1-2 bp insertions/deletions? I feel an AWK command should be very fast, but I don't know how to count the number of characters in the REF/ALT columns of the VCF and print only variants where both REF and ALT are shorter than 3 characters.
I'd appreciate any help; quick googling did not solve the problem.
UPD: My ugly solution based on Ram's comment:
zcat final_all_merged.vcf.gz | grep "^#" > only_short_indels.vcf
zcat final_all_merged.vcf.gz | awk -F'\t' '!/^#/ && length($4) + length($5) < 4' >> only_short_indels.vcf
gzip only_short_indels.vcf
I believe Pierre's solution will also work, I'm just too lazy to install an additional toolkit on the cluster...
UPD1: one-liner
zcat final_all_merged.vcf.gz | awk '($1 ~ /^#/ || length($5) + length($(4)) < 4)' | gzip > only_short_indels.vcf.gz
Google is your friend here. One of the top results when I searched for
nchar awk
is this link: https://stackoverflow.com/questions/16613854/remove-lines-based-on-number-of-characters which shows a function called length, which I think you should be able to use like so:
length($5)-length($4) > 3
(and add in something there to get the absolute value of the difference).
Shame on me, should work indeed!
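For anyone landing here later, the absolute-difference idea from the answer could be sketched roughly as below. This is only a sketch under some assumptions: filter_short_indels and max_len are hypothetical names made up for this example, ALT alleles are assumed to be literal sequences (symbolic alleles such as <DEL> are simply measured by their string length), and a multi-allelic record is dropped as soon as any one of its ALT alleles is too long. Because VCF writes indels with a leading anchor base, a 2 bp deletion looks like REF=ATT, ALT=A, so comparing the length difference rather than the summed lengths is what keeps 2 bp indels.

```shell
# Hypothetical helper: keep header lines plus records whose REF/ALT length
# difference is at most max_len bp for every ALT allele (default: 2).
filter_short_indels() {
    awk -F'\t' -v max_len="${1:-2}" '
        /^#/ { print; next }              # pass header lines through unchanged
        {
            n = split($5, alts, ",")      # multi-allelic sites list ALTs comma-separated
            for (i = 1; i <= n; i++) {
                d = length(alts[i]) - length($4)
                if (d < 0) d = -d         # absolute value of the length difference
                if (d > max_len) next     # drop the record if any allele is a long indel
            }
            print
        }'
}

# Usage (file names taken from the question):
# zcat final_all_merged.vcf.gz | filter_short_indels 2 | gzip > only_short_indels.vcf.gz
```

SNVs have a length difference of 0, so they pass through together with the short indels.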