Remove line with awk in vcf.gz
Hello everyone.
I have a vcf.gz file and I want to filter out the lines that contain "." in column 4 or column 5.
So far I have this, which prints those lines:
zcat file.vcf.gz | grep -v "#" | awk '$4=="." || $5=="."'
However, I don't know how to delete them and save the result as a new file in vcf.gz format.
Thank you for your help.
SNP
Don't use grep/awk/... to filter VCF files. Instead, use programs that are specialized for this, like bcftools.
Columns 4 and 5 are the REF and ALT columns, so you want to exclude all rows that have no value there:
$ bcftools view -e "REF=='.'||ALT=='.'" -O z -o output.vcf.gz input.vcf.gz
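One way to sanity-check the result of a filter like this, using only standard tools, is to count the data records that still have "." in column 4 or 5. This is just a sketch: the file name check_demo.vcf.gz and its records are made up for illustration, gzip stands in for bgzip, and gzip -dc is used as a portable equivalent of zcat.

```shell
# Made-up demo file that mimics an already-filtered VCF (gzip stands in for bgzip).
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n'
  printf 'chr1\t200\t.\tC\tT\t.\t.\t.\n'
} | gzip -c > check_demo.vcf.gz

# Count non-header records whose REF ($4) or ALT ($5) is still ".".
# After a successful filter this count should be 0.
remaining=$(gzip -dc check_demo.vcf.gz |
  awk -F'\t' '!/^#/ && ($4 == "." || $5 == ".") {n++} END {print n+0}')
echo "records with missing REF/ALT: $remaining"
```

On a real file, replace check_demo.vcf.gz with the filtered output; a non-zero count means some records slipped through.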
zcat file.vcf.gz | awk '$1 ~ /^#/ {print $0;next} {if ($4 == "." || $5 == "." ) print }' | bgzip > new.vcf.gz
This will print all entries where $4 or $5 is "." to a new compressed VCF file (bgzip for compression), preserving the header lines starting with #.
$1 ~ /^#/ {print $0;next} essentially means that if the line starts with #, then print it (to preserve header lines) and move on to the next line without running the second rule.
{if ($4 == "." || $5 == "." ) print } tests whether $4 or $5 is "." and prints the entire row if true.
If you wanted entries with no "." in either of the columns, the condition would be ($4 != "." && $5 != ".").
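As a rough sketch of that inverted filter, here is the same idea run on a tiny uncompressed file so the effect is easy to see. The file names demo.vcf and demo.filtered.vcf and the three records are made up for illustration, and -F'\t' is an added assumption that the input is tab-delimited, as VCF normally is.

```shell
# Build a tiny made-up VCF: two header lines and three records.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n'   # REF and ALT present: kept
  printf 'chr1\t200\t.\tC\t.\t.\t.\t.\n'   # ALT is ".": dropped
  printf 'chr1\t300\t.\t.\tT\t.\t.\t.\n'   # REF is ".": dropped
} > demo.vcf

# Rule 1: print header lines unchanged and skip to the next line.
# Rule 2: print a record only when both REF ($4) and ALT ($5) are not ".".
awk -F'\t' '/^#/ {print; next} $4 != "." && $5 != "."' demo.vcf > demo.filtered.vcf

cat demo.filtered.vcf
```

For a compressed file, the same awk program would sit between zcat and bgzip, as in the command above.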
Edit: I agree with finswimmer that specialized tools such as bcftools are preferred, to avoid any possible file corruption.
Thank you very much for your answer. Yes, my objective was to delete the lines containing ".", so I will use ($4 != "." && $5 != ".") instead.