Provided the formatting is consistent all the way through the VCF or BCF, I would avoid tools like bcftools for filtering and just use AWK:
Here is an example:
cat test.vcf
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,1 0/0:29,4
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,3 0/0:29,4
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,5 0/0:29,4
Greater than 3
cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] > 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,5 0/0:29,4
Less than 3
cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] < 3) print $0}'
##fileformat=VCFv4.0
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,1 0/0:29,4
Greater than or equal to 3
cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] >= 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,3 0/0:29,4
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,5 0/0:29,4
In these examples, I use AWK's split
function in order to isolate the AD from column 10 (FORMAT), and then the second part of AD. Then it's just a simple if
statement. The first part of the AWK command will ensure that the VCF header is output.
Kevin
Hi Kevin, thanks for replying (+1).
Actually, I would argue the opposite. I try to avoid parsing VCF records as raw strings as they are quite complex and it's easy to overlook some details. Since bcftools seems well tried and tested I'd use that instead.