Summary: my grep command does not work properly when applied to a VCF, although it worked fine on a dummy test file. It outputs too many records.
I am trying to pull all records with 1000 Genomes AF < 0.5 from a VCF. The VCF is annotated, and for each SNP the AF from 1000 Genomes is stored in the INFO field under the key "controls_AF_popmax".
This is the surrounding area from an entry:
;non_cancer_AF_popmax=0.0004;controls_AF_popmax=0.0003;MCAP13=.;
This is my grep:
zless my_file.vcf.gz | grep -v '^#' | grep ';controls_AF_popmax=0\.0[0-4]\|;controls_AF_popmax=\.;' > output.txt
It is pulling records where the AF value is anywhere between 1 and 2.338e-05, as well as the "." records.
I tried it on a test .txt file and the command worked well:
fadsfad;controls_AF_popmax=0.0003;adsfadsf
dsafdsaf;controls_AF_popmax=.;fadsf
fdasfasd;controls_AF_popmax=0.1;fasdf
where the result is:
fadsfad;controls_AF_popmax=0.0003;adsfadsf
dsafdsaf;controls_AF_popmax=.;fadsf
I don't see the problem here, where does it fail?
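One way to narrow this down might be to print only the text the pattern matched, instead of the whole record. The sketch below does that on a made-up file (demo.txt and its contents are invented for illustration). It uses the extended-regex form of the pattern above, with [^;]* added so the full value is shown; -E and -o are standard grep options.

```shell
# Made-up records for illustration only; not taken from the real VCF.
cat > demo.txt <<'EOF'
a;controls_AF_popmax=0.0003;x
b;controls_AF_popmax=.;x
c;controls_AF_popmax=0.1;x
d;controls_AF_popmax=1;x
e;controls_AF_popmax=2.338e-05;x
EOF

# -o prints only the matched text, so it is visible which value caused the hit.
# The [^;]* part is an addition to the original pattern, to show the whole value.
pattern=';controls_AF_popmax=(0\.0[0-4][^;]*|\.;)'
matched=$(grep -oE "$pattern" demo.txt)
echo "$matched"
```

On this demo input, only the 0.0003 and "." records are matched. The pattern itself rejects 1, and it also rejects 2.338e-05, because a value in scientific notation does not start with "0.0". If such values still appear in output.txt, this suggests (an assumption, not checked on the real file) that the value inspected afterwards is not the text the pattern actually hit.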
When I reintroduce the header and query the file, the controls_AF_popmax values take every possible value from 1 to NA.
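Since matching digits with a regex cannot handle values such as 2.338e-05, a numeric comparison may be more robust. The following is a minimal sketch with awk, not a vetted filter. It assumes the key is always preceded by ";" (as in the excerpt above) and holds either a single number or "." per record. The names key and max_af, and the file demo_af.txt, are hypothetical. The threshold 0.05 mirrors what the pattern 0\.0[0-4] encodes and would need changing if 0.5 is really intended.

```shell
# Made-up records for illustration only; the real input would be the decompressed VCF.
cat > demo_af.txt <<'EOF'
#header line
a;controls_AF_popmax=0.0003;x
b;controls_AF_popmax=.;x
c;controls_AF_popmax=0.1;x
d;controls_AF_popmax=1;x
e;controls_AF_popmax=2.338e-05;x
EOF

# key and max_af are hypothetical names; adjust max_af to the intended cutoff.
kept=$(awk -v key=';controls_AF_popmax=' -v max_af=0.05 '
  /^#/ { next }                              # skip header lines
  {
    start = index($0, key)                   # position of the key, 0 if absent
    if (start == 0) next
    val = substr($0, start + length(key))    # text following the key
    sub(/;.*/, "", val)                      # cut at the next ";"
    # keep missing values and anything numerically below the cutoff
    if (val == "." || val + 0 < max_af + 0) print
  }
' demo_af.txt)
echo "$kept"
```

On the demo input this keeps the 0.0003, "." and 2.338e-05 records and drops 0.1 and 1. For the real file, the same awk program could read from the decompressed stream instead of demo_af.txt. Records with several comma-separated values or non-numeric text other than "." are not handled by this sketch.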
I strongly recommend using existing programs for filtering vcf files, like bcftools or SnpSift.
This command is not functioning properly for me.
We are discussing it in: BCF Tools Filter on 1000Genomes Annotation