I have VCF variant files. Can anyone provide a list of tools for variant filtering? Thank you.
GATK has a tool, "SelectVariants", that offers some standard filter options and lets you create filter expressions based on the attributes in the VCF records: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html
You can also use SnpSift to build filter expressions on the standard VCF attributes and on the ones added by SnpEff effect prediction: http://snpeff.sourceforge.net/SnpSift.html
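If it helps to see what a filter expression on VCF attributes boils down to, here is a rough Python sketch. It is not how GATK or SnpSift implement their expressions; the function names `parse_info` and `select_records` are made up for illustration, and it assumes a plain tab-delimited, uncompressed VCF.

```python
def parse_info(info_field):
    """Turn 'DP=14;AF=0.5;DB' into {'DP': '14', 'AF': '0.5', 'DB': True}."""
    info = {}
    if info_field == ".":
        return info  # '.' means the INFO column is empty
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flag entries such as 'DB' have no '=' and are stored as True.
        info[key] = value if sep else True
    return info


def select_records(lines, predicate):
    """Yield header lines unchanged, plus records for which predicate(record) is true.

    'record' is a hypothetical dict with CHROM, POS, QUAL, FILTER and INFO keys.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # skip malformed records
        record = {
            "CHROM": fields[0],
            "POS": int(fields[1]),
            "QUAL": None if fields[5] == "." else float(fields[5]),
            "FILTER": fields[6],
            "INFO": parse_info(fields[7]),
        }
        if predicate(record):
            yield line
```

For example, passing `lambda r: r["QUAL"] is not None and r["QUAL"] >= 30 and int(r["INFO"].get("DP", 0)) >= 10` as the predicate keeps records with QUAL of at least 30 and an INFO depth of at least 10. For real work the dedicated tools are the safer choice, since they know the header-declared types of each attribute.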
The Bioconductor VariantAnnotation package: http://www.bioconductor.org/packages/2.12/bioc/html/VariantAnnotation.html
Its VCF objects are basically sample-subsettable GRanges, which is a very clever implementation.
If you want a quick and highly customizable way to filter VCFs, try Perl one-liners.
perl -lne 'print $_ if ($_ =~ /0\/1/)' < my_vcf_file.vcf > filtered_vcf_file.vcf
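will get you all the variants where the genotype has been called as "0/1". In English, this one-liner says "print the line if the line contains the string 0/1". Note that the match is against the whole line, not just the genotype columns, so header lines are dropped unless they happen to contain that string.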
perl -lane 'print $F[5] if ($_ !~ /^#/)' < my_vcf_file.vcf > QUAL_scores.txt
will get you a list of all the QUAL scores. In English, this says "print the value in the sixth column if the line does not start with a # character".
My favorite perl one-liner guide is here. A one-liner is no replacement for a proper filtering script, but for getting a sense of the distribution of your data there's nothing better.
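Once a one-liner grows into a script, the same two ideas might look something like the Python sketch below. This is only an illustration under a few assumptions: the function names are made up, the input is an uncompressed tab-delimited VCF with a FORMAT column, and only the exact genotype string is matched (so a phased "0|1" would not count as "0/1").

```python
def has_genotype(fields, wanted="0/1"):
    """Return True if any sample column carries the wanted GT value."""
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return False
    gt_index = format_keys.index("GT")
    for sample in fields[9:]:
        parts = sample.split(":")
        if gt_index < len(parts) and parts[gt_index] == wanted:
            return True
    return False


def filter_vcf_lines(lines, wanted="0/1", min_qual=None):
    """Yield header lines unchanged, plus records that pass both checks."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line  # keep the header so the output is still a VCF
            continue
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no genotype columns to inspect
        if min_qual is not None:
            qual = fields[5]  # QUAL is the sixth column
            if qual == "." or float(qual) < min_qual:
                continue
        if has_genotype(fields, wanted):
            yield line
```

You could feed it an open file, for instance `for line in filter_vcf_lines(open("my_vcf_file.vcf"), "0/1", 30): print(line)`. Unlike the regex version, it only looks at the GT subfield of the sample columns, so a stray "0/1" elsewhere in the line does not count as a match.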
While it imports your VCF into a database first, our GEMINI software is specifically designed to allow filtering of variants in VCF files based on genome annotations and sample genotypes.
See the "Gemini: Integrative Exploration Of Genetic Variation And Genome Annotations" thread. Also, please see the documentation.
An example of a GEMINI query filtering variants based on allele frequency and functional impact:
$ gemini query -q "select * from variants \
where is_lof = 1 \
and aaf >= 0.01" my.db
Extend this to further filter based on sample Thelonius being a heterozygote:
$ gemini query -q "select * from variants \
where is_lof = 1 \
and aaf >= 0.01" \
--gt-filter "gt_types.Thelonius == HET" \
my.db
Hi,
You can also look at PLINK/SEQ, the extension of PLINK that handles VCF files: http://atgu.mgh.harvard.edu/plinkseq/overview.shtml
Best
Filter using javascript: https://github.com/lindenb/jvarkit#-filtering-vcf-with-javascript-rhino-
/** prints a VARIATION if two samples at least
have a DP<200 */
function myfilterFunction()
{
var samples=header.genotypeSamples;
var countOkDp=0;
for(var i=0; i< samples.size();++i)
{
var sampleName=samples.get(i);
if(! variant.hasGenotype(sampleName)) continue;
var genotype = variant.genotypes.get(sampleName);
if( ! genotype.hasDP()) continue;
var dp= genotype.getDP();
if(dp < 200 ) countOkDp++;
}
return (countOkDp >= 2); /* "at least two samples", as the comment above states */
}
myfilterFunction();
$ gunzip -c file.vcf.gz |\
java -jar dist/vcffilterjs.jar SCRIPT_FILE=filter.js
The SnpSift package has a "SnpSift filter" operation that is quite powerful and performant.
What about vcftools? http://vcftools.sourceforge.net/
Tony: For a general question like this, you should first go through similar questions on Biostar. You can easily find them using the search button. Only post a question if you don't find a good or satisfying answer. Thanks.