Filtering of low coverage variants from VCF
8.7 years ago

I would like to use GATK to obtain the distribution of variants, but some of my calls have low support; for example, I do not trust a call that was found in only 2 reads. I could use a simple coverage-based filter (say, a depth above 20 is good and anything lower is bad), but I am sure more effective strategies exist (ones that take into account the base qualities in the supporting reads, etc.). Could you tell me how to filter low-quality variants?

It is not for variant calling, so I do not care about impact and (I guess) cannot use SnpSift. I do not need really high accuracy; I just want my genome-wide distribution of variants to be free of noise.

I know this looks like a newbie question, and it is: I am completely new to variant calling.

vcf • 9.0k views

Hello. How, and with which scripts, can I apply the following filters to a file that contains all the variants of a genome? Please explain in detail.

Filters: variants with phred-scaled quality scores below 20; variants with genotype quality (GQ) below 20; SNPs within 5 bp of an indel; indels within 10 bp of each other; variants with a depth of coverage below 33% of, or more than twice, the mean genome coverage of the alignment.
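
As a rough illustration of the first and last of these filters, here is a minimal sketch that uses only awk. It assumes an uncompressed VCF whose INFO column carries a per-site DP value, and that the mean genome coverage is already known; the function name filter_qual_depth and its arguments are made up for this example. The GQ and proximity filters are not covered here; for the proximity filters, the --SnpGap and --IndelGap options of bcftools filter are worth a look.

```shell
# Hypothetical helper: keep header lines, plus records whose QUAL is at least
# MIN_QUAL and whose INFO/DP lies within [0.33 * MEAN_DP, 2 * MEAN_DP].
# Usage: filter_qual_depth MIN_QUAL MEAN_DP < in.vcf > out.vcf
filter_qual_depth() {
  awk -F'\t' -v min_qual="$1" -v mean_dp="$2" '
    /^#/ { print; next }                   # pass header lines through unchanged
    {
      if ($6 == "." || $6 + 0 < min_qual + 0) next   # missing or low QUAL
      dp = -1
      n = split($8, info, ";")             # INFO is a ";"-separated list
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^DP=/) dp = substr(info[i], 4) + 0
      }
      if (dp < 0) next                     # no DP annotation: drop (an assumption)
      if (dp < 0.33 * mean_dp || dp > 2 * mean_dp) next
      print
    }
  '
}
```

With a mean coverage of 30x, `filter_qual_depth 20 30 < in.vcf > out.vcf` would keep sites with QUAL of at least 20 and a depth between roughly 10 and 60.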


Please explain in detail

NO. This is not an appropriate way to ask for help: you are demanding help, which makes no one want to give you their time. Also, you've added your question as an answer to an 8-year-old question. Why did you do that? Did you familiarize yourself with the etiquette of the forum before creating your post, or did you just add your post with no regard for the proper way to do anything here?

I'm moving your post to a comment for the moment.

8.7 years ago

VCFtools (available here) allows filtering on a variety of user-defined variables (e.g., read depth, quality, allele frequency).


Thank you. What thresholds would you recommend for an average coverage of 30x? Which parameters are most important? As a mathematician, I would use only QUAL (a likelihood) as the measure, even without depth; is that OK?
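
For reference, QUAL is phred-scaled: QUAL = -10 * log10(P), where P is the probability that the ALT call is wrong, so a threshold on QUAL is a threshold on that probability. A minimal sketch of the conversion (qual_to_prob is a hypothetical helper name, not part of any tool):

```shell
# Hypothetical helper: convert a phred-scaled QUAL into an error probability.
qual_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

qual_to_prob 20   # prints 0.0100
qual_to_prob 30   # prints 0.0010
```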


The correct parameters are dependent upon your sample (species? haploid/diploid? individual/pooled?) and your experimental objective (is sensitivity more/less important than specificity? are the variants germ line or somatic?). Also, is there a way to independently validate the results, to fine-tune the parameters?

GATK offers guidelines for best practices using truth sets (here) or hard filters (here).

Personally, I use FreeBayes instead of GATK, followed by Heng Li's strategy of filtering low-complexity and high-coverage regions (here), then a depth of three independent reads (after removing duplicates), and finally filter against a blacklist if available. For my application, it strikes the appropriate balance between sensitivity and specificity. YMMV.


Dear all, I am having a similar problem determining appropriate filtering criteria for SNP depth and quality. I am just a beginner at using vcftools for SNP call filtering. How do I translate the hard filters below into vcftools options?

QualByDepth (QD) 2.0
FisherStrand (FS) 60.0
RMSMappingQuality (MQ) 40.0
MappingQualityRankSumTest (MQRankSum) -12.5
ReadPosRankSumTest (ReadPosRankSum) -8.0
StrandOddsRatio (SOR) 3.0

Are these standard filter criteria for any VCF file? If I am to use vcftools, how do I specify the flag options?
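
These six values match the thresholds GATK suggests for hard-filtering SNPs, where a record fails if QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, ReadPosRankSum < -8.0 or SOR > 3.0. They rely on INFO annotations written by GATK, so they do not apply to just any VCF. The sketch below shows the logic with plain awk rather than vcftools; hard_filter is a hypothetical name, the input is assumed to be an uncompressed VCF, and annotations missing from a record are simply not tested.

```shell
# Hypothetical helper: drop records that fail any of the listed thresholds.
# Annotations absent from a record are not tested (an assumption of this sketch).
# Usage: hard_filter < in.vcf > out.vcf
hard_filter() {
  awk -F'\t' '
    /^#/ { print; next }                   # pass header lines through unchanged
    {
      fail = 0
      n = split($8, info, ";")             # INFO is a ";"-separated list
      for (i = 1; i <= n; i++) {
        eq = index(info[i], "=")
        if (eq == 0) continue              # flag-type entry with no value
        key = substr(info[i], 1, eq - 1)
        val = substr(info[i], eq + 1) + 0
        if (key == "QD" && val < 2.0) fail = 1
        if (key == "FS" && val > 60.0) fail = 1
        if (key == "MQ" && val < 40.0) fail = 1
        if (key == "MQRankSum" && val < -12.5) fail = 1
        if (key == "ReadPosRankSum" && val < -8.0) fail = 1
        if (key == "SOR" && val > 3.0) fail = 1
      }
      if (!fail) print
    }
  '
}
```

Dropping records outright keeps the sketch short; tools such as GATK VariantFiltration instead label failing records in the FILTER column, which makes it easier to revisit the thresholds later.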

8.7 years ago
skbrimer ▴ 740

If you are using GATK, you might be using Picard Tools as well. If so, it has a "FilterVcf" tool that will do what you want. You can find the documentation here; it looks like setting MIN_DP=20 will do what you want.


Yes, thanks, but that's not what I want to do. I would like to find a strategy for removing low-quality variants, and depth of coverage seems too crude to be really effective. For example, a site with 40 reads supporting the reference allele and 5 supporting the alternative will pass this filter, but combined with low base qualities for the alternative allele it may well be a false positive.
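
The 40-versus-5 case can be caught by looking at the fraction of reads that support the alternative allele (5 / 45, about 0.11). A minimal sketch, assuming an uncompressed, single-sample, biallelic VCF whose FORMAT column includes AD (per-allele read counts, as GATK writes); filter_alt_fraction and the 0.2 cutoff mentioned below are illustrative choices, not recommendations.

```shell
# Hypothetical helper: keep records whose alternative-allele read fraction,
# taken from FORMAT/AD of the first sample, is at least MIN_FRACTION.
# Usage: filter_alt_fraction MIN_FRACTION < in.vcf > out.vcf
filter_alt_fraction() {
  awk -F'\t' -v min_frac="$1" '
    /^#/ { print; next }                   # pass header lines through unchanged
    {
      ad_idx = 0
      n = split($9, fmt, ":")              # locate AD within the FORMAT keys
      for (i = 1; i <= n; i++) if (fmt[i] == "AD") ad_idx = i
      if (ad_idx == 0) next                # no AD field: drop (an assumption)
      split($10, sample, ":")
      if (split(sample[ad_idx], ad, ",") < 2) next   # need ref and alt counts
      total = ad[1] + ad[2]
      if (total > 0 && ad[2] / total >= min_frac + 0) print
    }
  '
}
```

With `filter_alt_fraction 0.2`, a site with AD of 40,5 would be dropped while one with 20,18 would be kept. This does not look at base qualities, so it is only one ingredient of a stricter filter.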


Using GATK, you can use the SelectVariants tool like this:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -select "QUAL > 10"

Or whichever threshold you want.


Thank you. The problem is that I do not know which threshold I want =( Also, I have a huge amount of data and would like to filter variants on the fly (so that unreliable results are never written to the VCF file produced by the first GATK launch). I know I am asking for too much, but it would save a lot of memory and time, especially the rewriting from input.vcf to output.vcf.


I am also interested in finding such a strategy. Were you able to find one?
