Hello,
I have been reading a few papers that cite FreeBayes, because I am curious about what people treat as real variants and what they discard as artifacts.
I use Illumina myself and am interested in finding out:
A) Is it best to trim the edges of raw reads for bad quality?
B) Should reads below a certain overall quality be removed?
C) Should PCR duplicates be removed?
D) Coverage also matters when variants are called. What minimum and maximum coverage thresholds should be used when deciding whether a call is real or artifactual?
E) Lastly, what quality threshold should one use when filtering calls? Where do we draw the line below which a call is an artifact and above which it is potentially real?
My own thoughts: for A and B, I am usually against trimming and filtering raw reads, because I am afraid of the bias it might introduce. Removing PCR duplicates seems like a good idea, but it is tricky with paired-end data, because both reads of a pair would need to be identical for it to count as a PCR duplicate. The authors of (1) removed PCR duplicates, filtered, and trimmed. The authors of (2) only trimmed Ns and adapters and filtered out reads shorter than 50 bp, which I think is fine. The authors of the third paper (3) did not preprocess their data in any way.
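As far as I know, tools like Picard's MarkDuplicates actually key on the alignment coordinates of both mates rather than on exact sequence identity, which sidesteps that problem. As a rough illustration of the idea, here is a minimal sketch that flags putative duplicates by pair coordinates (it assumes pysam is available, and "input.bam" is just a placeholder path):

```python
# Minimal sketch: flag putative PCR duplicates by keying each read on the
# alignment coordinates of both mates. Assumes pysam; "input.bam" is a
# placeholder for a real BAM file.
import pysam

seen = set()
duplicate_reads = 0
with pysam.AlignmentFile("input.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or not read.is_paired or read.mate_is_unmapped:
            continue
        key = (read.reference_id, read.reference_start,
               read.next_reference_start, read.is_reverse)
        if key in seen:
            duplicate_reads += 1  # same pair coordinates: likely a PCR duplicate
        else:
            seen.add(key)

print(f"putative duplicate reads: {duplicate_reads}")
```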
For D, most people I see use a minimum DP of 5 or 10 (1), which I am fine with; I think 10 is solid. But what about maximum DP? Whether we are sequencing a population to estimate allele frequencies or an individual to find heterozygous sites, we should be cautious about gene duplications: unusually high coverage can indicate multiple copies of a gene in the genome, which makes variant analysis tricky. I am interested in heterozygosity, so what I intend to do is take the average DP over all calls and set the upper DP threshold at 1.5x that amount. Opinions on this?
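Concretely, deriving that upper cutoff could look something like this (a sketch only; it assumes FreeBayes wrote a per-site DP into the INFO column, and "calls.vcf" is a placeholder filename):

```python
# Sketch: compute the mean per-site DP from a VCF and set the upper
# depth cutoff at 1.5x that mean. Assumes a DP=... entry in INFO;
# "calls.vcf" is a placeholder filename.
depths = []
with open("calls.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        info = line.rstrip("\n").split("\t")[7]
        for field in info.split(";"):
            if field.startswith("DP="):
                depths.append(int(field[3:]))
                break

mean_dp = sum(depths) / len(depths)
print(f"mean DP = {mean_dp:.1f}, upper cutoff = {1.5 * mean_dp:.1f}")
```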
For E, I see that (1) and (2) used 80 and 100 as the magic line. Here I am really not sure: is 100 stringent enough, or do we need to go higher?
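Putting D and E together, the whole filter I have in mind would look roughly like this (again just a sketch to make the logic explicit; vcflib's vcffilter can do the same kind of thing from the command line; the thresholds are illustrative, with MAX_DP standing in for 1.5x an assumed mean DP of 50):

```python
# Sketch: keep only calls inside the DP window and above the QUAL cutoff.
# Thresholds are illustrative; MAX_DP would come from the mean-DP computation
# above. Assumes a numeric QUAL column and a DP=... entry in INFO.
MIN_DP, MAX_DP, MIN_QUAL = 10, 75, 100

with open("calls.vcf") as vcf, open("filtered.vcf", "w") as out:
    for line in vcf:
        if line.startswith("#"):
            out.write(line)  # pass header lines through unchanged
            continue
        fields = line.split("\t")
        qual = float(fields[5])
        dp = None
        for item in fields[7].split(";"):
            if item.startswith("DP="):
                dp = int(item[3:])
                break
        if dp is not None and MIN_DP <= dp <= MAX_DP and qual >= MIN_QUAL:
            out.write(line)
```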
Thank you for your time, Adrian
References:
(1) http://www.biomedcentral.com/1471-2164/14/41
(2) http://www.biomedcentral.com/1471-2164/14/64
(3) http://www.biomedcentral.com/1471-2164/14/467
A&B) Bad quality will affect the mapping and also calling, personally, I will perform trimming and filtering if the read quality is bad (or request for re-sequencing from the company)
C) Removing PCR duplicates is important, including for paired-end sequencing.
D) For the maximum DP, many people use 250, but I am not really sure about that.
Have you looked into using Variant Quality Score Recalibration (VQSR)?
Just looked into it. Very nice idea: there are many criteria by which to filter (strand bias, alternate allele position...), but it needs a prior, i.e. a training/truth data set, and unfortunately I have nowhere to find one.
The only thing I have is 6 populations, so some sites will be shared among them. Maybe I could try using those shared sites as a training/truth set.
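To make that concrete, the naive version would just intersect the sites called in every population (a sketch; the pop*.vcf filenames are placeholders for my six population VCFs):

```python
# Sketch: build a candidate training/truth set from sites called in all
# populations. "pop1.vcf" .. "pop6.vcf" are placeholder filenames.
def sites(path):
    """Return the set of (CHROM, POS, REF, ALT) tuples called in a VCF."""
    out = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            out.add((chrom, pos, ref, alt))
    return out

vcfs = [f"pop{i}.vcf" for i in range(1, 7)]
shared = sites(vcfs[0])
for path in vcfs[1:]:
    shared &= sites(path)  # keep only sites seen in every population

print(f"{len(shared)} sites shared across all {len(vcfs)} populations")
```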
With what species are you working, just out of curiosity?