Hello,
I have been reading a few papers that cite FreeBayes, because I am curious about what people treat as real variants and what they discard as artifacts.
I use Illumina myself and am interested in finding out:
A) Is it best to trim the edges of raw reads for bad quality?
B) Should reads below a certain overall quality be removed?
C) Should PCR duplicates be removed?
D) Coverage also matters when variants are called. What minimum and maximum coverage thresholds should be used when deciding whether a call is real or artifactual?
E) Lastly, what quality threshold should one use when filtering calls? Where do we draw the line below which a call is an artifact and above which it is potentially real?
My own thoughts: for A and B, I am usually against trimming and filtering raw reads, because I am afraid of the bias it might introduce. Removing PCR duplicates seems like a good idea, but it is tricky with paired-end data, because both reads of a pair would need to be identical for it to count as a PCR duplicate. The authors of (1) removed PCR duplicates, filtered, and trimmed. The authors of (2) only trimmed Ns and adapters and filtered out reads shorter than 50 bp, which I think is fine. The authors of the third paper (3) did not preprocess their data in any way.
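As far as I know, tools like Picard's MarkDuplicates actually key on the alignment coordinates of both mates rather than on exact sequence identity, which sidesteps that problem. As a rough illustration of the idea, here is a minimal sketch that flags putative duplicates by pair coordinates (it assumes pysam is available, and "input.bam" is just a placeholder path):

```python
# Minimal sketch: flag putative PCR duplicates by keying each read on the
# alignment coordinates of both mates. Assumes pysam; "input.bam" is a
# placeholder for a real BAM file.
import pysam

seen = set()
duplicate_reads = 0
with pysam.AlignmentFile("input.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or not read.is_paired or read.mate_is_unmapped:
            continue
        key = (read.reference_id, read.reference_start,
               read.next_reference_start, read.is_reverse)
        if key in seen:
            duplicate_reads += 1  # same pair coordinates: likely a PCR duplicate
        else:
            seen.add(key)

print(f"putative duplicate reads: {duplicate_reads}")
```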
For D, most people I see use a minimum DP of 5 or 10 (1), which I am fine with; I think 10 is solid. But what about maximum DP? Whether we are sequencing a population to estimate allele frequencies or an individual to find heterozygous sites, we should be cautious about gene duplications: unusually high coverage can indicate multiple copies of a gene in the genome, which makes variant analysis tricky. I am interested in heterozygosity, so what I intend to do is take the average DP over all calls and set the upper DP threshold at 1.5x that amount. Opinions on this?
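Concretely, deriving that upper cutoff could look something like this (a sketch only; it assumes FreeBayes wrote a per-site DP into the INFO column, and "calls.vcf" is a placeholder filename):

```python
# Sketch: compute the mean per-site DP from a VCF and set the upper
# depth cutoff at 1.5x that mean. Assumes a DP=... entry in INFO;
# "calls.vcf" is a placeholder filename.
depths = []
with open("calls.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        info = line.rstrip("\n").split("\t")[7]
        for field in info.split(";"):
            if field.startswith("DP="):
                depths.append(int(field[3:]))
                break

mean_dp = sum(depths) / len(depths)
print(f"mean DP = {mean_dp:.1f}, upper cutoff = {1.5 * mean_dp:.1f}")
```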
For E, I see that (1) and (2) used 80 and 100 as the magic line. Here I am really not sure: is 100 stringent enough, or do we need to go higher?
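Putting D and E together, the whole filter I have in mind would look roughly like this (again just a sketch to make the logic explicit; vcflib's vcffilter can do the same kind of thing from the command line; the thresholds are illustrative, with MAX_DP standing in for 1.5x an assumed mean DP of 50):

```python
# Sketch: keep only calls inside the DP window and above the QUAL cutoff.
# Thresholds are illustrative; MAX_DP would come from the mean-DP computation
# above. Assumes a numeric QUAL column and a DP=... entry in INFO.
MIN_DP, MAX_DP, MIN_QUAL = 10, 75, 100

with open("calls.vcf") as vcf, open("filtered.vcf", "w") as out:
    for line in vcf:
        if line.startswith("#"):
            out.write(line)  # pass header lines through unchanged
            continue
        fields = line.split("\t")
        qual = float(fields[5])
        dp = None
        for item in fields[7].split(";"):
            if item.startswith("DP="):
                dp = int(item[3:])
                break
        if dp is not None and MIN_DP <= dp <= MAX_DP and qual >= MIN_QUAL:
            out.write(line)
```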
Thank you for your time, Adrian
References:
(1) http://www.biomedcentral.com/1471-2164/14/41
(2) http://www.biomedcentral.com/1471-2164/14/64
(3) http://www.biomedcentral.com/1471-2164/14/467
A&B) Bad quality will affect the mapping and also calling, personally, I will perform trimming and filtering if the read quality is bad (or request for re-sequencing from the company)
C) Removing PCR duplicates is important, including for paired-end sequencing.
D) For the maximum DP, many people use 250, but I am not really sure about that.
Have you looked into using Variant Quality Score Recalibration (VQSR)?
Just looked into it. Very nice idea: there are many criteria by which to filter (strand bias, alternate allele position...), but it needs a prior, i.e. a training/truth data set, and unfortunately I have nowhere to find one.
The only thing I have is 6 populations, so some sites will be shared among them. Maybe I could try using those shared sites as a training/truth set.
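To make that concrete, the naive version would just intersect the sites called in every population (a sketch; the pop*.vcf filenames are placeholders for my six population VCFs):

```python
# Sketch: build a candidate training/truth set from sites called in all
# populations. "pop1.vcf" .. "pop6.vcf" are placeholder filenames.
def sites(path):
    """Return the set of (CHROM, POS, REF, ALT) tuples called in a VCF."""
    out = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            out.add((chrom, pos, ref, alt))
    return out

vcfs = [f"pop{i}.vcf" for i in range(1, 7)]
shared = sites(vcfs[0])
for path in vcfs[1:]:
    shared &= sites(path)  # keep only sites seen in every population

print(f"{len(shared)} sites shared across all {len(vcfs)} populations")
```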
With what species are you working, just out of curiosity?