Hi all,
Recently I have been working with exome-seq data to call variants using the bwa + GATK + VarScan approach, which is widely used by researchers.
As pointed out on the GATK forum, we can now use the bwa mem command in place of the old bwa aln + sampe steps to map reads directly to the reference genome.
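For reference, my understanding is that the replacement looks roughly like this (reference and read file paths are just placeholders):

```
# old two-step workflow
bwa aln ref.fa reads_R1.fq.gz > R1.sai
bwa aln ref.fa reads_R2.fq.gz > R2.sai
bwa sampe ref.fa R1.sai R2.sai reads_R1.fq.gz reads_R2.fq.gz > aln.sam

# new single-step replacement
bwa mem ref.fa reads_R1.fq.gz reads_R2.fq.gz > aln.sam
```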
Now I have been given two pieces of advice: 1) use mem directly without any trimming step, since I'm told the mapping quality is good enough; 2) first run FastQC to check whether the read quality is good, then use a tool such as fastx or cutadapt to trim the reads for better alignment.
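In other words, the two suggestions amount to something like this (adapter sequences and file names are placeholders; I'm not sure which is better):

```
# Option 1: map directly, no trimming
bwa mem -t 8 ref.fa reads_R1.fq.gz reads_R2.fq.gz > aln.sam

# Option 2: check quality, trim, then map
fastqc reads_R1.fq.gz reads_R2.fq.gz
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20 \
    -o trimmed_R1.fq.gz -p trimmed_R2.fq.gz \
    reads_R1.fq.gz reads_R2.fq.gz
bwa mem -t 8 ref.fa trimmed_R1.fq.gz trimmed_R2.fq.gz > aln.sam
```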
So, has anyone come across this issue and can share some tips?
Thanks!
Trimming is generally a good idea. It makes both the alignment and variant calling steps faster and slightly more reliable.
Reference? Saw a few papers go by on this, with a few recommending only light trimming. See, e.g.: http://biorxiv.org/content/biorxiv/early/2013/11/14/000422.full.pdf which looks at transcript assembly but recommends only gentle trimming.
The author of that paper goes over another study on trimming here: http://genomebio.org/is-trimming-is-beneficial-in-rna-seq/
That over-aggressive trimming proves deleterious isn't exactly surprising to me. The importance of trimming and how stringently one should do it depend on (1) the length of the reads, (2) the type of experiment, and (3) the aligner used and its options. For example, aligners using end-to-end alignment (e.g., bowtie2 by default) will be more susceptible to mapping errors (or producing alignments with unduly low alignment scores) if low-quality read ends or read ends containing adapter contamination aren't trimmed.

For many applications (RNA-seq or SNP calling are good examples) this really won't make a big difference. I don't disagree with Matthew MacManes there, and I find people suggesting trimming at Phred=30, or always lopping the last 25bp off reads, to be just silly. For other applications (I'm doing a lot of bisulfite sequencing these days), this can really screw things up (though much less so for Bison, the aligner I wrote, than Bismark).

As reads get longer and we transition to local alignment and downstream analyses that account for base-call quality, I expect the importance of quality trimming to wane significantly.
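To illustrate the end-to-end vs. local point with bowtie2 (index and read file names are placeholders):

```
# default end-to-end mode: every base of the read must align, so
# low-quality or adapter-contaminated ends drag the score down
bowtie2 -x genome_index -U reads.fq -S end2end.sam

# local mode: poor read ends can be soft-clipped, so leaving
# them untrimmed matters much less
bowtie2 --local -x genome_index -U reads.fq -S local.sam
```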
Anyway, good to see that someone's trying to put some objective numbers to the process!
Edit: I should add that I include adapter trimming under the general heading of trimming. It seems that Matthew MacManes is in favor of continued adapter trimming, though I'm sure many people take that over the top too.
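For example, adapter-only trimming, without touching base qualities at all, is a one-liner with cutadapt (the adapter shown is the common Illumina read-through sequence; substitute your own):

```
# remove adapter read-through only; base qualities are left alone
cutadapt -a AGATCGGAAGAGC -o trimmed.fq reads.fq
```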
These are interesting links. The first paper appears to only use Trimmomatic, and the second link suggests you'll get very different results depending on the tool, so I wouldn't call this conclusive. I think the quality of the data and the research question should dictate how to treat the reads and coming up with a universal trimming rule shouldn't be the goal.
Also, if you look closely (following the second link), you'll see that mappability does decrease with moderate trimming for all tools except FASTX and PRINSEQ. I think the others must use the Phred algorithm, which leads to a decrease in mapping percentage with increasing threshold. I would expect seqtk to show the same pattern, since it also uses this algorithm.
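As a sketch of what that looks like with seqtk (note that trimfq's -q option takes an error-rate threshold rather than a Phred score, so 0.05 corresponds to roughly Q13 and 0.01 to Q20; file names are placeholders):

```
# Phred/modified-Mott style trimming at two stringencies
seqtk trimfq -q 0.05 reads.fq > trimmed_q13.fq   # default, gentle
seqtk trimfq -q 0.01 reads.fq > trimmed_q20.fq   # stricter
```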