- Why should I filter reads before mapping (with Bowtie2) besides saving computational resource / time?
- If I use local alignment in Bowtie2, will reads with a perfect (or very good) global match also be aligned locally? (I do not want to get multiple mappings just because a local alignment is reported where a global alignment exists.)
More info:
I am analysing ChIP-seq data for Arabidopsis thaliana (i.e. small genome).
I have some read datasets of generally good quality, although some base qualities approach zero, of course. The distribution of nucleotides in some datasets also shows a slight imbalance at the beginning of the reads (perhaps remnants of adapter or barcode sequences?).
After playing around with different ways of filtering, I would like to have all reads of the same length (better for downstream programs).
Can you give more information about the "filtering" in question #1? There are lots of ways you can filter your reads, and there are different reasons and considerations surrounding the choice of filter(s).
I mean filtering in general. Why not just run bowtie2 on the raw dataset? (I do not mind the extra running time for the c. 10 % of reads that would be discarded.) It is a ChIP-seq analysis, not SNP calling or anything else.
Well, again, it depends on the specifics of the experiment. But a lot of times you'll want to trim instead of filter. For example, trimming off low-quality bases or adapters can prevent a read from being ditched or penalized during alignment, allowing you to capture data that you would otherwise be throwing out.
Right, but as I wrote, I prefer keeping all reads at the same length (some downstream programs might have problems with variable read lengths). Then I would have to either trim the whole dataset (and lose some nucleotides needed for correct mapping) or discard (filter) such reads.
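If a uniform read length is the priority, one option is to crop every read to a fixed length rather than quality-trim to variable lengths; this is what Trimmomatic's CROP step does. A minimal sketch with plain awk (the file name `reads.fastq` and the 36 nt cutoff are placeholders, and a standard four-line-per-record FASTQ is assumed):

```shell
# Crop every read to 36 nt. In a four-line FASTQ record the sequence and
# quality strings are the even-numbered lines, so truncate only those;
# substr() leaves reads shorter than 36 nt unchanged.
awk 'NR % 2 == 0 { print substr($0, 1, 36); next } { print }' reads.fastq > trimmed.fastq
```

Note that this keeps reads the same length only if the input reads are all at least 36 nt; shorter reads would still need to be filtered out separately.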
That is why I am thinking about combining global and local alignment (when potential adapters and low quality ends should not matter).
But quality filtering before mapping is generally recommended, even in ChIP-seq guidelines (e.g. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data), and I cannot see why.
The way bowtie is written, you can't combine --end-to-end and --local modes. But as Istvan shows, it's as easy as filtering out non-primary alignments using samtools or Picard's FilterSamReads tool.
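Concretely, secondary alignments carry bit 256 in the SAM FLAG field, so the filter mentioned above amounts to something like `samtools view -b -F 256 aligned.bam > primary.bam` (file names are placeholders). The bit test itself can be illustrated with awk on a toy SAM fragment (the read names and fields here are made up):

```shell
# Keep only records whose FLAG (column 2) does not have bit 256
# ("secondary alignment") set; int($2 / 256) % 2 extracts that bit
# portably in POSIX awk. Only the flag-0 primary record survives.
printf 'r1\t0\tChr1\t100\t42\t36M\t*\t0\t0\t*\t*\nr1\t256\tChr1\t900\t0\t20M16S\t*\t0\t0\t*\t*\n' \
  | awk -F '\t' 'int($2 / 256) % 2 == 0'
```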
A better question when it comes to quality filtering, in my opinion, is "why not do quality filtering?" Why wouldn't you want to remove garbage reads beforehand, thus ensuring they don't end up as false positives and take up computational time? Computationally, it requires much less work to throw out a nasty read than to attempt to map it.
Because I do not feel good about throwing away reads that may be of very good quality overall and only have poor-quality ends, or about trimming the ends off all reads.
Simply: I want to maximize the outcome.
What do you mean by "you can't combine --end-to-end and --local modes"? Do you mean running bowtie2 twice and merging the outputs?

What I mean is that you can't do what you're wanting to do in a single bowtie2 run. You can't run it in --end-to-end mode and have it report only global alignments for a given read while ignoring local alignments. There are many ways around this using multiple steps, some of which you've proposed yourself, but I don't know of any way to do it in a single pass.
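One multi-step workaround along the lines discussed in this thread is to map end-to-end first, collect the reads that fail to align using bowtie2's --un option, and rerun only those in local mode. A sketch (the index name `tair10` and all file names are placeholders):

```shell
# Pass 1: global (end-to-end) alignment; reads with no end-to-end
# alignment are written to leftover.fastq.
bowtie2 --end-to-end -x tair10 -U reads.fastq --un leftover.fastq -S global.sam

# Pass 2: local alignment only for the reads that had no global hit,
# so a read can never get both a global and a local mapping.
bowtie2 --local -x tair10 -U leftover.fastq -S local.sam

# Sort and merge into a single BAM for downstream tools.
samtools sort -o global.bam global.sam
samtools sort -o local.bam local.sam
samtools merge combined.bam global.bam local.bam
```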