I have some very long RNA-seq reads (250 nt) that I will be mapping with Tophat, and I am thinking of raising the number of allowed mismatches. I used the default of 2 for 100 nt reads, but I am considering 5 for the 250 nt reads (1 mismatch per 50 nt).
I am concerned that Tophat would place 3 or more mismatches in a row (at adjacent nucleotides), and I would like to prevent that. Would Tophat (and Bowtie) map such a read, and if so, is there a setting that stops it (other than keeping the mismatches at the default of 2)?
If I cannot prevent this, is there an easy way to filter such reads out of the BAM/SAM file?
Many thanks,
James
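On the filtering part of the question, here is a minimal sketch in Python, assuming the alignments carry the standard MD:Z tag (samtools calmd can add one if it is missing). In an MD string, adjacent mismatches are separated by a zero match count, e.g. 10A0C0G5 has three mismatches in a row. The names max_mismatch_run and filter_sam are made up for this illustration and are not part of Tophat or samtools. Caveat: MD describes reference positions only, so mismatches that flank an insertion or a splice junction will also look adjacent here.

```python
import re

# MD tokens: a match count, a deletion (^ followed by bases), or one mismatched base.
_MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def max_mismatch_run(md):
    """Return the longest run of adjacent mismatches encoded in an MD string."""
    longest = 0
    run = 0
    for matches, deletion, mismatch in _MD_TOKEN.findall(md):
        if mismatch:
            run += 1
            longest = max(longest, run)
        elif deletion or int(matches) > 0:
            # A deletion or at least one matching base breaks the run.
            run = 0
        # A match count of 0 only separates tokens, so the run continues.
    return longest


def filter_sam(lines, max_run=2):
    """Yield SAM lines whose longest mismatch run is at most max_run.

    Header lines and records without an MD tag are passed through unchanged.
    """
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), None)
        if md is None or max_mismatch_run(md) <= max_run:
            yield line
```

One way to use it would be to feed it the lines from samtools view -h, write the output of filter_sam(sys.stdin) to sys.stdout, and convert back to BAM with samtools view -b. Whether this kind of filtering is a good idea is a separate question.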
Thank you for your reply. I am trying to compare what you said with the options in the Tophat manual. I have selected the options below for my run, allowing 5 mismatches. Do you know if these settings will prohibit more than 2 mismatches in a row? I think read-gap-length will do this, but I am not sure whether read-edit-dist interferes with it.
--read-mismatches 5 (default 2)
--read-gap-length (left as default 2)
--read-edit-dist 5 (default 2; I had to change it to 5 when I increased read-mismatches)
Thanks again,
James
I think you are overly hung up on the fear of two mismatches in a row. Imagine that your data comes from a sample that really does have two mismatches in a row: why would you not want that to be reported correctly? It would be scientifically inappropriate to forbid this a priori. In general it is rare to get multiple mismatches in a row by accident, since the mismatch penalties are typically higher than gap open + extension, so some other alternative alignment will be found. I would recommend moving on and not worrying about something that rarely happens and, when it does happen, is probably correct anyhow.
I am fine with 2 mismatches in a row. It is more than 2 mismatches in a row that troubled my lab, so I am trying to see whether my mapping approach as described above would stop 3 or more mismatches in a row. I also do not think it would be bad to have some rare cases with multiple mismatches in a row, but I wanted to better understand what I had done and to see whether my lab's fear is even real, which I am still having trouble judging.
2 or 3 or 4 or 5 makes no difference, and like I said, the answer is not to filter it out or forbid it from happening. If your aligner reports three mismatches in a row, then that is the most likely alignment given the parameters you have set. And that's that; the way around it is not to filter out just this one thing while allowing all the others. It would be pretty absurd (and bad science) to remove three mismatches in a row but allow three mismatches as long as one base separates each of them. The latter is a far more suspicious alignment IMO.
It is a very good point and I would not want to bias my analysis in an unfair way. Thanks for the advice.