I have an experimental genome in which many bases are ambiguous (e.g. "R" in places where we can't tell whether the base is "A" or "G", but we know it's not C or T etc. --We use the standard IUPAC codes from here). Let's call this "Experimental Genome" EG. I made a copy of this genome and replaced all ambiguity characters with "N",s --this destroys some information, but I thought it might reduce complexity of alignment somewhere, I'll call this genome EGN.
When I tried mapping some bisulfite-treated reads to these reference genomes with bsmap or bismark I was getting quite low mapping efficiencies (~25%), and from this post I learned that anything based on bowtie can't handle ambiguity characters. For this reason I switched to BWA-mem and got much higher alignment efficiencies (97%) against EGN --in fact, a bit _too_ high to not be suspicious. I'm wondering if maybe I was being too permissive (since anything can be aligned against a string of N's)?
Can anyone direct me to some official documentation on how aligners like BWAmem (or perhaps another) handle ambiguity characters? e.g. if an "A" from my read is aligned to an "M" (or N) in my genome is that considered a perfect match, or is there some penalty? Can it handle _all_ ambiguity characters, or just "N"s?
This would be useful for me to decide which aligner to use, and whether to use EG (which is as specific as we can be at each position, but involves 15 unique characters) or EGN (which is less specific, but only has the additional 5th character "A,C,G,T, and N").
Thanks for any help you can offer.
It looks like
bwa
can't use ambiguous bases since the answer you linked in the thread above was from author ofbwa
(A: Mapping To Genome With Ambiguous Reference Characters (R,Y,K,M,S,W Etc.) ).There does not appear to be an actively maintained NGS aligner (I could be wrong) that can handle IUPAC codes. You could use
blast+
(you will need to convert your reads into fasta format).Thanks for the reply. Can you elaborate a bit on "can't use" -- I mean: every aligner allows some mismatches, right? so if I have a read with
...CAT...
and a reference genome with...CGT...
, then that's 1 mis-match. But if my reference genome has...CNT...
, does that get counted as 1 mismatch, or is the whole alignment process screwed up so that nothing gets aligned at all? (obviously the ideal would be to count it as a match, but assuming that's not possible...).I thought you were asking if an aligner can align using all possible variations of a IUPAC character. If you are looking for one base pair difference like above then that is different. Instead of
N
you could consistently replace IUPAC character with one possible base and then keep track alignment to that.That was, my initial question, but as the next-best option I took my reference Genome (EG) and made a copy (EGN) with all of the other ambiguity characters (like R,Y,S, etc.) substituted with just "N". So now my new reference Genome (EGN) contains only 5 characters: A,C,G,T, and N (There are too many of this unresolved positions to cycle through all of them).
I'm looking for documentation describing how the alignment of a read sequence "GAT" against a genome sequence of "GNT" will be treated. Does BWAmem consider it to be (1) a perfect alignment? (2) an alignment with a single-mismatch? or (3) no alignment at all?
Thanks again.
I guess maybe I didn't phrase my question clearly, but I'm looking for some kind of documentation from BWAmem, (or any aligner at this point) describing how they handle ambiguity characters.
If this is my reference Genome:
And this is my read:
Does this alignment count as:
(1) a perfect alignment (2) an alignment with a single mis-match, or (3) impossible because nothing can be aligned against N at all.
Has anybody answered this question anywhere?
I don't know if documentation exists but
N
are generally considered "no call" in NGS world i.e. the software could not call a base with any certainty.bwa
seems to ignore N's in reference when aligning. This is a limited example so please test with additional examples.Created a reference with some N's (all records truncated to save space)
Fake reads made
Aligned using
bwa
I didn't quite follow the logic of what you did there, but it seems like the conclusion is (1) --that "A" against "N" is counted as a valid alignment and not a mismatch ... which is perfect! thanks very much.
In CIGAR string it does indicate
70M
but in the SAM tag it tells you that there is one base off (NM:i:1
), test on a larger example (with consecutive N's if that applies) and confirm.