How are ambiguity characters handled in BWAmem?
0
0
Entering edit mode
4.2 years ago
Bosberg ▴ 50

I have an experimental genome in which many bases are ambiguous (e.g. "R" in places where we can't tell whether the base is "A" or "G", but we know it's not C or T etc. --We use the standard IUPAC codes from here). Let's call this "Experimental Genome" EG. I made a copy of this genome and replaced all ambiguity characters with "N",s --this destroys some information, but I thought it might reduce complexity of alignment somewhere, I'll call this genome EGN.

When I tried mapping some bisulfite-treated reads to these reference genomes with bsmap or bismark I was getting quite low mapping efficiencies (~25%), and from this post I learned that anything based on bowtie can't handle ambiguity characters. For this reason I switched to BWA-mem and got much higher alignment efficiencies (97%) against EGN --in fact, a bit _too_ high to not be suspicious. I'm wondering if maybe I was being too permissive (since anything can be aligned against a string of N's)?

Can anyone direct me to some official documentation on how aligners like BWAmem (or perhaps another) handle ambiguity characters? e.g. if an "A" from my read is aligned to an "M" (or N) in my genome is that considered a perfect match, or is there some penalty? Can it handle _all_ ambiguity characters, or just "N"s?

This would be useful for me to decide which aligner to use, and whether to use EG (which is as specific as we can be at each position, but involves 15 unique characters) or EGN (which is less specific, but only has the additional 5th character "A,C,G,T, and N").

Thanks for any help you can offer.

sequence alignment genome • 2.6k views
ADD COMMENT
0
Entering edit mode

Can anyone direct me to some official documentation on how aligners like BWAmem (or perhaps another) handle ambiguity characters?

It looks like bwa can't use ambiguous bases since the answer you linked in the thread above was from author of bwa (A: Mapping To Genome With Ambiguous Reference Characters (R,Y,K,M,S,W Etc.) ).

There does not appear to be an actively maintained NGS aligner (I could be wrong) that can handle IUPAC codes. You could use blast+ (you will need to convert your reads into fasta format).

ADD REPLY
0
Entering edit mode

Thanks for the reply. Can you elaborate a bit on "can't use" -- I mean: every aligner allows some mismatches, right? so if I have a read with ...CAT... and a reference genome with ...CGT..., then that's 1 mis-match. But if my reference genome has ...CNT..., does that get counted as 1 mismatch, or is the whole alignment process screwed up so that nothing gets aligned at all? (obviously the ideal would be to count it as a match, but assuming that's not possible...).

ADD REPLY
0
Entering edit mode

I thought you were asking if an aligner can align using all possible variations of a IUPAC character. If you are looking for one base pair difference like above then that is different. Instead of N you could consistently replace IUPAC character with one possible base and then keep track alignment to that.

ADD REPLY
0
Entering edit mode

That was, my initial question, but as the next-best option I took my reference Genome (EG) and made a copy (EGN) with all of the other ambiguity characters (like R,Y,S, etc.) substituted with just "N". So now my new reference Genome (EGN) contains only 5 characters: A,C,G,T, and N (There are too many of this unresolved positions to cycle through all of them).

I'm looking for documentation describing how the alignment of a read sequence "GAT" against a genome sequence of "GNT" will be treated. Does BWAmem consider it to be (1) a perfect alignment? (2) an alignment with a single-mismatch? or (3) no alignment at all?

Thanks again.

ADD REPLY
0
Entering edit mode

I guess maybe I didn't phrase my question clearly, but I'm looking for some kind of documentation from BWAmem, (or any aligner at this point) describing how they handle ambiguity characters.

If this is my reference Genome:

...GNT...

And this is my read:

...GAT...

Does this alignment count as:

(1) a perfect alignment (2) an alignment with a single mis-match, or (3) impossible because nothing can be aligned against N at all.

Has anybody answered this question anywhere?

ADD REPLY
1
Entering edit mode

I don't know if documentation exists but N are generally considered "no call" in NGS world i.e. the software could not call a base with any certainty.

bwa seems to ignore N's in reference when aligning. This is a limited example so please test with additional examples.

Created a reference with some N's (all records truncated to save space)

$ grep ACAACTAATGGTGACTTTTTGCATTTCT NC_045512.fa
ACAACTAATGGTGACTTTTTGCATTTCTTACC**N**AGAGTTTTTAGTGCAGTTGGTAACATCTGTTACACAC

Fake reads made

$ grep -A 3 SYN_5_533_632_0_+_8533_1_._NC_045512.2 check.fq 
@SYN_5_533_632_0_+_8533_1_._NC_045512.2 1:
ACAACTAATGGTGACTTTTTGCATTTCTTACCAAGAGTTTTTAGTGCAGTTGGTAACATCTGTTACACAC
+
6275;7937<<?5:=<:=A<:;8@96<:9;<6?=99=;>:=;7B9768@:88449<4>7?A>:>3::<84

Aligned using bwa

SYN_5_533_632_0_+_8533_1_._NC_045512.2  0       NC_045512.2     1891    60      70M     *       0       0       ACAACTAATGGTGACTTTTTGCATTTCTTACCAAGAGTTTTTAGTGCAGTTGGTAACATCTGTTACACAC  6275;7937<<?5:=<:=A<:;8@96<:9;<6?=99=;>:=;7B9768@:88449<4>7?A>:>3::<84  NM:i:1  MD:Z:32G37      AS:i:65 XS:i:0
ADD REPLY
0
Entering edit mode

I didn't quite follow the logic of what you did there, but it seems like the conclusion is (1) --that "A" against "N" is counted as a valid alignment and not a mismatch ... which is perfect! thanks very much.

ADD REPLY
0
Entering edit mode

In CIGAR string it does indicate 70M but in the SAM tag it tells you that there is one base off (NM:i:1), test on a larger example (with consecutive N's if that applies) and confirm.

ADD REPLY

Login before adding your answer.

Traffic: 1729 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6