Question

Any tool that aligns sequences that use degenerated nucleotide coding

0

7.4 years ago

dabid • 0

Hi,

I have short sequences that use degenerate nucleotide codes (such as R, Y, B, N, etc.). I want to align these sequences to a longer sequence that uses only the standard nucleotides (A, T, C, and G). For example, the tool should align N to any of A, T, C, or G, so it should not treat N as a different nucleotide from them. Do you know of any alignment tool that takes degenerate nucleotide codes into account?

Thanks!

alignment dna degenerated coding sequence • 7.6k views

ADD COMMENT • link updated 5.1 years ago by Wayne ★ 2.1k • written 7.4 years ago by dabid • 0

0

I think you're best off writing something yourself to handle this.

IUPAC nucleotide code    Base
A                        Adenine
C                        Cytosine
G                        Guanine
T (or U)                 Thymine (or Uracil)
R                        A or G
Y                        C or T
S                        G or C
W                        A or T
K                        G or T
M                        A or C
B                        C or G or T
D                        A or G or T
H                        A or C or T
V                        A or C or G
N                        any base
. or -                   gap
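
As a starting point for writing something yourself, the table above can be turned into a lookup. This is only a minimal sketch: the names `IUPAC` and `compatible` are made up for illustration, U is folded into T, and gap characters are left out.

```python
# IUPAC nucleotide codes mapped to the bases each code can stand for.
# Assumption for this sketch: U is treated as T, gaps (. or -) are not handled.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(code, base):
    """Return True if the standard base is one of the bases the code allows."""
    return base.upper() in IUPAC[code.upper()]


print(compatible("N", "A"))  # N allows any base
print(compatible("R", "T"))  # R allows only A or G
```

Any scoring scheme can then call `compatible` instead of comparing letters directly.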

ADD REPLY • link 7.4 years ago by st.ph.n ★ 2.7k

0

It is odd that your short reads have ambiguities. Why is that? Are you trying to salvage a low-quality sequencing run?

There are programs that can handle degenerate bases in the reference. Would that help you?

ADD REPLY • link 7.4 years ago by h.mon 35k

0

Can you please explain what you mean by handling degenerate bases in the reference? Can it consider these degenerate bases in the alignment as matches and not as mismatches, all the time?

ADD REPLY • link 7.4 years ago by dabid • 0

0

Can it consider these degenerated bases in the alignment as matches and not as mismatches -all the time?

Yes, BWBBLE does exactly that. And HISAT2 can index a reference genome + SNPs.

However, you didn't answer why your reads have ambiguities. I get the feeling that if you tell us exactly what you want to do, you will get better answers.

ADD REPLY • link 7.4 years ago by h.mon 35k

0

They are primer sequences, and I need to align them to a database. Thank you for your suggestions; I will look at them closely and see whether they work for my case.

ADD REPLY • link 7.4 years ago by dabid • 0

0

5.1 years ago

Wayne ★ 2.1k

The letters that PatMatch patterns allow match the IUPAC ambiguity codes as well. See the syntax nicely represented at the Saccharomyces Genome Database's PatMatch tool portal.

For more about the tool and various places to use it, I have a summary and list here. On that page you can also launch PatMatch to run in your browser for examining any sequence you can provide. Use the CyVerse offering if you need more storage than the MyBinder system allows.

I have a couple of Python-based utility scripts for working with PatMatch available here. One, patmatch_results_to_df.py, converts the output to a dataframe that can easily be analyzed further or saved as a text table that opens in any spreadsheet software. The other, matches_a_patmatch_pattern.py, tells you whether a provided sequence contains a match to a pattern containing ambiguous/degenerate codes. Those scripts are demonstrated in the browser-based system launchable here.

ADD COMMENT • link 5.1 years ago by Wayne ★ 2.1k

score 2 · Accepted Answer · 2017-07-24

2

7.4 years ago

Jean-Karim Heriche 27k

Blastn and exonerate understand IUPAC codes.

ADD COMMENT • link 7.4 years ago by Jean-Karim Heriche 27k

0

I have tried blastn, but it doesn't seem to understand IUPAC codes, unless there is some option I need to set that I am not aware of. Please let me know. I'm not familiar with exonerate; I will have a look at it shortly.

ADD REPLY • link 7.4 years ago by dabid • 0

0

Can you show what you've tried?

ADD REPLY • link 7.4 years ago by Jean-Karim Heriche 27k

0

I am not sure whether I can share my files on this platform; I can email them to you if you want. Anyway, here is one case from the blastn results: the pair GCCTNAGGC / GCCTCTGGC is reported with "align_len": 9 and "identity": 7,

which shouldn't be the case: the identity should be 8, since N can be any nucleotide.
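
To make the expected numbers concrete, here is a small sketch that counts identities for that ungapped pair in two ways: by literal letter comparison, and by treating each IUPAC code as the set of bases it allows. The function name `identity` and its `ambiguity_aware` parameter are invented for this illustration, and which sequence is the query is an assumption (the one containing N).

```python
# IUPAC codes mapped to the standard bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def identity(query, subject, ambiguity_aware=True):
    """Count matching positions in two equal-length, ungapped sequences."""
    count = 0
    for q, s in zip(query.upper(), subject.upper()):
        if ambiguity_aware:
            count += s in IUPAC[q]   # match if the code allows this base
        else:
            count += q == s          # literal comparison, N only matches N
    return count


print(identity("GCCTNAGGC", "GCCTCTGGC", ambiguity_aware=False))  # 7
print(identity("GCCTNAGGC", "GCCTCTGGC"))                         # 8
```

The literal count gives 7, the same number reported above, and the ambiguity-aware count gives 8: the remaining mismatch is the A opposite T.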

ADD REPLY • link 7.4 years ago by dabid • 0

0

This link should help.

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue)

The degenerate nucleotide codes in red are treated as mismatches in nucleotide alignment. Too many such degenerate codes within an input nucleotide query will cause the BLAST webpage to reject the input. For protein queries, too many nucleotide-like code (A,C,G,T,N) may also cause similar rejection

ADD REPLY • link 7.4 years ago by st.ph.n ★ 2.7k

0

The problem here may not be the N but the blastn parameters you're using. Short sequences require low word size and high E value.
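
As a rough illustration of such settings, the sketch below only builds a blastn command line; it does not run it. The flags are standard BLAST+ options, but the specific values (word size 7, E-value 1000) and the file and database names are placeholders chosen for this example, not recommendations from the thread.

```python
import shlex


def blastn_short_command(query_fasta, database, word_size=7, evalue=1000):
    """Build a blastn argument list aimed at short queries such as primers."""
    return [
        "blastn",
        "-task", "blastn-short",        # BLAST+ task preset for short queries
        "-word_size", str(word_size),   # small seed so short hits can be found
        "-evalue", str(evalue),         # permissive cutoff; short hits score poorly
        "-query", query_fasta,
        "-db", database,
        "-outfmt", "6",                 # tabular output
    ]


# "primers.fasta" and "my_db" are hypothetical names.
cmd = blastn_short_command("primers.fasta", "my_db")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where blastn is installed. These settings only affect whether short hits are found; they do not change how degenerate codes are scored.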

ADD REPLY • link 7.4 years ago by Jean-Karim Heriche 27k

0

The blastn doc here indicates that blastn understands IUPAC codes but treats them as mismatches (except for N, see the small footnote).

ADD REPLY • link 7.4 years ago by Jean-Karim Heriche 27k

0

Yeah, I understood that from the beginning. I need a tool that doesn't count a degenerate code as a mismatch unless it really is one. For example, N should never be a mismatch, whereas R matches A and G and mismatches T and C. Any suggestions?

ADD REPLY • link 7.4 years ago by dabid • 0

0

I can think of three ways of dealing with the problem. The first is to unfold the ambiguous sequences into all possible combinations and run blastn or any other tool; the second is to write your own Smith-Waterman or Needleman-Wunsch implementation; the third is to use regular expressions. What would work for you depends on the context.
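
For the second option, here is a minimal sketch of a Smith-Waterman local alignment whose match test is IUPAC-aware. It returns only the best score and where the alignment ends in the target, with no traceback. The function name, the scoring values (match 2, mismatch -1, linear gap -2) and the example target are all assumptions made for illustration.

```python
# IUPAC codes mapped to the standard bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of an IUPAC query against an ACGT target.

    Returns (best_score, end), where end is the 1-based target position at
    which the best-scoring local alignment ends.
    """
    prev = [0] * (len(target) + 1)   # previous row of the scoring matrix
    best, best_end = 0, 0
    for q in query.upper():
        allowed = IUPAC[q]           # bases this query code may stand for
        curr = [0]
        for j, t in enumerate(target.upper(), start=1):
            diag = prev[j - 1] + (match if t in allowed else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best, best_end = score, j
        prev = curr
    return best, best_end


# Hypothetical target containing the sequence from the blastn example.
print(smith_waterman_score("GCCTNAGGC", "TTGCCTCTGGCAA"))
```

With these scores the example yields 15 (eight matches, one mismatch) ending at target position 11, because N is scored as a match. Memory stays linear in the target length since only two rows are kept.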

ADD REPLY • link 7.4 years ago by Jean-Karim Heriche 27k

0

Writing about implementing alignment algorithms made me think about the pairwise alignment algorithms in the Biostrings Bioconductor package which I believe deal with IUPAC codes in the expected way.

ADD REPLY • link 7.4 years ago by Jean-Karim Heriche 27k

0

Unfolding all the ambiguous sequences is not feasible: a file of a few KB can turn into a file of more than 400 GB (I tried it, then gave up when my 500 GB disk ran out of space). I thought about implementing my own tool, but wanted to check first whether one already exists. Regular expressions could be a solution, but then I cannot allow mismatches. I'm not sure, though; can you please elaborate on the third option? Maybe it can help. Thanks!
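
The blow-up is easy to estimate before writing anything to disk: the number of unfolded sequences is the product of the number of bases each code allows. A small sketch (the function name and the 20-mer example are made up for illustration; `math.prod` needs Python 3.8 or later):

```python
import math

# IUPAC codes mapped to the standard bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_expansions(seq):
    """Number of distinct standard sequences one ambiguous sequence unfolds into."""
    return math.prod(len(IUPAC[code]) for code in seq.upper())


print(count_expansions("GCCTNAGGC"))             # one N: 4 sequences
print(count_expansions("ACGTNNNNNNNNNNACGTAC"))  # ten Ns: 4**10 sequences
```

A single 20-mer with ten Ns already unfolds into 1,048,576 sequences, which is how a few KB of primers can grow to hundreds of GB.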

ADD REPLY • link 7.4 years ago by dabid • 0

0

The regex approach would work best if you're looking for relatively short exact matches. The idea is to expand the IUPAC codes into the corresponding character classes, e.g. in perl regex N -> ., Y -> [CT], W -> [AT] ... then use pattern matching. This could probably be made to work for a limited number of mismatches as well. As for the unfolding approach, there may be better ways than writing everything to files. For example, if you're aligning the ambiguous sequences to a common database, you could process them one by one, i.e. unfold then align then move to the next sequence, which is also easy to parallelize.
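
Both ideas can be sketched in Python (used here instead of Perl; the function names are invented for illustration). The first function turns an IUPAC sequence into a regex by replacing each ambiguous code with a character class; N becomes [ACGT] rather than "." so that it cannot match non-nucleotide characters. The second unfolds a sequence lazily, one combination at a time, so nothing has to be stored in bulk.

```python
import re
from itertools import product

# IUPAC codes mapped to the standard bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(seq):
    """Translate an IUPAC sequence into a regex of literal bases and classes."""
    parts = []
    for code in seq.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def expand(seq):
    """Lazily yield every standard sequence an IUPAC sequence can stand for."""
    for combo in product(*(IUPAC[code] for code in seq.upper())):
        yield "".join(combo)


pattern = iupac_to_regex("GCYTN")
print(pattern)                                # GC[CT]T[ACGT]
print(re.search(pattern, "AAGCTTGAA").start())  # 2
print(list(expand("RY")))                     # ['AC', 'AT', 'GC', 'GT']
```

Note that `re.search` and `re.finditer` report exact, non-overlapping matches only; allowing mismatches would need extra logic or a fuzzy-matching library.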

ADD REPLY • link 7.4 years ago by Jean-Karim Heriche 27k

0

Thank you Jean, processing these sequences one by one on the fly is a really good point. But to run blast I still need to write the sequences to a file, because NcbiblastnCommandline doesn't accept a list or generator object. I am not sure how much multi-threading or parallel processing will help, since every time I have to do I/O to write some sequences to a file, blast them, and then collect the results.

ADD REPLY • link 7.4 years ago by dabid • 0

0

The parallelization here consists of processing many sequences simultaneously. As long as you don't overload your filesystem, you can read and write several files in parallel. Exonerate has command line options that make parallelization easy (see here). You need to wrap the calls to your aligner in a script that forks for each group of sequences. This is easily done in Perl, either with the fork() function or with a module.
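
The same pattern can be sketched in Python with a thread pool, as an analogue of the Perl fork() approach rather than a translation of it. Here `align_one` is a hypothetical stand-in that does an exact regex search in memory; in a real script it would write a temporary file and call the aligner through `subprocess`, which is where threads pay off because they mostly wait on the external program.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# IUPAC codes mapped to the standard bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def align_one(primer, reference):
    """Stand-in worker: exact IUPAC-aware search of one primer in the reference."""
    pattern = "".join("[" + IUPAC[code] + "]" for code in primer.upper())
    return primer, [m.start() for m in re.finditer(pattern, reference)]


def align_all(primers, reference, workers=4):
    """Process many primers concurrently and collect hit positions per primer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda p: align_one(p, reference), primers))


hits = align_all(["GCNT", "RRR"], "AAGCTTGAA")
print(hits)
```

Each primer is handled independently, so the number of workers can be tuned to what the filesystem and CPU tolerate.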

ADD REPLY • link 7.4 years ago by Jean-Karim Heriche 27k