Question

How to perform alignment of nucleotide sequence containing ambiguous bases

0

Entering edit mode

5.0 years ago

mschmidt ▴ 80

I have sequences obtained by Sanger sequencing on metagenomic DNA that contain ambiguous bases (due to amplification of two or more highly homologous sequences). I want to look up for best possible matches from database. BLAST finds matches but treats ambiguous bases as mismatch. I plan to take the sequences found by BLAST (mySeq -> BLASTn -> all.BLAST.seqs) and align them back to my sequence to find the best possible match/es taking into account the meaning of ambiguous positions. I think it can be performed by multiple alignment in R with the DECIPHER package to set of all possible sequences generated by Disambiguate.

It would look something like:

library(DECIPHER) 
permutations <- Disambiguate(mySeq) #expands ambiguities into all permutations - quite a lot! 

all.BLAST.seqs <- OrientNucleotides(all.BLAST.seqs) #to put the seqs found by Blast in the same orientation 

aligned <- AlignSeqs(c(all.BLAST.seqs, permutations)) 

d <- DistanceMatrix(aligned) # calculate a maximum likelihood tree 

dend <- IdClusters(d, type="dendrogram", myXStringSet=aligned) 

dend <- dendrapply(dend) #to create phylogenetic tree to look for the closest matches to mySeq

However this I believe might not be the best way. Especially as disambiguate may produce quite a lot of sequences.

Does anybody knows better, faster way to find best matching sequences taking into account information from ambiguous positions?

R alignment sequence nucleotide • 5.5k views

ADD COMMENT • link updated 4.7 years ago by Biodeer ▴ 50 • written 5.0 years ago by mschmidt ▴ 80

0

Entering edit mode

Like blastn, other software tend to treat ambiguities as mismatches which I think is the conservative way of doing it. Also have a look at exonerate, I don't remember exactly how it treats ambiguity codes but as a last resort you could pass it a custom substitution matrix that includes ambiguity codes (I think the matrices are hard-coded in blast). The disambiguate approach could be viable if you parallelize it.
See some more discussion in this previous post.

ADD REPLY • link 5.0 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Hi! I would not use Disambiguate if understand the output of this function correctly. With this approach you may generate a huge number of artificial sequences which does not exist even in nature not only in your samples. E.g. having a sequence ARTY does not mean that there are mandatory all of four haplotypes AATT, AATC, AGTT and AGTC in your samples. Moreover, a presence of a such ambiguity bases in sanger sequencing reads may be a sequencing artefacts instead of a true sequence variability. If there are a few ambiguity bases in your data i guess it would not influence much on the results of your study.

ADD REPLY • link 5.0 years ago by Denis ▴ 310

0

Entering edit mode

Hi! I have expected to obtain sequences with ambiguous positions. I have sequenced amplicons obtained on metagenomic DNA from environmental samples. The PCR was with universal primers targeting conserved regions within phage genome of specific type to amplify sequences from various virus species. The sequences, judging on Sanger chromatogram reads, are fine. There are stretches of sequence with single high peaks, and several positions with two peaks of close height. The disambiguate produces hundreds to even thousands possible sequences. This makes my approach quite troublesome.

ADD REPLY • link 5.0 years ago by mschmidt ▴ 80

score 3 · Answer 1 · 2020-05-23

3

Entering edit mode

5.0 years ago

mschmidt ▴ 80

I have found solution finally. For those with similar problem:

library(Biostrings)
#generate nucleotide substitution matrix for all 15 base letters needs baseOnly = FALSE
nsm <- nucleotideSubstitutionMatrix(match = 5, mismatch = -4, baseOnly = FALSE, type = "DNA")
s1 <- DNAString("ACTTCRCCAGCTCCYTGGCGGTHAGTTGATCAASGGAADCGCAAAGYTTCAAG")
s2 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
pairwiseAlignment(s1, s2, substitutionMatrix = nsm, gapOpening = 0, gapExtension = 1)

Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: ACTTCRCCAGCTCCYTGGCGGTHAGTTGATCAA-SGGAADCGCAAAGYTT-CAAG
subject: ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAA-GGAAACGCAAAGTTTTCAAG
score: 231.5

However, in case anyone can add a comment I appreciate very much any response. Best regards, Marcin

ADD COMMENT • link 5.0 years ago by mschmidt ▴ 80

0

Entering edit mode

Hi! I'm wondering how substitution from W to R will be treated in nsm matrix? Will it be a mismatch?

ADD REPLY • link 5.0 years ago by Denis ▴ 310

1

Entering edit mode

> print(nsm)
      A     C     G     T     M     R     W     S     Y     K     V     H     D
A  5.00 -4.00 -4.00 -4.00  0.50  0.50  0.50 -4.00 -4.00 -4.00 -1.00 -1.00 -1.00
C -4.00  5.00 -4.00 -4.00  0.50 -4.00 -4.00  0.50  0.50 -4.00 -1.00 -1.00 -4.00
G -4.00 -4.00  5.00 -4.00 -4.00  0.50 -4.00  0.50 -4.00  0.50 -1.00 -4.00 -1.00
T -4.00 -4.00 -4.00  5.00 -4.00 -4.00  0.50 -4.00  0.50  0.50 -4.00 -1.00 -1.00
M  0.50  0.50 -4.00 -4.00  0.50 -1.75 -1.75 -1.75 -1.75 -4.00 -1.00 -1.00 -2.50
R  0.50 -4.00  0.50 -4.00 -1.75  0.50 -1.75 -1.75 -4.00 -1.75 -1.00 -2.50 -1.00
W  0.50 -4.00 -4.00  0.50 -1.75 -1.75  0.50 -4.00 -1.75 -1.75 -2.50 -1.00 -1.00
S -4.00  0.50  0.50 -4.00 -1.75 -1.75 -4.00  0.50 -1.75 -1.75 -1.00 -2.50 -2.50
Y -4.00  0.50 -4.00  0.50 -1.75 -4.00 -1.75 -1.75  0.50 -1.75 -2.50 -1.00 -2.50
K -4.00 -4.00  0.50  0.50 -4.00 -1.75 -1.75 -1.75 -1.75  0.50 -2.50 -2.50 -1.00
V -1.00 -1.00 -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -2.50 -2.50 -1.00 -2.00 -2.00
H -1.00 -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -2.50 -1.00 -2.50 -2.00 -1.00 -2.00
D -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -1.00 -2.50 -2.50 -1.00 -2.00 -2.00 -1.00
B -4.00 -1.00 -1.00 -1.00 -2.50 -2.50 -2.50 -1.00 -1.00 -1.00 -2.00 -2.00 -2.00
N -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75
      B     N
A -4.00 -1.75
C -1.00 -1.75
G -1.00 -1.75
T -1.00 -1.75
M -2.50 -1.75
R -2.50 -1.75
W -2.50 -1.75
S -1.00 -1.75
Y -1.00 -1.75
K -1.00 -1.75
V -2.00 -1.75
H -2.00 -1.75
D -2.00 -1.75
B -1.00 -1.75
N -1.75 -1.75

ADD REPLY • link 5.0 years ago by mschmidt ▴ 80

score 2 · Answer 2 · 2020-05-26

For for those who prefer DECIPHER it is also possible using custom nucleotide substitution matrix nsm

st = c("ACGTGTACGTAGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTCGTAGTTCGTATGC",
       "ACGTGTTCGTAGCTGATCGGCGGATGGCACTGATCGTTTGCTAGTCGTAGTTCGTATGC",
       "ACGYGTACGTAGCTGATCGSGGGATGGCCCTGVTCGTATGCTKGTCGTAGTTCGTATGC",
       "ACGTGTACGTTGCTGATCGGGGGATGGCCCTGATCGTTTGCTAGTCGTAGTTTCGTATGC",
       "ACGKGTACGTWGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTCGTAGTTTCGTATGC",
       "ACGTGTACGTAGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTGTAGTTCGTATC")
st = DNAStringSet(st)
aligned <- AlignSeqs(st,
                         guideTree = NULL,
                         iterations = 1,
                         refinements = 1,
                         gapOpening = -1,
                         gapExtension = -2,
                         useStructures = FALSE,
                         FUN = AdjustAlignment,
                         levels = c(0.9, 0.7, 0.7, 0.4, 10, 5, 5, 2),
                         processors = 1,
                         verbose = TRUE,
                         substitutionMatrix = nsm)
BrowseSeqs(aligned, highlight=0)

gives:

                           20                  40                  60        
         '''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|        
    1    ···T··A···A········GG·······C···A····A····A··C····-·······G·    59    
    2    ···T··T···A········GC·······A···A····T····A··C····-·······G·    59    
    3    ···Y··A···A·········G·······C········A····K··C····-·······G·    59    
    4    ···T··A···T········GG·······C···A····T····A··C····T·······G·    60    
    5    ···K··A············GG·······C···A····A····A··C····T·······G·    60    
    6    ···T··A···A········GG·······C···A····A····A··-····-·······-·    57    
ConsensusACGBGTWCGTWGCTGATCGSSGGATGGCMCTGVTCGTWTGCTDGT+GTAG+TTCGTAT+C    60

score 1 · Answer 3 · 2020-09-11

1

Entering edit mode

4.7 years ago

Biodeer ▴ 50

Hi, a tool named mesquite may help you manually edit the ambiguous bases with alignment sequences. http://mesquiteproject.org/mesquiteArchives/mesquite2.75/Mesquite_Folder/docs/mesquite/Diversification/diversification.html

ADD COMMENT • link 4.7 years ago by Biodeer ▴ 50