How to perform alignment of nucleotide sequence containing ambiguous bases
3
0
Entering edit mode
4.5 years ago
mschmidt ▴ 80

I have sequences obtained by Sanger sequencing on metagenomic DNA that contain ambiguous bases (due to amplification of two or more highly homologous sequences). I want to look up for best possible matches from database. BLAST finds matches but treats ambiguous bases as mismatch. I plan to take the sequences found by BLAST (mySeq -> BLASTn -> all.BLAST.seqs) and align them back to my sequence to find the best possible match/es taking into account the meaning of ambiguous positions. I think it can be performed by multiple alignment in R with the DECIPHER package to set of all possible sequences generated by Disambiguate.

It would look something like:

library(DECIPHER) 
permutations <- Disambiguate(mySeq) #expands ambiguities into all permutations - quite a lot! 

all.BLAST.seqs <- OrientNucleotides(all.BLAST.seqs) #to put the seqs found by Blast in the same orientation 

aligned <- AlignSeqs(c(all.BLAST.seqs, permutations)) 

d <- DistanceMatrix(aligned) # calculate a maximum likelihood tree 

dend <- IdClusters(d, type="dendrogram", myXStringSet=aligned) 

dend <- dendrapply(dend) #to create phylogenetic tree to look for the closest matches to mySeq

However this I believe might not be the best way. Especially as disambiguate may produce quite a lot of sequences.

Does anybody knows better, faster way to find best matching sequences taking into account information from ambiguous positions?

R alignment sequence nucleotide • 4.8k views
ADD COMMENT
0
Entering edit mode

Like blastn, other software tend to treat ambiguities as mismatches which I think is the conservative way of doing it. Also have a look at exonerate, I don't remember exactly how it treats ambiguity codes but as a last resort you could pass it a custom substitution matrix that includes ambiguity codes (I think the matrices are hard-coded in blast). The disambiguate approach could be viable if you parallelize it.
See some more discussion in this previous post.

ADD REPLY
0
Entering edit mode

Hi! I would not use Disambiguate if understand the output of this function correctly. With this approach you may generate a huge number of artificial sequences which does not exist even in nature not only in your samples. E.g. having a sequence ARTY does not mean that there are mandatory all of four haplotypes AATT, AATC, AGTT and AGTC in your samples. Moreover, a presence of a such ambiguity bases in sanger sequencing reads may be a sequencing artefacts instead of a true sequence variability. If there are a few ambiguity bases in your data i guess it would not influence much on the results of your study.

ADD REPLY
0
Entering edit mode

Hi! I have expected to obtain sequences with ambiguous positions. I have sequenced amplicons obtained on metagenomic DNA from environmental samples. The PCR was with universal primers targeting conserved regions within phage genome of specific type to amplify sequences from various virus species. The sequences, judging on Sanger chromatogram reads, are fine. There are stretches of sequence with single high peaks, and several positions with two peaks of close height. The disambiguate produces hundreds to even thousands possible sequences. This makes my approach quite troublesome.

ADD REPLY
3
Entering edit mode
4.5 years ago
mschmidt ▴ 80

I have found solution finally. For those with similar problem:

library(Biostrings)
#generate nucleotide substitution matrix for all 15 base letters needs baseOnly = FALSE
nsm <- nucleotideSubstitutionMatrix(match = 5, mismatch = -4, baseOnly = FALSE, type = "DNA")
s1 <- DNAString("ACTTCRCCAGCTCCYTGGCGGTHAGTTGATCAASGGAADCGCAAAGYTTCAAG")
s2 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
pairwiseAlignment(s1, s2, substitutionMatrix = nsm, gapOpening = 0, gapExtension = 1)

Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: ACTTCRCCAGCTCCYTGGCGGTHAGTTGATCAA-SGGAADCGCAAAGYTT-CAAG
subject: ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAA-GGAAACGCAAAGTTTTCAAG
score: 231.5

However, in case anyone can add a comment I appreciate very much any response. Best regards, Marcin

ADD COMMENT
0
Entering edit mode

Hi! I'm wondering how substitution from W to R will be treated in nsm matrix? Will it be a mismatch?

ADD REPLY
1
Entering edit mode
> print(nsm)
      A     C     G     T     M     R     W     S     Y     K     V     H     D
A  5.00 -4.00 -4.00 -4.00  0.50  0.50  0.50 -4.00 -4.00 -4.00 -1.00 -1.00 -1.00
C -4.00  5.00 -4.00 -4.00  0.50 -4.00 -4.00  0.50  0.50 -4.00 -1.00 -1.00 -4.00
G -4.00 -4.00  5.00 -4.00 -4.00  0.50 -4.00  0.50 -4.00  0.50 -1.00 -4.00 -1.00
T -4.00 -4.00 -4.00  5.00 -4.00 -4.00  0.50 -4.00  0.50  0.50 -4.00 -1.00 -1.00
M  0.50  0.50 -4.00 -4.00  0.50 -1.75 -1.75 -1.75 -1.75 -4.00 -1.00 -1.00 -2.50
R  0.50 -4.00  0.50 -4.00 -1.75  0.50 -1.75 -1.75 -4.00 -1.75 -1.00 -2.50 -1.00
W  0.50 -4.00 -4.00  0.50 -1.75 -1.75  0.50 -4.00 -1.75 -1.75 -2.50 -1.00 -1.00
S -4.00  0.50  0.50 -4.00 -1.75 -1.75 -4.00  0.50 -1.75 -1.75 -1.00 -2.50 -2.50
Y -4.00  0.50 -4.00  0.50 -1.75 -4.00 -1.75 -1.75  0.50 -1.75 -2.50 -1.00 -2.50
K -4.00 -4.00  0.50  0.50 -4.00 -1.75 -1.75 -1.75 -1.75  0.50 -2.50 -2.50 -1.00
V -1.00 -1.00 -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -2.50 -2.50 -1.00 -2.00 -2.00
H -1.00 -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -2.50 -1.00 -2.50 -2.00 -1.00 -2.00
D -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -1.00 -2.50 -2.50 -1.00 -2.00 -2.00 -1.00
B -4.00 -1.00 -1.00 -1.00 -2.50 -2.50 -2.50 -1.00 -1.00 -1.00 -2.00 -2.00 -2.00
N -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75
      B     N
A -4.00 -1.75
C -1.00 -1.75
G -1.00 -1.75
T -1.00 -1.75
M -2.50 -1.75
R -2.50 -1.75
W -2.50 -1.75
S -1.00 -1.75
Y -1.00 -1.75
K -1.00 -1.75
V -2.00 -1.75
H -2.00 -1.75
D -2.00 -1.75
B -1.00 -1.75
N -1.75 -1.75
ADD REPLY
2
Entering edit mode
4.5 years ago
mschmidt ▴ 80

For for those who prefer DECIPHER it is also possible using custom nucleotide substitution matrix nsm

st = c("ACGTGTACGTAGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTCGTAGTTCGTATGC",
       "ACGTGTTCGTAGCTGATCGGCGGATGGCACTGATCGTTTGCTAGTCGTAGTTCGTATGC",
       "ACGYGTACGTAGCTGATCGSGGGATGGCCCTGVTCGTATGCTKGTCGTAGTTCGTATGC",
       "ACGTGTACGTTGCTGATCGGGGGATGGCCCTGATCGTTTGCTAGTCGTAGTTTCGTATGC",
       "ACGKGTACGTWGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTCGTAGTTTCGTATGC",
       "ACGTGTACGTAGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTGTAGTTCGTATC")
st = DNAStringSet(st)
aligned <- AlignSeqs(st,
                         guideTree = NULL,
                         iterations = 1,
                         refinements = 1,
                         gapOpening = -1,
                         gapExtension = -2,
                         useStructures = FALSE,
                         FUN = AdjustAlignment,
                         levels = c(0.9, 0.7, 0.7, 0.4, 10, 5, 5, 2),
                         processors = 1,
                         verbose = TRUE,
                         substitutionMatrix = nsm)
BrowseSeqs(aligned, highlight=0)

gives:

                           20                  40                  60        
         '''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|        
    1    ···T··A···A········GG·······C···A····A····A··C····-·······G·    59    
    2    ···T··T···A········GC·······A···A····T····A··C····-·······G·    59    
    3    ···Y··A···A·········G·······C········A····K··C····-·······G·    59    
    4    ···T··A···T········GG·······C···A····T····A··C····T·······G·    60    
    5    ···K··A············GG·······C···A····A····A··C····T·······G·    60    
    6    ···T··A···A········GG·······C···A····A····A··-····-·······-·    57    
ConsensusACGBGTWCGTWGCTGATCGSSGGATGGCMCTGVTCGTWTGCTDGT+GTAG+TTCGTAT+C    60
ADD COMMENT
1
Entering edit mode
4.2 years ago
Biodeer ▴ 40

Hi, a tool named mesquite may help you manually edit the ambiguous bases with alignment sequences. http://mesquiteproject.org/mesquiteArchives/mesquite2.75/Mesquite_Folder/docs/mesquite/Diversification/diversification.html

ADD COMMENT

Login before adding your answer.

Traffic: 2500 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6