I have sequences obtained by Sanger sequencing on metagenomic DNA that contain ambiguous bases (due to amplification of two or more highly homologous sequences).
I want to look up for best possible matches from database. BLAST finds matches but treats ambiguous bases as mismatch.
I plan to take the sequences found by BLAST (mySeq -> BLASTn -> all.BLAST.seqs) and align them back to my sequence to find the best possible match/es
taking into account the meaning of ambiguous positions.
I think it can be performed by multiple alignment in R with the DECIPHER
package to set of all possible sequences generated by Disambiguate
.
It would look something like:
library(DECIPHER)
permutations <- Disambiguate(mySeq) #expands ambiguities into all permutations - quite a lot!
all.BLAST.seqs <- OrientNucleotides(all.BLAST.seqs) #to put the seqs found by Blast in the same orientation
aligned <- AlignSeqs(c(all.BLAST.seqs, permutations))
d <- DistanceMatrix(aligned) # calculate a maximum likelihood tree
dend <- IdClusters(d, type="dendrogram", myXStringSet=aligned)
dend <- dendrapply(dend) #to create phylogenetic tree to look for the closest matches to mySeq
However this I believe might not be the best way. Especially as disambiguate
may produce quite a lot of sequences.
Does anybody knows better, faster way to find best matching sequences taking into account information from ambiguous positions?
Like blastn, other software tend to treat ambiguities as mismatches which I think is the conservative way of doing it. Also have a look at exonerate, I don't remember exactly how it treats ambiguity codes but as a last resort you could pass it a custom substitution matrix that includes ambiguity codes (I think the matrices are hard-coded in blast). The disambiguate approach could be viable if you parallelize it.
See some more discussion in this previous post.
Hi! I would not use
Disambiguate
if understand the output of this function correctly. With this approach you may generate a huge number of artificial sequences which does not exist even in nature not only in your samples. E.g. having a sequenceARTY
does not mean that there are mandatory all of four haplotypesAATT
,AATC
,AGTT
andAGTC
in your samples. Moreover, a presence of a such ambiguity bases in sanger sequencing reads may be a sequencing artefacts instead of a true sequence variability. If there are a few ambiguity bases in your data i guess it would not influence much on the results of your study.Hi! I have expected to obtain sequences with ambiguous positions. I have sequenced amplicons obtained on metagenomic DNA from environmental samples. The PCR was with universal primers targeting conserved regions within phage genome of specific type to amplify sequences from various virus species. The sequences, judging on Sanger chromatogram reads, are fine. There are stretches of sequence with single high peaks, and several positions with two peaks of close height. The
disambiguate
produces hundreds to even thousands possible sequences. This makes my approach quite troublesome.