In R or Python, How do I implement an allele sequence "autocomplete" tool given a dictionary of possible allele sequence matches?
2
1
Entering edit mode
5.7 years ago

In R or Python, how do I implement an allele sequence "autocomplete" tool given a dictionary of possible allele sequence matches?

That is, given, say an allele sequence:

CC_ATCGATCGTA_GATCGGCAA_GTGA

And given a limited number of dictionary values:

Allele 1: CCGATCGATCGTACGATCGGCAAGGTGA
Allele 2: ACTATCTATCGTAAGATCGGCAACGTGG
Allele 3: CCAATCGATCGTACGATCGGCAACGTGA
Allele 4: TCTATCTATCGTAAGATCGGCAACGTGG

The program will return to me Allele 1 and Allele 3 as possible reconstructions of the original allele sequence. Since:

Original:

CC_ATCGATCGTA_GATCGGCAA_GTGA

Matches:

Allele 1: CCGATCGATCGTACGATCGGCAAGGTGA
Allele 3: CCAATCGATCGTACGATCGGCAACGTGA

Ideally, the tool will search through the dictionary values as a tree search rather than exhaustively comparing the incomplete string with each dictionary entry. In other words, upon comparison of the first letter of alleles 2 and 3, they are automatically eliminated from consideration given that the first nucleotide does not match with the original string.

snp sequencing R python assembly • 2.1k views
ADD COMMENT
1
Entering edit mode

I would zip() the strings and iterate over the pairs, break if mismatch, continue if _ and return if reached the end

ADD REPLY
3
Entering edit mode
5.7 years ago

There's a lot of ways to do it, depending how large your dictionary is, or how often/quickly those are queried.

My intuition is that you probably want to use a Trie structure where you would insert all your alleles, and then query your trie once it's built.

I wrote a quick solution using LexPy:

from lexpy.trie import Trie
trie = Trie()
input_alleles = [
    "CCGATCGATCGTACGATCGGCAAGGTGA",
    "ACTATCTATCGTAAGATCGGCAACGTGG",
    "CCAATCGATCGTACGATCGGCAACGTGA",
    "TCTATCTATCGTAAGATCGGCAACGTGG"
]

trie.add_all(input_alleles)

query = "CC_ATCGATCGTA_GATCGGCAA_GTGA"
query_dist = query.count("_")

matches = trie.search_within_distance(query, dist=query_dist)

print matches

Results:

python trie-demo.py
['CCAATCGATCGTACGATCGGCAACGTGA', 'CCGATCGATCGTACGATCGGCAAGGTGA']

Further readings:

https://en.wikipedia.org/wiki/Trie
https://en.wikipedia.org/wiki/Finite-state_machine
https://en.wikipedia.org/wiki/Suffix_tree

ADD COMMENT
0
Entering edit mode

Amazing! I am new to computer programming and learning these things really help me add new skills to my toolbox. So for that, thank you!

ADD REPLY
0
Entering edit mode
5.7 years ago
zx8754 12k

There must be some ready R packages for this specific task, but here is a simple one using agrep (not sure how well this will perform, but works for your example input):

x <- "CC_ATCGATCGTA_GATCGGCAA_GTGA"

y <- c("CCGATCGATCGTACGATCGGCAAGGTGA",
       "ACTATCTATCGTAAGATCGGCAACGTGG",
       "CCAATCGATCGTACGATCGGCAACGTGA",
       "TCTATCTATCGTAAGATCGGCAACGTGG")

y[ agrep(x, y) ]
# [1] "CCGATCGATCGTACGATCGGCAAGGTGA" "CCAATCGATCGTACGATCGGCAACGTGA"
ADD COMMENT

Login before adding your answer.

Traffic: 2107 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6