In R or Python, how do I implement an allele sequence "autocomplete" tool given a dictionary of possible allele sequence matches?
That is, given an incomplete allele sequence such as this one, where _ marks an unknown nucleotide:
CC_ATCGATCGTA_GATCGGCAA_GTGA
And given a limited number of dictionary values:
Allele 1: CCGATCGATCGTACGATCGGCAAGGTGA
Allele 2: ACTATCTATCGTAAGATCGGCAACGTGG
Allele 3: CCAATCGATCGTACGATCGGCAACGTGA
Allele 4: TCTATCTATCGTAAGATCGGCAACGTGG
The program should return Allele 1 and Allele 3 as possible reconstructions of the original allele sequence, since:
Original:
CC_ATCGATCGTA_GATCGGCAA_GTGA
Matches:
Allele 1: CCGATCGATCGTACGATCGGCAAGGTGA
Allele 3: CCAATCGATCGTACGATCGGCAACGTGA
Ideally, the tool will search through the dictionary values as a tree search rather than exhaustively comparing the incomplete string with each dictionary entry. In other words, upon comparison of the first letter, alleles 2 and 4 are automatically eliminated from consideration, since their first nucleotide (A and T, respectively) does not match the C at the start of the original string.
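One way to get that tree-search behaviour is a trie (prefix tree) built from the dictionary. The following is only a rough sketch, not a tuned implementation: the names build_trie and search, the "$names" marker key, and the nested-dict layout are all my own choices for illustration. A known nucleotide follows a single branch, so a whole subtree of alleles is discarded at once, while _ fans out into every child branch.

```python
def build_trie(alleles):
    """Build a nested-dict trie; each leaf stores allele names under '$names'."""
    root = {}
    for name, seq in alleles.items():
        node = root
        for base in seq:
            node = node.setdefault(base, {})
        node.setdefault("$names", []).append(name)
    return root


def search(node, pattern, i=0, wildcard="_"):
    """Return names of alleles matching pattern, where wildcard matches any base."""
    if i == len(pattern):
        # Only alleles that end exactly here have the same length as the pattern.
        return list(node.get("$names", []))
    ch = pattern[i]
    if ch == wildcard:
        # Unknown nucleotide: explore every child branch.
        branches = [child for key, child in node.items() if key != "$names"]
    else:
        # Known nucleotide: follow one branch, or prune if it does not exist.
        branches = [node[ch]] if ch in node else []
    found = []
    for child in branches:
        found.extend(search(child, pattern, i + 1, wildcard))
    return found


alleles = {
    "Allele 1": "CCGATCGATCGTACGATCGGCAAGGTGA",
    "Allele 2": "ACTATCTATCGTAAGATCGGCAACGTGG",
    "Allele 3": "CCAATCGATCGTACGATCGGCAACGTGA",
    "Allele 4": "TCTATCTATCGTAAGATCGGCAACGTGG",
}
trie = build_trie(alleles)
trie_hits = sorted(search(trie, "CC_ATCGATCGTA_GATCGGCAA_GTGA"))
print(trie_hits)  # ['Allele 1', 'Allele 3']
```

The search is recursive, so very long sequences could hit Python's recursion limit; an explicit stack would avoid that. For a dictionary of only a few alleles the trie is unlikely to be faster than a plain loop, so it mostly pays off when the dictionary is large and queried many times.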
I would zip() the strings and iterate over the pairs: break on a mismatch, continue on _, and return the allele if you reach the end.
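Roughly like this, as a minimal sketch of the zip() idea. The function name matches and the explicit length check are my additions (zip() silently stops at the shorter string, so without the check a truncated pattern would match too), and I am assuming _ is the only wildcard character.

```python
def matches(pattern, allele, wildcard="_"):
    """Return True if allele agrees with pattern at every non-wildcard position."""
    if len(pattern) != len(allele):
        return False
    for p, a in zip(pattern, allele):
        if p == wildcard:
            continue          # unknown nucleotide, anything is allowed
        if p != a:
            return False      # first mismatch ends the comparison early
    return True               # reached the end without a mismatch


alleles = {
    "Allele 1": "CCGATCGATCGTACGATCGGCAAGGTGA",
    "Allele 2": "ACTATCTATCGTAAGATCGGCAACGTGG",
    "Allele 3": "CCAATCGATCGTACGATCGGCAACGTGA",
    "Allele 4": "TCTATCTATCGTAAGATCGGCAACGTGG",
}
pattern = "CC_ATCGATCGTA_GATCGGCAA_GTGA"
hits = [name for name, seq in alleles.items() if matches(pattern, seq)]
print(hits)  # ['Allele 1', 'Allele 3']
```

This still visits every dictionary entry, but alleles 2 and 4 are rejected after a single character comparison, so in practice the exhaustive loop is cheap.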