Question

Find Substring of Trie Keys

0

Entering edit mode

9.8 years ago

Michael Alonge • 0

This seems like it should be a common problem yet I can't seem to find anything that exactly fits my needs.

I have 2 sequence(fasta) files. One is an assembly of the other, so I would like to make a trie or suffix tree out of the assembled sequences, and search each for the sub sequences.

For example:

s1 = 'ATTCCG'
s2 = 'ATT'
s3 = 'CCG'

I would like to use s1 as a key in a trie, and search it for substrings s2 and s3.

Here is what I have tried so far:

1. Bio pythons trie and triefind:

This only allows for the whole key to match, or for a prefix to match. So in the above example, I could only successfully search s2 but not s3.

2. Suffix tree for python.

I looked at numbers 2 and 4 from this but neither seemed capable of handling the immense string sizes that I am working with (each string can be up to a few million characters long).

If possible, I would prefer to use the Trie because I have used it on this sized dataset before with success. Is there a way to search for substrings that are not whole matches or prefixes?

If not, what is a suitable suffix tree library that can handle very large strings?

I am using a linux 24 core 128 GB RAM machine.

Thank you

Assembly • 2.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 9.8 years ago by Michael Alonge • 0

Ram · Answer 1 · 2015-06-19

2

Entering edit mode

9.8 years ago

Rayan Chikhi ★ 1.6k

You could start by looking at suffix arrays.. I'd recommend you take a look at wikipedia or one of the classical books for string algorithms or bioinformatics ("Algorithms on Strings, Trees and Sequences" by Gusfield, or, http://genome-scale.info is another recent one)

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 9.8 years ago by Rayan Chikhi ★ 1.6k

0

Entering edit mode

Gusfield's book discusses how to convert a suffix tree into a suffix array, and then search for pattern P in suffix array T in O(n + log m) time. See pgs. 149-155.

ADD REPLY • link 9.8 years ago by Alex Reynolds 36k

0

Entering edit mode

Are non compressed suffix arrays sufficiently smaller in size so as to allow genome scale indexing?

ADD REPLY • link 9.8 years ago by Michael Alonge • 0

0

Entering edit mode

A suffix array is just one integer (in a naive implementation, 8 bytes) per genome position. You mentioned to have 128 GB of RAM, so, a human genome should fit.

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 9.8 years ago by Rayan Chikhi ★ 1.6k