This seems like it should be a common problem yet I can't seem to find anything that exactly fits my needs.
I have 2 sequence(fasta) files. One is an assembly of the other, so I would like to make a trie or suffix tree out of the assembled sequences, and search each for the sub sequences.
For example:
s1 = 'ATTCCG'
s2 = 'ATT'
s3 = 'CCG'
I would like to use s1 as a key in a trie, and search it for substrings s2 and s3.
Here is what I have tried so far:
1. Bio pythons trie and triefind:
This only allows for the whole key to match, or for a prefix to match. So in the above example, I could only successfully search s2 but not s3.
2. Suffix tree for python.
I looked at numbers 2 and 4 from this but neither seemed capable of handling the immense string sizes that I am working with (each string can be up to a few million characters long).
If possible, I would prefer to use the Trie because I have used it on this sized dataset before with success. Is there a way to search for substrings that are not whole matches or prefixes?
If not, what is a suitable suffix tree library that can handle very large strings?
I am using a linux 24 core 128 GB RAM machine.
Thank you
Gusfield's book discusses how to convert a suffix tree into a suffix array, and then search for pattern P in suffix array T in O(n + log m) time. See pgs. 149-155.
Are non compressed suffix arrays sufficiently smaller in size so as to allow genome scale indexing?
A suffix array is just one integer (in a naive implementation, 8 bytes) per genome position. You mentioned to have 128 GB of RAM, so, a human genome should fit.