Question

Why searching using suffix array is so slow in human genome?

0

Entering edit mode

9.3 years ago

chhylp123 • 0

Hi all,

I am trying to locate patterns using suffix array. Thus, I have implemented a plaint suffix array, which searches patterns exploiting binary search. In small reference genomes like caenorhabditis elegans, it ran fast. However, it was extremely slow in human genome. Searching a pattern of length 20 needs more than 1 minutes.

I know binary search in large genomes will incur more cache misses, but I think one minute per pattern is abnormal. Please help me to solve this problem, or introduce an implement of suffix array that is suited for large genome.

Thanks

genome sequence software-error alignment • 1.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.3 years ago by chhylp123 • 0

Ram · Answer 1 · 2015-09-20

We can't know specifically how a custom algorithm is working. You could have any kind of bug.

But There's also good reason for this kind of search to be slow on the human genome. So maybe everything is working correctly.

The genome is very large, It should be slower than a small genome. You could test how the speed scales with this parameter and see it is non-linear.

The human genome also contains a lot of 'pathological' sequence for this task. Repetitive sections will cause your index to have unbalanced properties. Consider a centromere segment of two hundred or more "A" bases in a row. How does your algorithm handle that?

You might want to operate on a 'repeat masked' version rather than the whole build.