Hello,
I want to compile a BED file with all genomic regions where I find a given sequence (say, I have a 10-nucleotide sequence, and 10^9 nucleotides in the genome). I have previously tried solving this task with a weight matrix based approach, but it takes too long. So I decided to simplify the task: just a fixed sequence, no mismatches. I am thinking of something like Bowtie, but with all non-unique hits reported. Looks like a pretty simple task. Did someone tried solving this?
Thanks!
Well, if I got this right, my question is: do you need to do it only once or will you be repeating it over and over again. If only once then probably a simple parser will suffice (you can expect long execution times due to, probably iterative sequence queries - but you can fork that). if not , than if you are looking for exact matches I would suggest partial suffix array + restricted LCP computation (10-mers). If you have enough RAM than the execution, if properly implemented, should take less then 1h on a single cpu (given a 10^9 genome size and 10^9 10-mers). If mismatches are important then bowtie is probably the way to go but I cannot make any time estimates until I have some info on the number of 10-mers and which genome are you trying to map them to. Then again, maybe I completely misunderstood the problem :)
My estimate of the number of occurrences (without mismatches) is about 300. With two mismatches - about 500. It's human genome.
by "number of occurrences" you mean the number of times a 10-mer appears in a genome? Hm.. given that 25% of the human genome is un-mappable by 23-mers under the uniform distribution assumption I would expect at least 4 occurrences. so if you are expecting 300-500 that is a really repetitive region (satellites or something). However, my question was for how many of those 10-mers you wish to find the location in the genome. Either way bowtie seams like a logical start: if many 10-mers are to be checked (which is usually the case) or if you need to repeat it many times over.