Question

Find all occurrences of a fixed sequence in a large genome

1

Entering edit mode

10.4 years ago

qwerty ▴ 50

Hello,

I want to compile a BED file with all genomic regions where I find a given sequence (say, I have a 10-nucleotide sequence, and 10^9 nucleotides in the genome). I have previously tried solving this task with a weight matrix based approach, but it takes too long. So I decided to simplify the task: just a fixed sequence, no mismatches. I am thinking of something like Bowtie, but with all non-unique hits reported. Looks like a pretty simple task. Did someone tried solving this?

Thanks!

next-gen genome alignment Bowtie • 3.6k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by qwerty ▴ 50

1

Entering edit mode

Well, if I got this right, my question is: do you need to do it only once or will you be repeating it over and over again. If only once then probably a simple parser will suffice (you can expect long execution times due to, probably iterative sequence queries - but you can fork that). if not , than if you are looking for exact matches I would suggest partial suffix array + restricted LCP computation (10-mers). If you have enough RAM than the execution, if properly implemented, should take less then 1h on a single cpu (given a 10^9 genome size and 10^9 10-mers). If mismatches are important then bowtie is probably the way to go but I cannot make any time estimates until I have some info on the number of 10-mers and which genome are you trying to map them to. Then again, maybe I completely misunderstood the problem :)

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by mxs ▴ 530

0

Entering edit mode

My estimate of the number of occurrences (without mismatches) is about 300. With two mismatches - about 500. It's human genome.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by qwerty ▴ 50

0

Entering edit mode

by "number of occurrences" you mean the number of times a 10-mer appears in a genome? Hm.. given that 25% of the human genome is un-mappable by 23-mers under the uniform distribution assumption I would expect at least 4 occurrences. so if you are expecting 300-500 that is a really repetitive region (satellites or something). However, my question was for how many of those 10-mers you wish to find the location in the genome. Either way bowtie seams like a logical start: if many 10-mers are to be checked (which is usually the case) or if you need to repeat it many times over.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by mxs ▴ 530

0

Entering edit mode

10.4 years ago

qwerty ▴ 50

I think I would try proceeding with Bowtie first. But it looks that it would require a special index build for this task. Here is what is written in the Bowtie manual:

-a/--all: Report all valid alignments per read or pair (default: off). Validity of alignments is determined by the alignment policy (combined effects of -n^[1], -v^[2], -l^[3], and -e^[4]). If more than one valid alignment exists and the --best^[5] and --strata^[6] options are specified, then only those alignments belonging to the best alignment "stratum" will be reported. Bowtie is designed to be very fast for small -k^[7] but bowtie can become significantly slower if -a/--all^[8] is specified. If you would like to use Bowtie with -a^[9], consider building an index with a denser suffix-array sample, i.e. specify a smaller -o/--offrate^[10] when invoking bowtie-build for the relevant index (see the Performance tuning section for details).

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 10.4 years ago by qwerty ▴ 50

1

Entering edit mode

In the time taken for bowtie to index the genome, you would already have run things in biopython or another program. If you need to do this many times, then yes, bowtie or something like it will be more convenient.

ADD REPLY • link 10.4 years ago by Devon Ryan 105k

0

Entering edit mode

10.4 years ago

James Ashmore ★ 3.5k

The easiest way would be to open the genome of your choice in IGV, then use the search motif tool to find all occurrences, then save the resulting feature track as a BED file.

ADD COMMENT • link 10.4 years ago by James Ashmore ★ 3.5k

score 1 · Accepted Answer · 2015-02-05

1

Entering edit mode

10.4 years ago

Devon Ryan 105k

It'd probably be faster to just use Biopython. Iterate over the entries in a fasta file, and use find iteratively to produce all of the found regions.

ADD COMMENT • link 10.4 years ago by Devon Ryan 105k