Question

Fast tool to match longer sequences to shorter ones.

1

Entering edit mode

10.0 years ago

rainfield ▴ 10

I have the following problem in my project:

I have a small fasta file, A, that contains serveral thousands of very short sequences (20nt to 80nt). And a large NGS sequencing file, B, that contains reads of length 100nt to 200nt. I want to do a filteration of the reads file so that the result only contains those reads which CONTAIN one of the sequences in file A. By 'CONTAIN', I mean one or more regions in the read have very high identity with sequences in file A (For example, <=2 mismathces and <=2 indels).

The first thought is to use mapping tools like bowtie to map sequences in file A to file B. The problem is that file B is very large (>10G), thus bowtie-index just takes too much time. Another problem is that it can not handle indels.

I tried BLAST. First I created a db using the smaller file A. Then search reads in B against the db. But this is very slow.

The second try was that creating a db using the read file B. The problem with this is that creating db takes long time. And in my case, file A is fix and file B are different for different runs thus I can not reuse the db.

I also tried BLAT with file A as db, it was quite fast. But it missed a lot of hits.

Any idea which tool/tools I can use for this scenario?

Thanks

sequence • 2.0k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by rainfield ▴ 10

0

Entering edit mode

"bwa index A.fa; bwa mem -t8 -k8 -T15 A.fa B.fa > B.sam". You then need to write a script to filter B.sam. It will take some hours I guess. Yara from seqan might be a better choice in terms of algorithm, but I don't know for sure.

ADD REPLY • link 10.0 years ago by lh3 33k

0

Entering edit mode

What is the homology level of sequences in file A?

ADD REPLY • link 10.0 years ago by mikhail.shugay 3.5k

Ram · Answer 1 · 2014-12-10

I suggest using BBDuk. It does not do alignment, just kmer-matching; however, it will very quickly whittle down your set of reads to those sharing kmers with the sequences in your fasta file, like this:

bbduk.sh -Xmx31g in=reads.fq outm=matched.fq ref=sequences.fa edits=2 k=20

The actual amount of memory it needs depends on the edit distance you allow (exponentially), and the size of your reference (linearly), so -Xmx31g could be revised up or down. But that command will filter and retain only reads sharing at least one 20-mer with your reference set, allowing up to 2 indels or substitutions. It would have higher specificity if your shortest ref sequences were longer than 20bp but what can you do?

Anyway, the output will not exactly fit your requirements of "only reads containing specified sequences with <=2 indels and <=2 mismatches", but it will be a superset of it. That may be adequate for your needs, or you can continue to refine from there, which will be easier because you'll have far fewer reads to work with. And, incidentally, BBMap indexes large references far faster than bowtie, though it uses more memory.

Ram · Answer 2 · 2014-12-10

0

Entering edit mode

10.0 years ago

SES 8.6k

For general matching tasks like this, I always reach for Vmatch. It is quite easy to set length and distance thresholds for matches with this tool, and it is fast enough that you won't worry about having millions of sequences. If the availability of Vmatch is an issue, I would use Tallymer, which is a part of genometools. Tallymer is not state-of-the-art in terms of speed anymore, but it is the only matching tool I know of that allows you to use arbitrary length k-mers for matching. So, Tallymer will work fine with indexing long k-mers (hundreds or thousands of bases in length) where other tools like Jellyfish are only useful for very short k-mers.

Regardless of the approach to take, you will likely have issues indexing a >10 Gb file (at least in terms of speed). I would split the input, and then you could parallelize the indexing/matching.

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by SES 8.6k

0

Entering edit mode

For such tasks, it is usually better to index the small file. Even mapping a 10GB file against human genome does not take long; mapping against 1000x100bp sequences should be much faster and much easier to parallelize.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by lh3 33k

0

Entering edit mode

Good point, that would make the task easier to parallelize and likely much faster.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by SES 8.6k