Comparing millions of trimmed reads to large database
Asked 3.6 years ago by geneticatt ▴ 140

Hi all,

I have a set of reads which I've trimmed down to 21nt based on the sequencing experiment. I'd like to compare these 21nt sequences to a database of 300,000 21nt sequences to annotate each read. I attempted to use bowtie2 by building an index for the database and then mapping the reads, but the mapping rate was lower than expected, suggesting that bowtie2's read-mapping approach isn't well suited to this type of comparison.

Next I tried using Blastn, but it's apparently too slow for this scale of comparison.

Can someone please recommend a tool or approach for making so many exact comparisons?

Thanks

Tags: bowtie2 • blastn
  1. You should try bowtie v1.x; unlike bowtie2, it performs ungapped, end-to-end alignments, which is what you want for short reads such as these.
  2. You may be able to use blat as well.
  3. Using seqkit grep.
  4. Using bbmap.sh with the ambig=all vslow perfectmode maxsites=1000 options (rough command sketches for points 1 and 4 follow after this list).
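A minimal sketch of points 1 and 4, assuming the database is in database.fa and the trimmed reads in reads.fq (placeholder names; adjust thread count and strand handling to your experiment):

    # Point 1: bowtie v1 -- exact (0-mismatch), ungapped, end-to-end alignments
    bowtie-build database.fa db_index
    bowtie -v 0 -a -p 8 -S db_index reads.fq > bowtie_hits.sam
    # add --norc if only the forward strand should be searched

    # Point 4: bbmap.sh in perfect-match mode, keeping all ambiguous hits
    bbmap.sh ref=database.fa in=reads.fq out=bbmap_hits.sam \
        ambig=all vslow perfectmode maxsites=1000 nodisk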
Answered 3.6 years ago by h.mon 35k

With sequences this short (is this microRNA?), I suspect clustering will be more efficient than mapping. First, deduplicate both the query and subject files with, e.g., VSEARCH, CD-HIT, or Dedupe.sh. The same tools can then be used to find the sequences common to both datasets.
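A minimal sketch of that approach with VSEARCH, assuming FASTA inputs named reads.fa and database.fa (placeholder names). The --minseqlength option is lowered here because VSEARCH's default minimum length would otherwise discard 21 nt sequences:

    # Dereplicate both files so each distinct 21 nt sequence is handled once
    vsearch --derep_fulllength reads.fa --minseqlength 21 --sizeout --output reads_uniq.fa
    vsearch --derep_fulllength database.fa --minseqlength 21 --output db_uniq.fa

    # Report full-length, 100%-identity matches of reads against the database
    vsearch --usearch_global reads_uniq.fa --db db_uniq.fa --id 1.0 \
        --minseqlength 21 --strand both --blast6out read_annotations.tsv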
