I have the following problem in my project:
I have a small FASTA file, A, that contains several thousand very short sequences (20 nt to 80 nt), and a large NGS read file, B, that contains reads of 100 nt to 200 nt. I want to filter the read file so that the result contains only those reads which CONTAIN one of the sequences in file A. By 'CONTAIN', I mean that one or more regions of the read have very high identity with a sequence in file A (for example, <=2 mismatches and <=2 indels).
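To pin down what 'CONTAIN' means here, the sketch below computes the smallest edit distance between a short sequence and any substring of a read (semi-global alignment by dynamic programming). It is only an illustration of the definition, not a tool for a >10G file: pure Python would be far too slow. The function names and the single combined budget `max_edits=4` are assumptions; a combined budget is looser than separate limits of 2 mismatches and 2 indels.

```python
def min_edits_within(pattern, text):
    """Smallest edit distance between `pattern` and any substring of `text`."""
    m = len(pattern)
    # prev[i]: cost of aligning pattern[:i] to a substring ending at the
    # previous text position. Before any text is read, that is i deletions.
    prev = list(range(m + 1))
    best = prev[m]
    for ch in text:
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere in text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or mismatch
                prev[i] + 1,         # extra base in the read
                cur[i - 1] + 1,      # base of the pattern missing from the read
            )
        best = min(best, cur[m])
        prev = cur
    return best


def contains_approx(pattern, text, max_edits=4):
    """True if some region of `text` is within `max_edits` of `pattern`."""
    return min_edits_within(pattern, text) <= max_edits
```

For example, `contains_approx("ACGTACGT", "TTTTACGAACGTTTTT", max_edits=2)` is true because the read holds the sequence with a single mismatch.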
My first thought was to use a mapping tool like Bowtie to map the sequences in file A to file B. The problem is that file B is very large (>10G), so building the Bowtie index takes too much time. Another problem is that it cannot handle indels.
I tried BLAST. First I created a database from the smaller file A, then searched the reads in B against it. But this is very slow.
My second attempt was to create the database from the read file B instead. The problem is that creating the database takes a long time. And in my case, file A is fixed while file B differs from run to run, so I cannot reuse the database.
I also tried BLAT with file A as the database. It was quite fast, but it missed a lot of hits.
Any idea which tool/tools I can use for this scenario?
Thanks
"bwa index A.fa; bwa mem -t8 -k8 -T15 A.fa B.fa > B.sam". You then need to write a script to filter B.sam. It will take some hours I guess. Yara from seqan might be a better choice in terms of algorithm, but I don't know for sure.
What is the homology level of sequences in file A?