Hi, I have a large number of 100-mer sequences (let's say billions). I want to query these sequences against the entire human genome and keep only "perfect matches".
I first tried to do this using BLAST+ (megablast). I built a BLAST database and an index from hg19, and ran it with the options below so that only perfect matches are allowed:
blastn -query myseqs.fa -db hg19 -use_index true -index_name hg19_index -word_size 100 -outfmt "7 qacc qstart qend sacc sstart send sstrand" -max_target_seqs 5 -num_threads 4
Here, I set the word size to 100 to achieve my goal, and it does retrieve only perfect matches. The problem is the speed, which is about a million queries per hour. Some might say this is fast enough, but I want it to be faster!
Alternatively, I could use BLAT instead of BLAST, since it is generally considered the faster tool. I have also set up a local BLAT server (gfServer and gfClient), but I am not sure how to set the BLAT parameters to get only perfect matches.
So, what would be the fastest way to retrieve perfect matches in BLAST/BLAT?
fastmap still does more than you need. In principle, something several times faster than fastmap could be written for your task.
Could you give me a few examples? I tried SSAHA but it was much slower than fastmap.
No, ssaha2 won't do. You would need a new but very simple aligner that aligns each read over its full length only.
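To make the idea of "full-length only" alignment concrete, here is a minimal sketch in Python. It is not the tool hinted at above and not how fastmap works internally; it only illustrates the principle that an exact, full-length match reduces to a hash lookup of every reference window of the read length. The function names (`reverse_complement`, `find_exact_matches`) and the dict-based inputs are hypothetical, and a plain in-memory dict would not scale to billions of reads, so a real implementation would need batching or a compact index.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (assumes A/C/G/T input).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_exact_matches(queries, reference, read_len):
    """Report full-length exact matches of reads against a reference.

    queries:   dict mapping read name -> read sequence (hypothetical input shape)
    reference: dict mapping chromosome name -> chromosome sequence
    read_len:  length every read must have (100 in the question)

    Returns a list of (read_name, chromosome, 0-based start, strand).
    """
    # Index both strands of every read so one pass over the genome suffices.
    lookup = defaultdict(list)
    for name, seq in queries.items():
        seq = seq.upper()
        if len(seq) != read_len:
            continue  # only full-length reads are considered
        lookup[seq].append((name, "+"))
        rc = reverse_complement(seq)
        if rc != seq:  # avoid reporting a palindromic read twice
            lookup[rc].append((name, "-"))

    hits = []
    for chrom, chrom_seq in reference.items():
        chrom_seq = chrom_seq.upper()
        # Slide a window of read_len over the chromosome and look it up.
        for start in range(len(chrom_seq) - read_len + 1):
            window = chrom_seq[start:start + read_len]
            for name, strand in lookup.get(window, ()):
                hits.append((name, chrom, start, strand))
    return hits
```

With 100-mers you would call it as `find_exact_matches(reads, genome, 100)`, where `reads` and `genome` are loaded from FASTA by whatever parser you prefer. The single pass over the reference is what makes this cheaper than seed-and-extend: no extension, scoring, or gap handling is ever performed.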