Dear Biostars folk,
I have been trying to come up with an efficient way to return all the NON-hits from a blastn query. Basically, I need to blast all reads in an Illumina library against one another, and only find the reads in set A that have not hit in B. Although I have a solution, it seems to be really slow, and I wonder if you could help me out. Here's what I did so far:
makeblastdb -in ${B} -dbtype nucl
blastn -query ${A} -db ${B} > blast_results.dat
cat blast_results.dat \
| grep 'No hits' -B 6 \
| grep 'Query' \
| cut -d '=' -f2 \
| cut -d ' ' -f2 \
> unique_read_ids.dat
cat unique_read_ids.dat \
| while read line; do sed -n "/${line}/,/>/p" $query \
| sed '$d'; done \
> result.fn
Now... this totally works, but it takes about a week to finish, although the blastn-part only take 1-2 days. So that last bit, where I fish out all the non-hits, it probably not so efficient.
Is there any way to let blastn return NON-hits, instead of all queries / all hits? If so, how does that work? I've tried setting something up with the e-values, but it always seems to be a minimal value...
Thanks for your time!
Have you considered using tabular format for output instead? It would be much faster to parse. I believe
-outfmt 7
includes sequence ID's that show no hits. Will check on that in a bit. You could also use-outfmt 17
to get SAM formatted alignments and use those to get unmapped reads.Why are you using blast BTW? It may be much faster to do this using a proper NGS aligner and collects reads that don't map.
I am using blast purely because I am reproducing someone's method and trying to compare it to other approaches (like a NGS aligner, as you say). I think outfmt 7 does in fact NOT include missed hits. Not sure about 17 though.
A test blast with different formats. Format 7 does list an empty record (for query
test1
below) when a hit is not found.Outfmt 7
Outfmt 6
Outfmt 17 (does not include sequence when a hit is not found so you would need to find ID's that get excluded)
I am confused as to what outfmt 7 shows... it says 0 hits found, but then it lists 1 hit, which is a perfect hit. Am I missing something?
EDIT: Oh wait, that's two queries. But it doesn't give the sequence when no hit is found, right?
That is correct. It has
0 hits found
.So, do you think the stuff I do with gripping / sed is then the only way forward? (again, given that I have to use blastn... which is unfortunately true for now)
It's very inefficient I think...
Blast was never designed to deal with NGS data. Try switching to the tabular output format. It should be better/faster than parsing regular blast output you have been using so far.
Yeah, but that still doesn't give me the sequence data, which is what I need. In the end, the blastn part is not the bottle neck right now, it's the grepping (which takes 3.5 days...)
Anyway, thanks for your help genomax.
Grepping time should become much shorter, if you have tabular data. Test it out. Find
0 hits found
in the file and go back two lines.Ah, that makes sense. Thank you.
Did you consider alternative approaches? For example, performing kmer set operations will probably be a lot faster than blasting and parsing the output.
A search for kmer set operations will return a good amount of alternatives. I have used KmerCrompressor a couple of times and it was fast and efficient.
I am definitely going to do this, but I'd like to try the blastn-approach nonetheless. It was used before, and like you, I'm not sure it's the best way to go. But in order to show this, I'd have to use blastn to repeat someone else's analysis.
But thanks for the tip, I think that tool sounds like a very good approach going forward =)