4.2 years ago
endretoth
Hi,
I would like to ask for advice on how to solve the following problem. I would like to BLAST sequences one by one (from a single input FASTA file) and output the best match for each in one FASTA file (NOT a separate FASTA file for each). My FASTA file with the query sequences contains ~500,000 EST sequences, and I would like to BLAST them against the genome of a species. The point is to do it one by one. Please help me with your suggestions.
I have a UNIX server with blastn installed :) Is this possible at all?
Best, Thend
Have you already tried something? This is already how BLAST works... Or perhaps I don't understand the question.
EDIT: Maybe show the command that you used; it is usually easier for the community here to give guidance based on that command.
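For comparison, a plain multi-query search could be assembled roughly like the sketch below. blastn already processes every record in the query FASTA in turn, so no per-sequence loop is needed. The file and database names (`ests.fasta`, `genome_db`, `ests_vs_genome.tsv`), the thread count, and the E-value cutoff are placeholders chosen for illustration, not values from this thread; the flags themselves are standard BLAST+ options.

```python
import shlex


def build_blastn_command(query_fasta, db_name, out_path, threads=4, evalue="1e-5"):
    """Return a blastn argument list for a tabular (outfmt 6) multi-query search.

    All argument values are hypothetical placeholders; adjust them to your data.
    """
    return [
        "blastn",
        "-query", query_fasta,        # FASTA with all query sequences
        "-db", db_name,               # database built beforehand with makeblastdb
        "-out", out_path,             # one output file for all queries
        "-outfmt", "6",               # tab-separated, one line per HSP
        "-evalue", str(evalue),       # assumed cutoff, tune as needed
        "-num_threads", str(threads),
    ]


cmd = build_blastn_command("ests.fasta", "genome_db", "ests_vs_genome.tsv")
print(shlex.join(cmd))
# To actually run the search: subprocess.run(cmd, check=True)
```

Posting the printed command line (or your own equivalent) makes it much easier for others to spot a restrictive option.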
Hi gb,
On several forums I have read that even 4000-5000 query sequences are problematic, mostly because it is a computationally intensive task. My query set is really big and the search will probably take a long time to finish. Recently, I found BLAST-batch-helper (https://github.com/Lafudoci/BLAST-batch-helper), which is meant specifically for large datasets. I'm designing my command now. I will let you know if it works.
Also, I previously had a problem with a regular BLAST search. I tried a large dataset of about 600,000 ESTs and it resulted in 900 sequences. This was really strange because it is an unexpectedly low number. I'm not expecting every EST to have a position somewhere in the genome (the ESTs are from different species), but I expected many more.
Best, Thend
There is no way to get a 1-to-1 correspondence from BLAST in one step. It will return all the HSPs that fit your criteria for every sequence you give it. You will have to post-process the BLAST output file to obtain this.
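One way to do that post-processing is sketched below, assuming the search was run with the default tabular output (`-outfmt 6`), where column 1 is the query ID and column 12 is the bit score. Picking the "best" hit by highest bit score is an assumption; you may prefer E-value or alignment length. The example rows are made up purely to show the format.

```python
import csv
import io


def best_hits(lines):
    """Keep the highest-bit-score row per query from BLAST outfmt 6 lines."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines (outfmt 7 style)
        query, bitscore = row[0], float(row[11])
        if query not in best or bitscore > float(best[query][11]):
            best[query] = row
    return best


# Made-up rows: qseqid sseqid pident length mismatch gapopen
#               qstart qend sstart send evalue bitscore
example = io.StringIO(
    "est1\tchr1\t98.0\t200\t4\t0\t1\t200\t1000\t1199\t1e-90\t350\n"
    "est1\tchr2\t90.0\t150\t15\t0\t1\t150\t500\t649\t1e-40\t180\n"
    "est2\tchr3\t95.0\t100\t5\t0\t1\t100\t42\t141\t1e-30\t150\n"
)
for query, row in best_hits(example).items():
    print(query, row[1], row[11])
```

With a real results file you would pass the open file handle instead of the `io.StringIO` example. The subject coordinates in columns 9 and 10 of each kept row can then be used to pull the matching region out of the genome into a single FASTA file.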
Ah, thanks for the clarification; now people can understand where the question is coming from. It is true that BLAST searches can be computationally intensive, but this mostly has to do with the reference database. If the reference is small, it can go pretty fast. One question that comes to mind: what are those 500,000 sequences? If they are raw reads, you could consider clustering them or doing something like an assembly first.
What does that mean? Did only 900 sequences generate results, or did you get a total of 900 hits for all these queries? If you truly have ESTs and a reference genome, then you should consider using a specialized aligner like GMAP, which is meant for ESTs.
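To tell those two cases apart, something like the following sketch could be run over the results, again assuming tabular output (`-outfmt 6`) with the query ID in the first column. The sample lines are invented for illustration only.

```python
def summarize_hits(lines):
    """Return (total hit lines, number of distinct queries with at least one hit)."""
    total = 0
    queries = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank and comment lines
        total += 1
        queries.add(line.split("\t")[0])  # first column is the query ID
    return total, len(queries)


# Invented sample: three hit lines coming from two different queries.
sample = [
    "est1\tchr1\t98.0\t200\t4\t0\t1\t200\t1000\t1199\t1e-90\t350",
    "est1\tchr2\t90.0\t150\t15\t0\t1\t150\t500\t649\t1e-40\t180",
    "est2\tchr3\t95.0\t100\t5\t0\t1\t100\t42\t141\t1e-30\t150",
]
total_hits, queries_with_hits = summarize_hits(sample)
print(f"{total_hits} hit lines from {queries_with_hits} distinct queries")
```

Comparing the second number with the count of `>` header lines in the query FASTA shows what fraction of the ESTs produced any hit at all.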