Hi,
I have a single fasta file containing 3136 partial 18S rRNA gene sequences (on average 235 nucleotides long and never longer than 260 or shorter than 185 nt) for which I would like to get the top 10 blast hits against the nt database in table format. Preferably I would also like to get their source organism taxonomy (GB field) in the same table, but let's consider that optional for now.
I am contemplating what would be the best strategy for this.
I would prefer not to have to download the entire nr database to my computer.
I am therefore currently considering two strategies:
- Use the NCBI BLAST+ suite as described here: http://www.ncbi.nlm.nih.gov/books/NBK1763/ under "BLAST+ remote blast". The issue is that it is described there for a single sequence submission only; with my number of sequences, a large number of RIDs would be generated automatically, and I do not want to have to format those manually afterwards. I am afraid there is no way to avoid this?
- Alternatively, I could use BioPerl or Biopython to run a remote BLAST loop and try to format the output appropriately.
Which of these two strategies would be most efficient?
Any pointers are warmly welcome...
Kind regards.
FM
edit: the sequences are fungal 18S reads, for which OTU consensus sequences were obtained with mothur. Half of the original dataset could not be classified deeper than "eukaryotes" with SILVA or RDP. Therefore, I am looking for the closest BLAST matches in the nt database. I could perhaps also consider the env_nt database, but first I'd like to check nt.
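For the first strategy in the question, a minimal sketch of how a single remote run over the whole multi-FASTA file might be set up from Python. It assumes BLAST+ is installed and on the PATH, and that `blastn -remote` handles the request IDs itself when given a multi-sequence query. The file names `otus.fasta` and `hits.tsv` are placeholders, and the taxonomy column (`staxids`) may come back as N/A when no local taxonomy data is available.

```python
import subprocess

# Columns for tabular output (-outfmt 6). All are standard BLAST+ format
# specifiers; staxids may print N/A without local taxonomy data.
COLUMNS = "qseqid sseqid pident length evalue bitscore staxids stitle"


def build_remote_blastn_cmd(query_fasta, out_tsv, max_hits=10):
    """Return the argument list for one remote blastn run against nt."""
    return [
        "blastn",
        "-remote",                      # search on the NCBI servers
        "-db", "nt",
        "-query", query_fasta,          # multi-FASTA file with all OTUs
        "-outfmt", f"6 {COLUMNS}",      # tab-separated table
        "-max_target_seqs", str(max_hits),
        "-out", out_tsv,
    ]


def run_remote_blastn(query_fasta, out_tsv, max_hits=10):
    """Run the search; raises CalledProcessError if blastn fails."""
    subprocess.run(build_remote_blastn_cmd(query_fasta, out_tsv, max_hits), check=True)
```

Calling `run_remote_blastn("otus.fasta", "hits.tsv")` would then write one table with up to 10 hits per query. Remote searches with a few thousand queries can be slow, so splitting the input into smaller files may be worth considering.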
Hi, I clearly made a few mistakes in writing down my question. I am actually blasting fungal 18S consensus sequences obtained by denoising and cleaning up raw 454 flowgrams with mothur and getting the representative sequence for each OTU. The issue is that the common databases such as SILVA, GreenGenes and RDP fail to properly classify half of my dataset, so I am looking for the closest BLAST match in the nt database, not the nr (my bad, I must have typed it wrongly).
I am very well aware of the mothur and QIIME guides and prefer mothur myself for analysis. I am also aware that I can use BLAST as an alignment engine in mothur, but not for classification or remote blasting.
But thank you for helping me get my question more accurate!
Ah, ok. With fungal ITS, best-hit BLAST assignment makes sense. Your reference db should then be UNITE. There's a QIIME tutorial for fungal ITS here. Mothur has a dedicated site for UNITE as well.
Thanks for the tip, but my primers don't target ITS; they target 18S (the primers used were: NS1 5’-GTA GTC ATA TGC TTG TCTC and Fung 5’-ATT CCC CGT TAC CCG TTG). Hence I need an SSU reference database such as SILVA, which I already tried and which failed to classify half of my sequences.
If you've already run a typical analysis for your goals and you are seeing less than 50% classification, it might be worth checking that there is no contamination or other problem in your data.
BLAST might be a good start: you could query the nt database and search only against fungal sequences. Remove reads without significant hits (e.g. expect > 1e-10) and run your analysis again to see if the rate improves. It would be better to search against the whole nt database and exclude anything with either non-significant hits against fungi or significant hits against non-fungal species; however, this might be problematic when running BLAST remotely.
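The second rule above (keep a read only if it has a significant fungal hit and no significant non-fungal hit) could be sketched as a post-processing step on tabular BLAST output. This is only an illustration: the `(qseqid, staxid, evalue)` tuple layout and the function name are assumptions, and the set of fungal taxonomy IDs has to be supplied by you (for example, derived from the NCBI taxonomy dump).

```python
def queries_to_keep(hits, fungal_taxids, max_evalue=1e-10):
    """Return the query IDs that look fungal under the rule described above.

    hits          : iterable of (qseqid, staxid, evalue) tuples
    fungal_taxids : set of taxonomy IDs considered fungal (user-supplied)
    max_evalue    : hits with a larger e-value count as non-significant
    """
    sig_fungal = set()   # queries with at least one significant fungal hit
    sig_other = set()    # queries with at least one significant non-fungal hit
    for qseqid, staxid, evalue in hits:
        if evalue > max_evalue:
            continue     # non-significant hit, ignore it
        if staxid in fungal_taxids:
            sig_fungal.add(qseqid)
        else:
            sig_other.add(qseqid)
    return sig_fungal - sig_other


# Illustrative data: taxid 4932 is Saccharomyces cerevisiae, 9606 is human.
example_hits = [
    ("otu1", "4932", 1e-50),   # significant fungal hit only: kept
    ("otu2", "4932", 1e-3),    # fungal hit, but not significant: dropped
    ("otu3", "4932", 1e-40),   # significant fungal hit ...
    ("otu3", "9606", 1e-30),   # ... but also a significant human hit: dropped
]
print(queries_to_keep(example_hits, {"4932"}))
```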
Are you sure you can't run it locally? As long as you have enough RAM to load the database, you should not have any trouble. If you have enough memory, you can also run multiple instances of BLAST simultaneously.
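If you go the route of several simultaneous BLAST instances, the query file first has to be divided up. A minimal sketch of that step, assuming a plain multi-FASTA input; the function names are made up for illustration, and each resulting chunk would be written to its own file and passed to a separate blastn process. (blastn also has a `-num_threads` option, which may be simpler than running separate instances.)

```python
def read_fasta_records(lines):
    """Group FASTA lines into one string per record (header plus sequence)."""
    records, current = [], []
    for line in lines:
        if line.startswith(">") and current:
            # A new header closes the previous record.
            records.append("".join(current))
            current = []
        if line.strip():
            current.append(line if line.endswith("\n") else line + "\n")
    if current:
        records.append("".join(current))
    return records


def split_round_robin(records, n_chunks):
    """Distribute records over n_chunks lists so the chunks are similar in size."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks
```

With 3136 sequences and, say, 4 chunks, each instance would get 784 queries.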
I will investigate this possibility. I have 32 GB of RAM; I hope that will be enough to load the nt database.
How would you encode your conditions (i.e. exclude anything with either non-significant hits against fungi or significant hits against non-fungi species) in a local blastn search syntax-wise?
Thanks in advance.
I forget how large the nt database is; just start a local query against the nt database and see how much memory it uses.
BLAST doesn't offer much in the way of filtering results; I usually handle this later in SQL.
According to the documentation, the tabular and CSV output formats can include taxonomic information for the subject sequence alongside each hit. It looks like you can store the superkingdoms for hits, as well as the scientific names. This should give you an axis to select on; you can then rank by your BLAST hit metric of choice.
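As a rough sketch of the "filter later in SQL" idea, here is one way it might look with Python's built-in sqlite3 module. It assumes the hits were produced with a tabular format that includes the `sskingdoms` and `sscinames` columns, and it needs an SQLite version with window functions (3.25 or later). The table layout and function name are illustrative only. Note that the superkingdom column cannot separate fungi from other eukaryotes, so a finer filter would need the scientific names or taxonomy IDs.

```python
import sqlite3


def top_hits(rows, superkingdom="Eukaryota", n=10):
    """Return the n best hits per query (by bit score) within one superkingdom.

    rows: iterable of (qseqid, sseqid, evalue, bitscore, sskingdoms, sscinames)
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE hits (qseqid TEXT, sseqid TEXT, evalue REAL, "
        "bitscore REAL, sskingdoms TEXT, sscinames TEXT)"
    )
    con.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)", rows)
    query = """
        SELECT qseqid, sseqid, evalue, bitscore, sscinames
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY qseqid ORDER BY bitscore DESC
                   ) AS hit_rank
            FROM hits
            WHERE sskingdoms = ?
        )
        WHERE hit_rank <= ?
        ORDER BY qseqid, hit_rank
    """
    result = con.execute(query, (superkingdom, n)).fetchall()
    con.close()
    return result
```

Swapping `bitscore DESC` for `evalue ASC` would rank by e-value instead, if that is the metric you prefer.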