Question

How can I summarize my batch BLAST results to check the species distribution within my samples?

7.3 years ago

shane • 0

Hello everyone, I am new here. I have recently been learning RNA-seq analysis with some mouse cells, but one of my samples got a very low alignment rate with tophat. I therefore want to use BLAST on the original fasta file to check whether there is any contamination from bacteria or other species. The original file is huge, so I took a random sample (100 reads) and uploaded that fasta file to BLAST.

However, the alignment results from the batch BLAST were scattered, and I don't know how to summarize them for a species distribution analysis. I have looked at local BLAST and other alignment software, but none of them seem to provide taxonomic profiling information. Are there tools that can help me with this problem, or do I need to change my strategy? :)

Thanks for your attention to this problem.

alignment taxonomic profiling species blast • 3.6k views

updated 7.3 years ago by lieven.sterck 15k • written 7.3 years ago by shane • 0

If you have a "species distribution" instead of just mouse, then you may have a much bigger problem on your hands. It may not be wise to use these samples if the predominant data (> 95%) is not mouse. If you have only a small amount of contamination, you can bin the mouse reads using bbsplit.sh from the BBMap suite, as described in: How to remove contamination from NGS data

7.3 years ago by GenoMax 152k

First of all, you should not use tophat for read alignment anymore! Those are not my words but those of the developer of tophat; see also here

7.3 years ago by lieven.sterck 15k

score 0 · Answer 1 · 2018-03-15

When working with a huge file, go local. My suggestion is to use RAPSearch2 (https://github.com/zhaoyanswill/RAPSearch2): build a database containing only bacterial sequences (download the archives from NCBI and build the database with prerapsearch, which is included in the RAPSearch package), then search your .fasta against this DB and check which reads were assigned to bacteria and which got other results.

RAPSearch also supports multithreading, so it will be faster than other tools.

score 0 · Answer 2 · 2018-03-15

To answer your question: as stated by danilo.tatoni, yes, you should go local! I do not, however, agree with the suggestion of downloading a bacteria-only DB. This will seriously bias your analysis and result in lots of false-positive bacterial best-hit matches, because every read that matches anything at all can only be assigned to a bacterium. What you need is the non-redundant DB of NCBI (nrprot). It is the biggest you can get (downloading will take a while ;) ), but that is what you need, as it holds proteins from many different organisms. The best hit against this kind of DB will most certainly point you to the true assignment.

Yes, RAPSearch is an option; others are DIAMOND, PLAST, ... You should consider those, as the normal blastx approach will take a lot of time, especially if you have many query sequences.

What you need to do, from a practical point of view, is to run a BLAST (or other) search, get the output in tabular format, take the best hit for each query, and then look up the taxonomic assignment for that hit. I assume that the 'look up the taxonomic assignment' part is the hurdle here. In that case, if you have BLAST+ installed and you also have the taxid-db, you can easily get the taxonomic assignment for a given protein ID, all locally, without needing to query NCBI's taxonomy DB remotely.
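To make the "best hit per query, then tally by taxon" step concrete, here is a minimal Python sketch. It is only an illustration under some assumptions: that the search was run with BLAST+ using a custom tabular format such as `-outfmt "6 qseqid sseqid pident evalue bitscore staxids sscinames"` (the last two columns are only filled in when the taxid-db is available locally), and that "best hit" means the highest bitscore. The function names `best_hits` and `species_counts`, and the example rows, are made up for illustration.

```python
import io
from collections import Counter

# Assumed column order, matching a BLAST+ run with:
#   -outfmt "6 qseqid sseqid pident evalue bitscore staxids sscinames"
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "bitscore", "staxids", "sscinames"]


def best_hits(handle):
    """Return {query_id: row}, keeping the highest-bitscore hit for each query."""
    best = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


def species_counts(best):
    """Count how many queries have their best hit in each species."""
    counts = Counter()
    for row in best.values():
        # A subject can map to several taxa, separated by ';'. Take the first one.
        name = row.get("sscinames", "").split(";")[0].strip()
        if name in ("", "N/A"):
            name = "unknown"  # no taxonomy information was reported for this hit
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical example rows; in practice, open the real BLAST output file instead.
    example = io.StringIO(
        "read1\tprotA\t99.0\t1e-50\t200\t10090\tMus musculus\n"
        "read1\tprotB\t90.0\t1e-10\t80\t562\tEscherichia coli\n"
        "read2\tprotC\t95.0\t1e-30\t150\t562\tEscherichia coli\n"
        "read3\tprotD\t98.0\t1e-40\t180\t10090\tMus musculus\n"
    )
    counts = species_counts(best_hits(example))
    total = sum(counts.values())
    for name, n in counts.most_common():
        print(f"{name}\t{n}\t{100 * n / total:.1f}%")
```

Reads without any hit do not appear in tabular output at all, so the percentages here are relative to reads with at least one hit; comparing the total against the number of reads submitted gives the unassigned fraction.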

The approach you are attempting is often called 'taxonomic binning'; perhaps there are other tools available that do the whole analysis in one go?