How can I summarize my batch BLAST results to check the species distribution within my samples?
6.7 years ago
shane • 0

Hello everyone, I am new here. I am currently learning RNA-seq analysis with some mouse cells, but one of my samples got a very low alignment rate with TopHat. I therefore want to use BLAST on the original FASTA file to check whether there is any contamination from bacteria or other species. However, the original file is huge, so I drew a random sample (100 reads) and uploaded that FASTA file to BLAST.

However, the alignment results from the batch BLAST were scattered, and I don't know how to summarize them for a species distribution analysis. I have looked at local BLAST and other alignment software, but none of them provides taxonomic profiling information. Are there tools that can help me with this problem, or do I need to change my strategy? :)

Thanks for your attention to this problem.
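
For reference, the random subsampling step described above can be done in a single pass without loading the whole file. The following is only a minimal sketch using reservoir sampling; the function names `read_fasta` and `reservoir_sample` are made up for this illustration and are not part of any BLAST tooling.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(rec)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = rec
    return sample
```

With a file handle opened on the reads (the file name is up to you), `reservoir_sample(read_fasta(handle), 100)` would return 100 reads that can be written back out as FASTA and submitted to BLAST.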

alignment taxonomic profiling species blast • 3.0k views

If you have a "species distribution" instead of just mouse, then you may have a much bigger problem on your hands. It may not be wise to use these samples if the predominant fraction of the data (> 95%) is not mouse. If you have a small amount of contamination, you can bin the mouse reads using bbsplit.sh from the BBMap suite, as described here: How to remove contamination from NGS data


First of all, you should not use TopHat anymore for read alignment! These are not my words but those of the developer of TopHat; see also here

6.7 years ago
Shred ★ 1.5k

When working with a huge file, go local. My suggestion is to use RAPSearch2 (https://github.com/zhaoyanswill/RAPSearch2): build a database containing only bacterial sequences (download the archives from NCBI and build the database with prerapsearch, which is included in the RAPSearch package), then search your .fasta against this DB and check which reads were assigned to bacteria or other results.

RAPSearch also supports multithreading, which should make it faster than tools that run on a single thread.

6.7 years ago

To answer your question: as stated by danilo.tatoni, yes, you should go local! I do not, however, agree with the suggestion of downloading a bacteria-only DB. This will seriously bias your analysis and result in lots of false-positive bacterial best-hit matches. What you need is the non-redundant protein DB of NCBI (nrprot). It is the biggest you can get (downloading will take a while ;) ), but that is what you need, as it holds proteins from many different organisms. The best hit against this kind of DB will most certainly point you to the true assignment.

Yes, RAPSearch is an option; others are DIAMOND, PLAST, ... You should consider those, as the normal blastx approach will take a lot of time, especially if you have many query sequences.

What you need to do, from a practical point of view, is run a BLAST (or other) search, get the output in tabular format, take the best hit for each query, and then look up the taxonomic assignment for that hit. I assume that the 'look up the taxonomic assignment' part is the hurdle here. In that case, if you have BLAST+ installed and you also have the taxid DB, you can easily get the taxonomic assignment for a given protein ID, all locally, without needing to query NCBI's taxonomy DB remotely.
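
The "best hit per query, then count by taxon" step can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the search was run with BLAST+ and a tabular format such as `-outfmt "6 qseqid sseqid pident evalue bitscore staxids sscinames"` (the last two columns are only filled in when the taxonomy DB is available locally), and that the best hit is the one with the highest bit score. The function names are made up for this example.

```python
from collections import Counter

# Assumed column order of the tabular output (see the -outfmt string above).
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "bitscore", "staxids", "sscinames"]


def best_hits(lines):
    """Return {query_id: row} keeping the highest-bitscore hit for each query."""
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        query = row["qseqid"]
        score = float(row["bitscore"])
        if query not in best or score > float(best[query]["bitscore"]):
            best[query] = row
    return best


def species_distribution(lines, total_reads=None):
    """Count best hits per scientific name; optionally add reads without any hit."""
    hits = best_hits(lines)
    counts = Counter(row["sscinames"] for row in hits.values())
    if total_reads is not None and total_reads > len(hits):
        counts["no hit"] = total_reads - len(hits)
    return counts
```

Calling `species_distribution(open_handle, total_reads=100).most_common()` on the tabular file would then list the taxa from most to least frequent, which is the summary asked for in the question.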

The approach you are trying is also often called 'taxonomic binning'; perhaps there are other tools available that do the whole analysis in one go?


By tightening the e-value cutoff and considering only high-identity matches, lots of false positives will be cut off. We're talking about prediction, so any software can produce false positives.
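
The filtering suggested above could look something like this. It is a sketch only: it assumes tab-separated hits with percent identity in the third column and e-value in the fourth, and the default thresholds (`1e-5` and `90.0`) are arbitrary illustrative values, not recommendations.

```python
def filter_hits(lines, max_evalue=1e-5, min_pident=90.0):
    """Yield tabular hit lines that pass both the e-value and identity thresholds."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        pident = float(fields[2])   # assumed column 3: percent identity
        evalue = float(fields[3])   # assumed column 4: e-value
        if evalue <= max_evalue and pident >= min_pident:
            yield line
```

Any hit that fails either threshold is dropped before the per-species summary is made.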


True to some extent, but still not a good approach. The key thing is exactly that you have a whole range of taxonomic information in the database; otherwise there is always the chance you will miss the "correct" hit because it might simply not be present in the BLAST DB.

You are biasing your analysis from the start, so how do you expect to get an unbiased result in the end?

Do use the nrprot version and NOT some subset!!
