Hi all,
I'm new with Biostars - so, sorry if anything.
I'm also new with local blast and my question might be a bit "stupid" but I've already spent almost two days trying to solve the problem and totally have know idea how to do that... I've already learned BLAST manual pages, googled the subject and read several relevant topics in here and on Seqanswers as well, but haven't got any success.. I would be very appreciative for any help.
So, the subject is: I'm trying to "restrict" nt BLAST database to perform search of my queries (several millions of short reads) against sequences which belong only to certain taxon(s). My pipeline to do that was based on this topic Vertebrate Subset Nr Database? Build My Own? and included such steps:
Getting id of taxon of interest using NCBI Taxonomy Browser. E.g. it was taxon Chlorophyta. According to the browser the taxon has id as 3041 (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=3041&lvl=3&lin=f&keep=1&srchmode=1&unlock)
Getting GI list of genes associated with the taxon using NCBI Nucleotide database. I've performed search with the phrase "txid3041[Orgn]" in Nucleotide part of NCBI. It said "Found 1052995 nucleotide sequences. Nucleotide (427919) EST (569265) GSS (55811)". Ok, I've exported the GI list via "Send To" -> "File" -> "GI List" and got text file Chlorophyta_nucleotide.gi.txt with list of numbers (guess gene IDs (gids)):
916534476 916534474 916534472 916534470 916534468 916534466 ..
Further using blastcmd I've tried to extract from my local nt base sequences matching to the list of gids with the following command:
blastdbcmd -db nt -dbtype 'nucl' -entry_batch ../Chlorophyta_nucleotide.gi.txt -out nt_tst.fa
BUT got a lot of errors like:
Error: XXXXXXXXX: OID not found
Second attempt was with redirecting of errors into separate file
error.log
(-logfile
option)The result is:
427919 - lines in file with gids (Chlorophyta_nucleotide.gi.txt) - it's in agreement with the report of search in Nucleotide database (see above) 287257 - lines in file error.log (all of them looks like "Error: XXXXXXXXX: OID not found") 140662 - lines starting with ">" in output file nt_tst.fa 140662+287257=427919 - seems that far from all entries in list of gids have been found in nt database
After some time with google I've learn that the problem could be connected with preparing of local nt database: if you are using manually prepared nt database (downloading fasta files with nt base followed by preparing of the base with makeblastdb script) it's important to use
-parse_id
option to escape the problem "OID not found" in future.But I downloaded pre-formatted version of nt base from NSBI's ftp. According to their README pre-formatted nt base don't need further preparation before making a search (i.e. it's "ready-to-use"). Anyway to be sure that my nt base in "good shape" I performed simple blastn - it works.
Further check:
blastdbcmd -db nt -info Database: Nucleotide collection (nt) 30,527,720 sequences; 95,485,076,457 total bases Date: Jun 17, 2015 3:03 AM Longest sequence: 774,434,471 bases Volumes: /media/RAID/blastdb/nt.00 /media/RAID/blastdb/nt.01 /media/RAID/blastdb/nt.02 /media/RAID/blastdb/nt.03 .. /media/RAID/blastdb/nt.29
cat /media/RAID/blastdb/nt.nal # # Alias file created 06/17/2015 03:15:04 # TITLE Nucleotide collection (nt) DBLIST "nt.00" "nt.01" "nt.02" "nt.03" "nt.04" "nt.05" "nt.06" "nt.07" "nt.08" "nt.09" "nt.10" "nt.11" "nt.12" "nt.13" "nt.14" "nt.15" "nt.16" "nt.17" "nt.18" "nt.19" "nt.20" "nt.21" "nt.22" "nt.23" "nt.24" "nt.25" "nt.26" "nt.27" "nt.28" "nt.29" NSEQ 30527720 LENGTH 95485076457
As I'm understanding - everything is ok with the database.
Might be "Nucleotide" database is not equal "nt" database? As I'm understanding (based on the descriptions of the databases) nt base should be equal or even bigger than Nucleotide one:
nt (the description obtained via "?" button in web version of blast)
"Title:Nucleotide collection (nt) Description:The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry. Molecule Type:mixed DNA Update date:2015/07/14 Number of sequences:31076527"
Nucleotide (the description copypasted from main page of Nucleotide part of NCBI http://www.ncbi.nlm.nih.gov/nuccore)
The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery."
As for non-redundancy of nt database there is some discrepancy with description of "downloadable" version of nt database on NCBI's ftp:
nt.*tar.gz | Partially non-redundant nucleotide sequences from
all traditional divisions of GenBank, EMBL, and DDBJ
excluding GSS,STS, PAT, EST, HTG, and WGS.
It seems "my" nt base could be non-redundant and according that should not be smaller than Nucleotide one. Consequently, I suppose that nt base should contain all entries from my gid list.
To be sure that everything is ok with my GI list I performed the same steps with another list of gids (taxonomy id: 2 from NCBI's Taxonomy Browser) and got the same result: most gids haven't been found in nt base (""Error: XXXXXXXXX: OID not found").
So, the questions are as following:
- Is everything ok with my local nt base? Should I check something else to be sure?
- Was my "restriction" pipeline wrong in some steps? Do you know more effective way how to perform search for big list of queries against nt-database-sequencies which belong to organisms from certain taxon? Restrictions are: queries are some mix of sequences from a number of non-model organisms.
- Is it actually normal situation with getting the error "OID not found" if the pipeline correct? Should nt database contain all gids from my list(s)?
Would be very appreciative for any help.
Thank you for your attention )
PS:
The overall task is to perform "decontamination" of short-reads array before de-novo genome assembly step. We are currently have no idea about list of "contaminating" organisms, except of consideration that our target organism is eukaryotic and "contaminants" are prokaryotic. The amount of "contaminating" reads is quite big - roughly 30% or even more.. Might be there is some other effective way to perform the "decontamination"?
Thank you for your answer. I saw previously the cookbook page but haven't tried it yet cause of some misunderstanding: for example, if I run the following command
Will I get list of sequences belonging to all organisms in phylum Chlorophyta (including, e.g. organisms in class Chlorodendrophyceae with Taxonomy ID 1524962) or just some subset? Do you know?
Also thank you for advise with software. I've slightly postponed to try other ways for decontamination than blast-way cause of the latter looks more clear and informative for me (who is my "contamination"). But seems you're right and blast-way much more ineffective in terms of speed. I've found the article http://www.nature.com/ismej/journal/vaop/ncurrent/full/ismej2015100a.html and now going to try the software as well as DeconSeq.
The example linked is not appropriate for your use-case, it will not select subtaxa but only the exact taxon. Therefore, you need to generate a list of all relevant subtaxa from the taxonomy and then filter the gi list for those.
You're right. I think the list can be fetched from here. Or with Entrez Direct:
I would say in this case it would be better to:
and then (if enough memory):
and then proceed with
blastdb_aliastool
Thank you. I've used almost the same way with some modifications. Firstly I've manually updated nt db (Aug 12 version). Further:
gi_taxid_nucl.dmp
file (ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid.readme ) which is matching gi with txidto get list of txids of Bacteria taxa I've used the file
categories.dmp
from taxcat dump (ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxcat_readme.txt )to get full list of gis from nt db I've used
Cause of I'm not so cool to use join, small Perl script helped me to extract GIs from
gi_taxid_nucl.dmp
which have taxid presented inBacteria_taxids.txt
and further to check presence of such GIs inGI_list_nt
...Seems should work.. But. The result of (4) was the following:
Meanwhile using Nucleotide db at NCB web page and "txid2[Orgn]" query one could currently extract 19900899 GIs (http://www.ncbi.nlm.nih.gov/nuccore/?term=txid2%5BOrgn%5D ).. Whta is that?
Moreover. My local nt db:
If nt and Nucleotide dbs are almost the same then the dbs consists of more than 50% from bacterial seqs. But guess that can't be true. But even if not, I think that 6 mln of bacterial seqs from >30 mln of all nt's seqs is too small part... Isn't it?
Try it the way I spelled out and see how many sequences you get. Also, nucleotide and nt are not the same. nt is non-redundant to some extent. Also, if you check on the left from your last link, you'll see that Refseq has ca. 4M bacterial sequences. 7M bacterial seqs in nt might be about right..
Also consider the difference in count of bacteria and eukaryota in nuccore, 20M bacterial seqs, 155M eukaryota seqs:
The ratios 7/31 20/155 are not that far off and 155M still excludes archaea, viruses and unclassified. Total seq count in nuccore is right now 198,787,644 (
einfo -db nuccore
)I have use this. And at the "blastdbcmd -db nr -entry all -outfmt "%g %T" > temp.file" I got a file.
What is the problem?