Is There A Query Parameter For The Online Blast That Would Filter For Hits That Are Unique To A Specific Taxon?
11.6 years ago
James Ashmore ▴ 100

Hello,

I am running blastn, blastx and tblastx searches against NCBI's nt, est, nr and HTGS databases using transcriptome data containing ~56,000 contigs. I have written Biopython scripts that run these searches and retrieve only the queries with no BLAST hits. Now I would like to retrieve the queries that have hits only in the taxon 'Caudata', i.e. protein-coding transcripts unique to urodeles. Is there a specific boolean query I can put in the entrez_query parameter of the qblast function to do this? Or will I have to do something more intensive, such as running the search separately for each taxon and finding the queries that only have hits in Caudata?

Thanks for any help,

Regards, James

biopython taxonomy blast transcriptome entrez • 8.0k views

Could you clarify whether you are doing this via the standalone legacy BLAST tools (i.e. the blastall binary) or the standalone BLAST+ tools (i.e. the blastn, blastx and tblastx binaries)? If so, has the database been installed locally on your computer, or are you using the -remote option to run the search on the NCBI servers? Thanks!


Hi Peter, I'm doing this via Biopython, using the NCBIWWW.qblast function to run the searches. What I've since discovered is that I'll probably have to retrieve the taxon ID from the GI numbers of the BLAST hits and then script a condition along these lines: if the significant BLAST hits for a query have taxon IDs only from Caudata, keep the query; otherwise remove it. Please also see my reply to Jordan below for a better explanation of my problem. Thanks

11.6 years ago
Jordan ★ 1.3k

I'm not sure I understand the question correctly. But if you are using the online blastn, there is a subsection called "Choose Search Set" with an option for specifying which organism to include in the search. There you can enter the taxon ID of interest.


Hi, thanks for the help, but I feel I may not have expressed the question correctly.

For example, say Contig 1 only has hits in the Caudata taxon because the protein it produces is Caudata-specific, whereas the protein Contig 2 produces can be found in many taxa because it is a universal protein needed for general organism growth. Is there a way to filter the BLAST results to retrieve the queries which only have hits in the Caudata taxon? I am using Biopython to perform the BLAST searches and parse them.

Thanks, James

11.6 years ago
Vitis ★ 2.6k

Before the rollout of the BLAST+ package, I used blastcl3, which is a command-line version of NCBI BLAST. I could set the taxon option using syntax like '-u "Arabidopsis[organism]"' to restrict the search to a certain taxon.

11.6 years ago
Peter 6.0k

Since you are using the QBLAST API, just pass it the Entrez search term, something like this, where the ... stands for whatever arguments you're already using to call QBLAST:

from Bio.Blast.NCBIWWW import qblast
handle = qblast(..., entrez_query='caudata[organism]')

Note that if you need quotes in the Entrez search term, as in "Paramesotriton wulingensis"[organism], you need something like this (Python's flexibility with single and double quotes for string delimiters is useful here):

from Bio.Blast.NCBIWWW import qblast
handle = qblast(..., entrez_query='"Paramesotriton wulingensis"[organism]')

However, if you plan to search over 50,000 contigs you really shouldn't be using the online NCBI BLAST like this, but instead running standalone BLAST+ locally.

P.S. For more fun with Entrez search fields, see http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/


Hi Peter,

The method you suggest would retrieve hits from the Caudata taxon only. However, I want to retrieve hits from all taxa and then filter for queries (not hits) which only have hits in the Caudata taxon, i.e. unique Caudata genes. My goal is to take the salamander transcriptome and see which transcripts are salamander-specific, so I need to BLAST the contigs and retrieve those which only have hits in the Caudata taxon. I hope this clears matters up.

Thank you for your help, James


Oh I see. That will be harder. One way would be to do a full unrestricted BLAST search and then filter the results. With BLAST+ 2.2.28 onwards you can get the taxid in the tabular output (see http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html), assuming the database supports this (NR does), but I'm not sure if the BLAST XML includes it. Otherwise you'd perhaps have to filter on species names in the description, which is unreliable.
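As a rough illustration of filtering on taxids in tabular output, here is a minimal sketch. It assumes the search was run with something like -outfmt "6 qseqid sseqid evalue staxids", so that the query ID is the first column and the taxids (separated by ';' when there are several) are the last column. The function name, the toy rows and the taxid values are all hypothetical placeholders, and the set of taxids belonging to Caudata is assumed to have been collected separately.

```python
import csv
import io
from collections import defaultdict


def taxon_specific_queries(tabular_handle, clade_taxids):
    """Return query IDs whose hits all carry taxids from clade_taxids.

    Assumes tab-separated rows with the query ID in the first column
    and the subject taxid(s) in the last column.
    """
    taxids_by_query = defaultdict(set)
    for row in csv.reader(tabular_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, staxids = row[0], row[-1]
        # One hit can list several taxids separated by ';'
        taxids_by_query[qseqid].update(staxids.split(";"))
    # Keep a query only if every taxid seen is inside the clade
    return sorted(
        query
        for query, taxids in taxids_by_query.items()
        if taxids <= clade_taxids
    )


# Toy rows; the subject IDs and taxids are placeholders, not real records.
example = (
    "contig1\tsubjA\t1e-50\t111\n"
    "contig1\tsubjB\t1e-40\t112\n"
    "contig2\tsubjA\t1e-30\t111\n"
    "contig2\tsubjC\t1e-20\t999\n"
)
specific = taxon_specific_queries(io.StringIO(example), {"111", "112"})
print(specific)
```

A hit with no usable taxid (for example a placeholder value written when the database lacks taxonomy information) fails the subset test, so that query is dropped rather than wrongly kept. Queries with no hits at all never appear in the table and are not reported either.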

Alternatively, it would be simpler to do two BLAST searches, one unrestricted and one against caudata only, and compare the two. This would be a bit slower as you are doing two BLAST searches, but the analysis should be much less complicated.
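The comparison of the two searches could be sketched roughly as follows. This assumes each set of results has already been parsed into a dictionary mapping a query ID to the set of subject IDs it hit; the function name and the toy identifiers are hypothetical.

```python
def queries_hitting_only_clade(unrestricted_hits, clade_hits):
    """Return queries whose unrestricted hits all reappear in the
    clade-restricted search.

    Both arguments map a query ID to the set of subject IDs it hit.
    """
    specific = []
    for query, hits in unrestricted_hits.items():
        # A query with no hits gives no evidence either way, so skip it.
        if hits and hits <= clade_hits.get(query, set()):
            specific.append(query)
    return sorted(specific)


# Toy data with made-up identifiers.
unrestricted = {
    "contig1": {"subjA", "subjB"},
    "contig2": {"subjA", "subjC"},
    "contig3": set(),
}
caudata_restricted = {
    "contig1": {"subjA", "subjB"},
    "contig2": {"subjA"},
}
result = queries_hitting_only_clade(unrestricted, caudata_restricted)
print(result)
```

Because hit-list size limits and E-value thresholds may not behave identically in the two searches, the subset test is best treated as a first pass and borderline queries checked by hand.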


Your second method seems much easier to understand, and I'll have to implement something like that in the future. For now I've ended up writing a script which takes the GI number from the BLAST hit description, converts it to the taxon ID and retrieves the full lineage of the hit. I then check whether the string 'Caudata' is present in the lineage, and if so I append the hit to a list. If the number of hits for the query equals the number of entries in the list, the query is Caudata-specific. Thanks for the help!
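The final check described above might look something like this minimal sketch. It assumes the lineage of every hit has already been retrieved as a '; '-separated string and grouped by query; the names and the abbreviated lineage strings are illustrative only. It uses all() in place of counting matches, and compares whole lineage entries rather than substrings.

```python
def is_clade_specific(lineages, clade="Caudata"):
    """True if there is at least one hit and every hit lineage
    contains the clade as a whole entry (not just a substring)."""
    if not lineages:
        return False  # no hits, so nothing supports specificity
    return all(
        clade in [name.strip() for name in lineage.split(";")]
        for lineage in lineages
    )


# Abbreviated, illustrative lineage strings (not full NCBI lineages).
lineages_by_query = {
    "contig1": [
        "Eukaryota; Amphibia; Caudata; Salamandridae",
        "Eukaryota; Amphibia; Caudata; Ambystomatidae",
    ],
    "contig2": [
        "Eukaryota; Amphibia; Caudata; Salamandridae",
        "Eukaryota; Mammalia; Primates",
    ],
    "contig3": [],
}
caudata_specific = sorted(
    query
    for query, lineages in lineages_by_query.items()
    if is_clade_specific(lineages)
)
print(caudata_specific)
```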
