Pathogen protein sequence alignment
3.1 years ago

How do I filter by taxonomy (e.g. taxid = 2 for Bacteria) using NcbiblastpCommandline in Biopython? Is there a way to do it without downloading the nr dataset and filtering it manually? Are there downloadable protein datasets that contain only bacteria and viruses? Thanks!

nr ncbi biopython protein • 1.3k views

What do you mean by filtering by taxonomy?

If you are OK with getting bacterial and viral genomic proteins, then you can use the answers in this thread: How to download all Pseudomonas aeruginosa Genomes from NCBI Genomes database?

To get a non-redundant list, you will need to download the nr database and then create a FASTA file for just bacteria: How to best get ALL Bacterial proteins from NCBI


Hey GenoMax, thanks for the reply. My plan is to restrict my blastp sequence alignment to a bacteria-only database. I have already worked with NcbiblastpCommandline(...) in Biopython (for the first time), but the problem with this API is that it searches all proteins in the nr database. Is there a parameter for restricting the search to bacteria only, e.g. taxids = 2 or something similar?

I've thought of doing what you suggested, i.e. downloading the nr database and creating a FASTA file for just bacteria. My concern is that it's a 45 GB file, and I may not have the resources to download all of it to my local computer. So I'm hoping that NcbiblastpCommandline(...) can do the work for me. Do you have any suggestions or a workaround for this? Thank you!


The latest version of BLAST+ lets you limit a search to specific taxIDs easily. Unfortunately, you can't limit the search at the level of taxID 2; you will need a list of taxIDs at the genus/species level for this to work. You also can't use this limit if you are using -remote to run the search at NCBI. So your only option may be to include the taxIDs of the hits in your results and filter them afterwards.
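
Both routes described above can be sketched in Python. This is a minimal sketch under a few assumptions, not a tested pipeline: it assumes a local BLAST+ installation recent enough to support the -taxidlist option and a database built with taxonomy information, and it assumes the tabular output format (-outfmt 6) with staxids as the last column, where multiple taxIDs for one hit are separated by semicolons. The file names, the helper names build_blastp_command and filter_hits_by_taxid, and the accessions in SAMPLE are made up for illustration. The command is built as a plain argument list so it can be passed to subprocess.run or mirrored as keyword arguments to NcbiblastpCommandline.

```python
def build_blastp_command(query, db, taxid_file, out_path):
    """Return a blastp argument list restricted to the taxIDs in taxid_file.

    taxid_file is assumed to hold one taxID per line.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-taxidlist", taxid_file,
        # Tabular output; staxids is deliberately the last column.
        "-outfmt", "6 qseqid sseqid pident evalue bitscore staxids",
        "-out", out_path,
    ]


def filter_hits_by_taxid(tabular_text, allowed_taxids):
    """Keep rows whose last column (staxids) shares a taxID with allowed_taxids."""
    allowed = {str(t) for t in allowed_taxids}
    kept = []
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # A single hit can carry several taxIDs separated by ';'.
        hit_taxids = set(fields[-1].split(";"))
        if hit_taxids & allowed:
            kept.append(fields)
    return kept


# Made-up rows in the column order requested above (illustration only).
SAMPLE = (
    "q1\tsubjA\t98.5\t1e-50\t200\t562\n"
    "q1\tsubjB\t40.0\t1e-05\t50\t9606\n"
    "q1\tsubjC\t90.0\t1e-40\t180\t287;1280\n"
)

if __name__ == "__main__":
    print(" ".join(build_blastp_command("query.faa", "nr", "bacteria.taxids", "hits.tsv")))
    for row in filter_hits_by_taxid(SAMPLE, {562, 287}):
        print(row[1], row[-1])
```

The post-hoc filter is the fallback for -remote searches: request staxids in the output, then keep only the rows whose taxIDs appear in your bacterial list.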


I see. This makes sense to me. Thanks a lot! :D


You can obtain descendant taxIDs using ETE3 and then use these as inclusion/exclusion criteria, as genomax suggested.

See here for example: https://github.com/kblin/ncbi-genome-download/blob/master/contrib/gimme_taxa.py
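
To show what "obtaining descendant taxIDs" amounts to, here is a dependency-free sketch that walks the parent/child links in an NCBI taxdump nodes.dmp file. It is an illustration of the idea rather than the ETE3 approach itself: the helper names parse_nodes and descendant_taxids are hypothetical, and TOY_NODES is a simplified toy tree whose parent links are not the real NCBI lineage. With ETE3 installed, NCBITaxa().get_descendant_taxa(2, intermediate_nodes=True) should give the equivalent list for Bacteria, at the cost of downloading the taxonomy database on first use.

```python
from collections import defaultdict


def parse_nodes(lines):
    """Map each parent taxID to its child taxIDs from nodes.dmp-style lines.

    Each line is expected to start with: tax_id | parent_tax_id | rank | ...
    """
    children = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        taxid, parent = int(fields[0]), int(fields[1])
        if taxid != parent:  # the root node lists itself as its parent
            children[parent].append(taxid)
    return children


def descendant_taxids(children, root):
    """Return root plus every taxID below it, sorted (iterative depth-first walk)."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)


# Toy tree for illustration only; real lineages have more intermediate ranks.
TOY_NODES = [
    "1\t|\t1\t|\tno rank\t|",
    "2\t|\t1\t|\tsuperkingdom\t|",
    "1224\t|\t2\t|\tphylum\t|",
    "562\t|\t1224\t|\tspecies\t|",
    "2759\t|\t1\t|\tsuperkingdom\t|",
    "9606\t|\t2759\t|\tspecies\t|",
]

if __name__ == "__main__":
    tree = parse_nodes(TOY_NODES)
    # One taxID per line, the layout assumed for a blastp -taxidlist file.
    print("\n".join(str(t) for t in descendant_taxids(tree, 2)))
```

Writing the resulting IDs one per line gives a file that can serve as the inclusion list for a local search, or as the allowed set when filtering hits after a remote search.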
