Question

how to search for patterns in proteins in all prokaryotic genomes

0

Entering edit mode

8.2 years ago

Almeisan • 0

I want to write an algorithm to find a certain protein motif. I want to download the protein genome of a random bacteria, save the accession number of that genome. Then see if the genome contains one or more motifs that match my search conditions. If they do, save that motif with the accession number so I can find it back later.

Then move on to the next genome. Eventually to exhaust all prokaryotic genomes.

I think I want to use python or biopython for this. But this is a lot more confusing that I anticipated. When I search for a method, I get examples/suggestions using ftp, using entrez, using refseq, using the programmatical API of the database in question. What do I use?

I so want to download everything manually, then use C++ to program the algorithm. At least then I know what I am doing. But that should be the wrong way of going about this.

python protein • 1.8k views

ADD COMMENT • link updated 8.2 years ago by colindaven 7.0k • written 8.2 years ago by Almeisan • 0

0

Entering edit mode

When possible use a proven tool. Fuzzpro from EMBOSS for this case. Unless programming your own is an assigned task.

Genomes are available here.

ADD REPLY • link 8.2 years ago by GenoMax 148k

0

Entering edit mode

That doesn't explain how I download genomes automatically.

I tried to compile biopython examples. Example:

from Bio import Entrez
handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD")
record = Entrez.read(handle)
handle.close()
record["Count"] >= 2
"156535671" in record["IdList"]

It gives UnboundLocalError: local variable 'url' referenced before assignment for record = Entrez.read(handle)

Google says this error is a bug, suggesting you should update Biopython to the latest version. I did apt-get so I assume I have the latest version. When I look for a way to check what version I have, I find there is no way to even check Biopython version. What?

Nice the manual ftp link. But I have to click 10 times into 10 subdirectories to get a protein fasta file.

Also, when I look for Fuzzpro, it seems missing, while Fuzznuc is there: http://biopython.org/DIST/docs/api/Bio.Emboss.Applications-module.html

ADD REPLY • link 8.2 years ago by Almeisan • 0

0

Entering edit mode

I would not assume using "apt-get" to provide the latest version. That would depend on the distribution you have installed. I am unfamiliar with python/biopython but a way to know the version installed might be checking the version of the package installed with apt-get.

ADD REPLY • link 8.2 years ago by ddiez ★ 2.0k

0

Entering edit mode

The compile error is fixed when I replaced Parser.py with some random online one. Seems indeed there is a broken version of biopython inside the Ubuntu repositories.

ADD REPLY • link 8.2 years ago by Almeisan • 0

0

Entering edit mode

Use this recipe to download all genomes automatically: C: Download All The Bacterial Genomes From Ncbi

ADD REPLY • link 8.2 years ago by GenoMax 148k

0

Entering edit mode

I found a file called All Archaea Bacteria.gene info.gz that seems to contain accession numbers to a lot of genes. Seems I will be able to go line by line through that file and use the accession numbers with efetch to get the fasta files.

ADD REPLY • link 8.2 years ago by Almeisan • 0

0

Entering edit mode

Or you could just get base urls from the 20th column of this file and wget *_protein.faa.gz for each genome.

ADD REPLY • link 8.2 years ago by 5heikki 11k

0

Entering edit mode

Maybe I am not using the right approach, but what I tried now is going wrong in a confusing way.

Example, I have a file with gene ids. For example, 1246500. When I search that manually in ncbi, I get the right gene.

But when I do:

handle = Entrez.efetch(db="protein", id=accession, rettype="fasta")
record = SeqIO.read(handle, "fasta")

And I print that, it is some C.elegans gene. Apparently that has a gi of gi:1246500. But gi no longer exist? The actual accession numbers of my gene are NC_001911.1 and AAD12597.1. But I cannot efetch that with the gene id?

So these gene id are useless?

ADD REPLY • link 8.2 years ago by Almeisan • 0

0

Entering edit mode

NC_001911.1 is a plasmid sequence accession #. You are not going to get protein entries for the whole plasmid. AAD12597.1 is a valid protein accession and should work.

ADD REPLY • link 8.2 years ago by GenoMax 148k

0

Entering edit mode

Sure, but I don't have a 250 mb file with protein accession numbers. I have them with gene id.

ADD REPLY • link 8.2 years ago by Almeisan • 0

0

Entering edit mode

I can use this link: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=1246500&retmode

And get the right sequence.

But when I do this in biopython

handle = Entrez.efetch(db="gene", id=accession)

It finds no records.

Seems this gene ids indeed are useless

ADD REPLY • link 8.2 years ago by Almeisan • 0

0

Entering edit mode

As you are aware GI (GeneInfo identifier to be correct) numbers were done away by NCBI in September (in external records as far as I know, they may still be in use internally at NCBI and that is why your query works).

Is GI's all you have access to?

ADD REPLY • link 8.2 years ago by GenoMax 148k

score 0 · Answer 1 · 2016-10-17

0

Entering edit mode

8.2 years ago

colindaven 7.0k

I would take the brute force of downloading everything you need from the NCBI FTP site using wget.

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

See the proteins in the .faa.gz files. This way you shouldn't get any C. elegans hits.

ADD COMMENT • link 8.2 years ago by colindaven 7.0k