I want to write an algorithm to find a certain protein motif. I want to download the protein genome of a random bacteria, save the accession number of that genome. Then see if the genome contains one or more motifs that match my search conditions. If they do, save that motif with the accession number so I can find it back later.
Then move on to the next genome. Eventually to exhaust all prokaryotic genomes.
I think I want to use python or biopython for this. But this is a lot more confusing that I anticipated. When I search for a method, I get examples/suggestions using ftp, using entrez, using refseq, using the programmatical API of the database in question. What do I use?
I so want to download everything manually, then use C++ to program the algorithm. At least then I know what I am doing. But that should be the wrong way of going about this.
When possible use a proven tool. Fuzzpro from EMBOSS for this case. Unless programming your own is an assigned task.
Genomes are available here.
That doesn't explain how I download genomes automatically.
I tried to compile biopython examples. Example:
It gives UnboundLocalError: local variable 'url' referenced before assignment for record = Entrez.read(handle)
Google says this error is a bug, suggesting you should update Biopython to the latest version. I did apt-get so I assume I have the latest version. When I look for a way to check what version I have, I find there is no way to even check Biopython version. What?
Nice the manual ftp link. But I have to click 10 times into 10 subdirectories to get a protein fasta file.
Also, when I look for Fuzzpro, it seems missing, while Fuzznuc is there: http://biopython.org/DIST/docs/api/Bio.Emboss.Applications-module.html
I would not assume using "apt-get" to provide the latest version. That would depend on the distribution you have installed. I am unfamiliar with python/biopython but a way to know the version installed might be checking the version of the package installed with apt-get.
The compile error is fixed when I replaced Parser.py with some random online one. Seems indeed there is a broken version of biopython inside the Ubuntu repositories.
Use this recipe to download all genomes automatically: C: Download All The Bacterial Genomes From Ncbi
I found a file called All Archaea Bacteria.gene info.gz that seems to contain accession numbers to a lot of genes. Seems I will be able to go line by line through that file and use the accession numbers with efetch to get the fasta files.
Or you could just get base urls from the 20th column of this file and wget *_protein.faa.gz for each genome.
Maybe I am not using the right approach, but what I tried now is going wrong in a confusing way.
Example, I have a file with gene ids. For example, 1246500. When I search that manually in ncbi, I get the right gene.
But when I do:
And I print that, it is some C.elegans gene. Apparently that has a gi of gi:1246500. But gi no longer exist? The actual accession numbers of my gene are NC_001911.1 and AAD12597.1. But I cannot efetch that with the gene id?
So these gene id are useless?
NC_001911.1 is a plasmid sequence accession #. You are not going to get protein entries for the whole plasmid. AAD12597.1 is a valid protein accession and should work.
Sure, but I don't have a 250 mb file with protein accession numbers. I have them with gene id.
I can use this link: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=1246500&retmode
And get the right sequence.
But when I do this in biopython
It finds no records.
Seems this gene ids indeed are useless
As you are aware GI (GeneInfo identifier to be correct) numbers were done away by NCBI in September (in external records as far as I know, they may still be in use internally at NCBI and that is why your query works).
Is GI's all you have access to?