Dear colleague,
I am working on an analysis of prokaryotic genomes from the NCBI Genome database.
- I downloaded the file prok_representative_genomes.txt from ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_representative_genomes.txt
After opening the file, I could see a column named "Chromosome RefSeq" (e.g., NZ_AQXM00000000).
- I downloaded all protein sequences for all bacterial genomes from https://www.ncbi.nlm.nih.gov/assembly.
Each file has a name like "GCF_000834735.1_ASM83473v1_protein.faa.gz".
It is odd that the two datasets use different accession number systems. Given that, how can I tell whether the genome a protein file belongs to is listed in the file "prok_representative_genomes.txt"?
My aim is to retrieve all the protein sequences for the genomes listed in the file "prok_representative_genomes.txt".
Thanks a lot,
Kind regards
Tom
Thanks,
Does anybody know why there are some 0 values in the "Genomes" column of the file prok_representative_genomes.txt? For example, "Acetobacter aceti".
The README says it's manually updated. I tend to believe you'll see a more complete picture when using prokaryotes.txt (computationally updated...)
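For the original question of matching protein files to the listed genomes, here is a minimal Python sketch rather than a tested pipeline. It assumes you have some tab-delimited table (possibly prokaryotes.txt) in which one column holds the accession used in prok_representative_genomes.txt and another holds the GCF_/GCA_ assembly accession. The column names passed to `load_lookup` are placeholders and the sample accessions are made up, so check the real header line before using it. The only part taken directly from the post is that the protein file names start with the assembly accession (e.g. "GCF_000834735.1_ASM83473v1_protein.faa.gz").

```python
import csv
import io
import re

# Assembly accessions at the start of a file name, e.g. GCF_000834735.1
ASSEMBLY_RE = re.compile(r"^(GC[AF]_\d+\.\d+)")


def assembly_from_filename(filename):
    """Return the leading assembly accession of a file name, or None."""
    match = ASSEMBLY_RE.match(filename)
    return match.group(1) if match else None


def load_lookup(handle, key_column, value_column):
    """Map key_column -> value_column from a tab-delimited table with a header.

    Both column names are assumptions supplied by the caller; rows where
    either field is empty or "-" are skipped.
    """
    lookup = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row.get(key_column) or "").strip()
        value = (row.get(value_column) or "").strip()
        if key and key != "-" and value and value != "-":
            lookup[key] = value
    return lookup


def select_protein_files(filenames, wanted_assemblies):
    """Keep only the files whose assembly accession is in wanted_assemblies."""
    return [f for f in filenames if assembly_from_filename(f) in wanted_assemblies]


if __name__ == "__main__":
    # Made-up rows for illustration only; real tables have more columns.
    table = io.StringIO(
        "Chromosome RefSeq\tAssembly Accession\n"
        "NZ_FAKE00000000\tGCF_000000001.1\n"
    )
    lookup = load_lookup(table, "Chromosome RefSeq", "Assembly Accession")
    files = ["GCF_000000001.1_FAKEv1_protein.faa.gz",
             "GCF_000000009.1_OTHERv1_protein.faa.gz"]
    print(select_protein_files(files, set(lookup.values())))
```

In practice you would open the downloaded table with `open(path, newline="")` instead of the `io.StringIO` stand-in, and build `wanted_assemblies` only from the accessions that actually appear in prok_representative_genomes.txt.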