7.7 years ago
Promi
Hey,
I just downloaded all.faa.tar.gz and all.ptt.tar.gz from ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/. Now I would like to merge the content of the ptt files into the FASTA header line of each corresponding protein sequence. The only thing common to both files is the PID (gene ID). As I am a beginner in bioinformatics, I would like to know how to do this in Python.
Thanks :)
Curious as to why you are using old archival RefSeq data?
Because the new RefSeq includes several versions of an assembly, which is quite hard to manipulate. I want to set up a local BLAST database for bacterial proteins or genomes.
I don't think this is the right solution.
You could get the current bacterial RefSeq genomes summary file here. The last column in the file contains direct links to the latest assembly folders of all bacterial genomes. From there it is a matter of getting the .faa files.
Thank you for the suggestion :)
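The suggestion above can be sketched in Python. This is only a rough sketch under a few assumptions: the summary file is tab-separated with `#` comment lines, the assembly folder link is the last column (as described above), and the protein file inside each folder is named after the folder with a `_protein.faa.gz` suffix. The function name `faa_urls` and the example link are made up for illustration.

```python
def faa_urls(summary_text):
    """Build one protein FASTA URL per row of a summary file.

    Assumes tab-separated rows, '#' comment lines, and the assembly
    folder link in the last column (as described in the thread).
    """
    urls = []
    for line in summary_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        folder = line.split("\t")[-1].strip().rstrip("/")
        if not folder or folder == "na":
            continue  # no folder link recorded for this row
        # Assumption: the protein file is <folder name>_protein.faa.gz
        name = folder.rsplit("/", 1)[-1]
        urls.append(f"{folder}/{name}_protein.faa.gz")
    return urls


# Hypothetical two-line summary, not real data
example = (
    "# assembly_accession\torganism_name\tftp_path\n"
    "GCF_000000000.1\tExample organism\t"
    "ftp://ftp.example.org/genomes/all/GCF_000000000.1_ASM0v1\n"
)
for url in faa_urls(example):
    print(url)
```

The printed links could then be passed to a download tool of your choice; check one folder by hand first to confirm the file naming.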
These files are large as far as FASTA files go. Can you show the first header from each file?
Fasta header:
PTT file header:
Cellvibrio gilvus ATCC 13127 chromosome, complete genome - 1..3526441
3164 proteins
Location Strand Length PID Gene Synonym Code COG Product
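Given the PTT column line above, a minimal Python sketch of the merge might look like the following. It rests on assumptions that should be checked against the real files: that the PTT columns are tab-separated, that the FASTA headers start with `>gi|<number>|` (the FASTA header was not shown in this thread), and that the PID column holds that same number. The function names and the bracketed tag format are made up for illustration.

```python
import re


def read_ptt(lines):
    """Map PID -> dict of PTT columns.

    Skips everything before the column line (title and protein count).
    """
    table = {}
    columns = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if columns is None:
            # The column line starts with 'Location' and contains 'PID'
            if fields[0] == "Location" and "PID" in fields:
                columns = fields
            continue
        if len(fields) == len(columns):
            row = dict(zip(columns, fields))
            table[row["PID"]] = row
    return table


# Assumed header layout: >gi|<PID>|ref|<accession>| description
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def annotate_fasta(fasta_lines, ptt):
    """Yield FASTA lines, appending PTT fields to matching headers."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        match = GI_PATTERN.match(line)
        if match and match.group(1) in ptt:
            row = ptt[match.group(1)]
            line += " [location={}] [strand={}] [gene={}] [cog={}]".format(
                row["Location"], row["Strand"], row["Gene"], row["COG"]
            )
        yield line


# Hypothetical miniature inputs, not real records
ptt_lines = [
    "Some genome - 1..100\n",
    "1 proteins\n",
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n",
    "10..90\t+\t26\t12345\tdnaA\tX_0001\t-\tCOG0593L\thypothetical\n",
]
fasta_lines = [
    ">gi|12345|ref|YP_000001.1| hypothetical [Some genome]\n",
    "MKV\n",
    ">gi|999|ref|YP_9.1| other\n",
    "MM\n",
]
ptt = read_ptt(ptt_lines)
for out_line in annotate_fasta(fasta_lines, ptt):
    print(out_line)
```

For the real data, the two line lists would come from an opened .ptt file and its matching .faa file, and the output would be written to a new FASTA file. Headers with no matching PID are passed through unchanged.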