How to add ptt file content to fasta header

7.7 years ago

Promi ▴ 10

Hey,

I just downloaded all.faa.tar.gz and all.ptt.tar.gz from ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/. Now I would like to merge the content of the ptt files into the FASTA header line of each corresponding protein sequence. The only field common to both files is the PID (gene ID). As I am a beginner in bioinformatics, I would like to know how to do this in Python.

Thanks :)

python fasta ptt file • 2.4k views

ADD COMMENT • link 7.7 years ago by Promi ▴ 10

Curious: why are you using the old archival RefSeq data?

ADD REPLY • link 7.7 years ago by GenoMax 147k

Because the new RefSeq includes several versions of each assembly, which makes it quite hard to manipulate. I want to set up a local BLAST database for bacterial proteins or genomes.

ADD REPLY • link 7.7 years ago by Promi ▴ 10

I don't think this is the right solution.

You could get the current bacterial RefSeq genomes summary file here. The last column in the file contains direct links to the latest assembly folders for all bacterial genomes. From there it is a matter of getting the .faa files.

ADD REPLY • link 7.7 years ago by GenoMax 147k
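The suggestion above can be sketched in Python roughly as follows. This is only a minimal sketch, assuming the summary file is tab-delimited, that comment lines start with `#`, and that the last column holds the assembly folder link as described in the reply. The function names are hypothetical, and the `<folder name>_protein.faa.gz` file naming is an assumption that is not stated in this thread, so check it against an actual assembly folder before relying on it.

```python
def assembly_folders(summary_lines):
    """Yield the last tab-separated column of every non-comment line.

    Assumption: the last column is the link to the latest assembly folder.
    """
    for line in summary_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        yield line.split("\t")[-1]


def protein_faa_url(folder_url):
    """Build the link to the protein FASTA inside an assembly folder.

    Assumption (not from the thread): the file is named
    <folder name>_protein.faa.gz.
    """
    folder_url = folder_url.rstrip("/")
    folder_name = folder_url.rsplit("/", 1)[-1]
    return f"{folder_url}/{folder_name}_protein.faa.gz"
```

With a downloaded summary file, something like `[protein_faa_url(f) for f in assembly_folders(open(path))]` would give one link per assembly, where `path` is wherever the summary file was saved.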

Thank you for the suggestion :)

ADD REPLY • link 7.7 years ago by Promi ▴ 10

These files are large as far as FASTA files go. Can you show the first header from each file?

ADD REPLY • link 7.7 years ago by st.ph.n ★ 2.7k

Fasta header:

>gi|336319135|ref|YP_004599103.1| chromosomal replication initiator protein DnaA [[Cellvibrio] gilvus ATCC 13127]
MAQDEELSRVWGHVVTTLEESPDITQRQLAFVRLAQPLGLLDGTIILAVGNEYTKEYLETKVRAEVTSAL
GSALGRDGRFAITVDPSLVDDAPPAVRAMTSAPELGVVTDGTDERGAPNRTVPTDADTGRHERSPMLSES
AEPTRPVRETASSRRPAAEPARLNPHYLFETFVIGSSNRFAHAAAVAVAEAPAKAYNPLFIYGDSGLGKT
HLLHAIGHYAQNLYPSVRVRYVNSEEFTNDFINSISEGKAGAFQRRYREVDVLLIDDIQFLQGKEQTMEE

PTT file header:

Cellvibrio gilvus ATCC 13127 chromosome, complete genome - 1..3526441
3164 proteins
Location Strand Length PID Gene Synonym Code COG Product

ADD REPLY • link updated 7.7 years ago by GenoMax 147k • written 7.7 years ago by Promi ▴ 10
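Given the two headers shown above, one possible approach in Python is to index the PTT rows by PID and look each FASTA record up by the number after `gi|`. The following is a minimal sketch, not a tested pipeline. It assumes the PTT file is tab-delimited with two title lines before the column header, and that the PID column matches the `gi` number in the FASTA header. The function names and the `[key=value]` header format are made up for illustration.

```python
import csv

# Columns appended to the header; an arbitrary choice for this sketch.
EXTRA_COLUMNS = ("Location", "Strand", "Gene", "Synonym", "COG")


def load_ptt(ptt_lines):
    """Return {PID: row dict} for one PTT file.

    Assumes two title lines, then a tab-delimited table whose first
    row is the column header (Location, Strand, Length, PID, ...).
    """
    rows = iter(ptt_lines)
    next(rows)  # genome title line
    next(rows)  # "<N> proteins" line
    reader = csv.DictReader(rows, delimiter="\t")
    return {row["PID"]: row for row in reader}


def annotate_fasta(fasta_lines, ptt_by_pid):
    """Yield FASTA lines, appending PTT fields to matching headers."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split("|")
            # Header layout assumed: >gi|<PID>|ref|<accession>| description
            pid = fields[1] if len(fields) > 1 and fields[0] == "gi" else None
            row = ptt_by_pid.get(pid)
            if row is not None:
                extra = " ".join(f"[{key}={row[key]}]" for key in EXTRA_COLUMNS)
                line = f"{line} {extra}"
        yield line
```

To use it on real files, open the `.ptt` file and pass the file object to `load_ptt`, then pass the opened `.faa` file to `annotate_fasta` and write each yielded line plus a newline to the output file. Headers with no matching PID are passed through unchanged, so it is easy to spot records that were not annotated.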
