Converting Genbank To Fasta In Protein Form
4
2
Entering edit mode
13.6 years ago
Francis ▴ 60

So i have a sequence that is a .gb file. What I want to do is parse and change the format of the file. I've figured out how to parse it to FASTA format, although the sequence that is in the FASTA format is nucleic and i want it to be PROTEIN.

kind of stuck here...

any ideas?

import Bio
from Bio import SeqUtils
from Bio import Seq
from Bio import SeqIO

handle = 'sequence.gb' 
output = 'sequence.fasta'
data = Bio.SeqIO.parse(handle, 'gb')

fh = open(output, 'w')

for record in data:
    convert = Bio.SeqIO.write(record, output, 'fasta')
    dna = record.seq
    mrna = dna.transcribe()
    protein = mrna.translate()
biopython python structural • 13k views
ADD COMMENT
5
Entering edit mode
13.6 years ago
David W 4.9k

Hi Francis,

If I understand, you want to take a .gb file with nucleotides, and write a new file with the protein sequences in fasta? In Biopython sequences/records don't have to have a particular format until they're written (they are a python object) - so do the conversion from DNA to protein first, then write them out in the format you want. You also don't have to go DNA -> mRNA -> protein, if there is a DNA alphabet seq.translate() works fine

I don't have Biopython on the computer in front of me, so this is untested:

EDIT

The translate method belongs to sequences, rather than the record that holds information about the sequences and their ID, description etc : SeqRecord. So, you need to define a function that takes each record and makes a new SeqRecord with the translated sequences:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def translate_record(record):
  """Returns a new SeqRecord with translated sequence."""
  return SeqRecord(seq = record.seq.translate(), \
                   id =  record.id + "_translated", \
                   description = record.description)

infile= 'sequence.gb' #'handle' usually refers to an open() object
output = 'sequence.fasta'

records = Bio.SeqIO.parse(infile, 'gb')

#start an empty list, which we'll fill with protein seqs
proteins = []
for rec in records:
  proteins.append(translate_record(rec))

SeqIO.write(proteins, open(output, 'w'), 'fasta')

If you prefer, you can use generator expressions in place of the for loop (that way you don't have to load the whole list into memory):

proteins = (translate_record(r) for r in records)

You should check out the Biopython wiki for some examples of Seq and SeqIO

ADD COMMENT
0
Entering edit mode

it keeps returning me with the error saying" SeqRcord object has no atttribute 'translate'

ADD REPLY
0
Entering edit mode

Oh right, knew I should have checked it before I posted it. translate is a method on Seq() so in this case it's rec.seq.translate. I'll edit this with tested code later, but for now see this bit of the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement.

ADD REPLY
0
Entering edit mode

wonderful!! thanks so much!

ADD REPLY
0
Entering edit mode

Just to let you know, I'll be borrowing this code for your function if you don't mind.

ADD REPLY
0
Entering edit mode

Francis, of course you can use that snippet. I would say, it will be of more benefit to you in the long run if you play around with it in an interactive session - use print dir()and help() to work out what each piece of the code is doing

ADD REPLY
4
Entering edit mode
13.6 years ago
Neilfws 49k

Basically, the trick is to extract features of type "CDS". Most GenBank files should contain these features. If they don't, you'll have to employ translation, as you've started to do.

For files with CDS features, take a look at Peter Cock's tutorial, Converting GenBank files to FASTA format with Biopython. He shows example code in the section titled "Gene by Gene : GenBank to FASTA Amino Acids (.gbk to .faa)."

ADD COMMENT
3
Entering edit mode
13.6 years ago

Dealing with Genbank files is always trouble :o). It very much depends on what data you want from the file. Here is how to this with Biopieces www.biopieces.org):

In this approach I parse the "source" from the GenBank file which is the entire sequence in nucleotides - which I then translate to protein - define a sequence name and output as FASTA:

read_genbank -i test.gb -k AC,SEQ -f source |
translate_seq -f 1 |
rename_keys -k ACCESSION,SEQ_NAME |
write_fasta -w 60 -x

The resulting output:

>U49845
DPPYTTVSPPQV*ISTTEPLPT*DS*VSSRVTS*NEQ*SALHLKPLKFY*GWITSSVQDQ
EPPIDNICNIFRIYLENNKPPHCHYYN*KQNAKIIHYIIQRREKKRTTRHRTFGNSRHK*
ILATYVSSSSSTRALSQECNNTHRRYG*R*HLHNLKAPCRESPSFVE*FSLFI*ELIFLF
FTLTSCSD*HCNSHHH*KNRTIT**KNYIFLETISCFQHLRISRSIHLP*HSFRFHYC*Q
LLYHYSI**WPRPMRHILSENNTPQWQESMNRLHFKFPMIPINRL*TRQLK*HTIASTYR
AGFRLTLVLERSQVNLLLTYYLMRTPRCISM*YSRVRTLPTARL*TIHTNLLLQTVHPSR
YRQISIYWRC*KTMVILTAKTL*N*ILMKSST*LLTVQCSLTKNPLCRITDVLSCIMRRY
PIGCSSILAS*SLLGRHR**TRRLLQKQATVLSSSLQTLKDFLPLR*NSN*SSGLTS*LP
LFKIV**STLLTQVTFHMTYL*TMFISMTILFLLINWVL*TYWMLQTGWH*IMLPFPGLS
QMNYSVRTPILPIFLCPFMILMVM*FISTSKLSPQRICLPLVLFPILTLQGVNGSPTIFC
LLSLQTT*IQTFH*SLLIQAKTMTG*NSNHLI*H*LEKCPRISTSFH*V*KRTKVHNLKS
YILTSLAWIQR*LTQTTVRMQRPQEVLTTPPQQVLTHLLLTLQKFLLPPLLLLLLLQQRC
QQPIKLHLTIKKQ*QLRAVLLSH*ALS**LSFAS*YSGDAEGKIQTMKTYRMLLVDLI*I
ILQINQIKKTLHL*TTPLMMMLPRTMILQ*QEDWLL*TL*NWITTLPLNLIFPAWMKREI
LYQV*IHTMISSNPKVKKNY*QNPQYSLQRARSLTHRIGLLLCIWIVNQQ*INPGDILAT
CHQSLILSETVTDHKKLLIQKNFSI*KHQRRKNVRQGMSLCLHWTLGTAILALLP*ENQ*
HHHHIT*RSIVTATYKIFKTLKAVKTESLPQQCQLHLLTILFRLKMVKIFAGSIAWNQTE
DQVRKG**IFQIRVMSMLVKLRTFTDASQKCCDYTQRYFA*FYFPVLFFISGLQIPYILF
SFYT*RHLILIPFFKFHFCT*NKDPKMLSPSSY*EYTPFKILSSPLINFSLN**IIKGPT
SEPTKEVSFILGG*KPLLSGKFSSS*HLTQFESLSISAFSSKLSTLLFLSNLCPSSNSIA
LITASNVIVSSLTLGNFSKCIIKLFKEDRNSSNTSVSVMI*SSLSTCCNSLKSKTYFSMH
KSFFLLIMQMENL*TCVNLERTSSISSSI*SIKAGCLLMGTNCGKLNDW*VV*SND*GGY
TFL*NKIKLM*HFKYTLSHFSTHLFIKLTQRLLFFFSSWISVVAKTYTFFFRPFF*LSGK
VYIS*TGSSLSVKASGFD*LILRKWKLN**CRRICICISRLFMFLRTFDL*QGEKKYILF
FGKGESIM*KLE*NGRNKERLSSSFFQKAPNDNN*NEKDLPSVSNISCVSNNKIITSVAF
SAFVVCIFRNFSLINGNHKFSNELAISSNSF*ASSYLLWNSSHFFSHSSLSSSKATILLP
ICSEFKSASFSLSIASFSLASLSSSCCSRSWFFLV*FSLLDLKLLESSANCFVSDN*LSN
FSTSLSSCSFLADKDLISFSFSVLDCSNSLSCSLSSSYFSCHDSDSNFKLFNFSL

Alternatively, you can parse the tranlations from the feature table like this:

read_genbank -i test.gb -k AC -f CDS -q translation,product |
merge_vals -k ACCESSION,S_BEG,S_END,PRODUCT |
rename_keys -k ACCESSION,SEQ_NAME |
rename_keys -k TRANSLATION,SEQ |
write_fasta -w 60 -x

And the output:

>U49845_0_205_TCP1-beta
SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPR
TANRQHM
>U49845_686_3157_Axl2p
MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKT
AQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTY
QFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSY
YGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEF
ELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWV
ALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINA
TRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLG
LKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTS
AAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGVILVALICFLIFWRRRRENPDDEN
LPHAISGPDLNNPANKPNQENATPLNNPFDDDASSYDDTSIARRLAALNTLKLDNHSATE
SDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEP
AVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNS
NISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGEN
FCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQVKDIHGRIPEML
>U49845_3299_4036_Rev7p
MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEE
LILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSS
LNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKC
QEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESI
FGSLF

Cheers,

Martin

ADD COMMENT
1
Entering edit mode
12.8 years ago
Audriusa ▴ 10

If you do not need to put this into some automated pipeline (occasional conversion), there is a suitable JavaScript-based converting service on a web.

ADD COMMENT

Login before adding your answer.

Traffic: 2119 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6