Hi,
I'm running into a problem with two sequence formats, FASTA and PIR. I'm using MODELLER and its functionality from my Biopython scripts. Biopython works with the FASTA format, whereas MODELLER needs a PIR file to build a comparative model, because that file carries the structural information. I'm having a hard time dealing with these two formats. What I tried first was to take the two sequences in FASTA format and read them in:
aln.append(file = 'file.fasta', align_codes='all', alignment_format='FASTA')
Then I wrote the alignment back out in both formats:
aln.write(file='5fd1_1fdx_output.fasta', alignment_format='FASTA')
aln.write(file='5fd1_1fdx_ouput.pir', alignment_format = 'PIR')
I used the latter (5fd1_1fdx_ouput.pir) to build the model, but it doesn't work because I lose information whenever I convert from FASTA to PIR.
The input FASTA file (5fd1_1fdx_sequence.fasta) is:
>5fd1
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER
>1fdx
AYVINDSCIACGACKPECPVNIIQGS--IYAIDADSCIDCGSCASVCPVGAPNPED-----------------
-------------------------------
and the output file (5fd1_1fdx_ouput.pir) is:
>P1;5fd1
sequence:: : : : :::-1.00:-1.00
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*
>P1;1fdx
sequence:: : : : :::-1.00:-1.00
AYVINDSCIACG--ACKPECPVN-IIQG-SIYAIDADSCIDCGSCASVCPVGA----------------------
-------------PNPED-------------*
I need a way in Python or Biopython to convert between these two file formats without losing information. It is important that the output in the PIR file has this form:
>P1;5fd1
structureX:5fd1:1 :A:106 :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*
>P1;1fdx
sequence:1fdx:1 : :54 : :ferredoxin:Peptococcus aerogenes: 2.00:-1.00
AYVINDSC--IACGACKPECPVNIIQGS--IYAIDADSCIDCGSCASVCPVGAPNPED-----------------
-------------------------------*
As you can see, the information in the second line of each sequence record is lost. Does anyone know how to convert between these formats without losing information? Thank you.
Biopython doesn't actually support the MODELLER PIR format (at least not for writing). The link points to the EBI format, which is substantially different from the MODELLER format even though they share the name.
How I hate it when this happens! I remember this being the case with BED formats as well - one a tab-separated plain text file, the other a binary file. Don't use duplicate names, people! </rant>
Why is it not possible to create a PIR file from an exported FASTA?
It's been more than 4 years, so I might be losing context here, but it looks like exported FASTA has less information content than the PIR, which is probably why I said it was not possible. Technically, it might be possible but the PIR may end up with a lot of blank fields.
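To illustrate the point about blank fields, here is a minimal sketch in plain Python (no Biopython or MODELLER calls) that writes MODELLER-style PIR records from an aligned FASTA string. It assumes the second line of each record holds ten colon-separated fields (entry type, PDB code, first residue, first chain, last residue, last chain, protein name, source, resolution, R-factor), as in the desired output in the question. The FASTA file does not contain any of these values, so they have to be supplied by hand; anything not supplied is left blank. The function names and the dictionary keys in `PIR_FIELDS` are made up for this example, the fields are not space-padded the way the question's example is, and the output has not been checked against MODELLER itself.

```python
# Hypothetical key names for the ten header fields; they are not part of any library.
PIR_FIELDS = ("type", "file", "start", "start_chain", "end",
              "end_chain", "name", "source", "resolution", "rfactor")


def parse_fasta(text):
    """Return a list of (code, sequence) pairs; gap characters are kept as-is."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            code = (line[1:].split() or [""])[0]  # first word after '>'
            records.append((code, []))
        elif records:
            records[-1][1].append(line)
    return [(code, "".join(parts)) for code, parts in records]


def pir_header(code, meta):
    """Join the ten header fields with colons; missing fields stay blank."""
    values = {"type": "sequence", "file": code}
    values.update(meta)
    return ":".join(str(values.get(key, "")) for key in PIR_FIELDS)


def fasta_to_pir(text, metadata, width=75):
    """Convert aligned FASTA text to PIR text, using per-code metadata dicts."""
    out = []
    for code, seq in parse_fasta(text):
        out.append(f">P1;{code}")
        out.append(pir_header(code, metadata.get(code, {})))
        body = seq + "*"  # a PIR sequence ends with an asterisk
        out.extend(body[i:i + width] for i in range(0, len(body), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Metadata copied from the desired output in the question.
    meta = {
        "5fd1": {"type": "structureX", "start": 1, "start_chain": "A",
                 "end": 106, "end_chain": "A", "name": "ferredoxin",
                 "source": "Azotobacter vinelandii",
                 "resolution": "1.90", "rfactor": "0.19"},
    }
    print(fasta_to_pir(">5fd1\nAFVVTDNCIKCKYTDCVEVCPVDCF\n", meta))
```

A record with no metadata entry comes out as `sequence:<code>` followed by eight empty fields, which is the "lot of blank fields" case. Because the alignment rows are copied character for character, the gap positions from the FASTA file are preserved rather than recomputed.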