Question

accessing sequence as columns from file

0

6.6 years ago

mdsiddra ▴ 30

I am handling a protein sequence file in phylip format using Python.

     5    592
Homo_sapie MEMQDLTSPH SRLSGSSESP SGPKLGNSHI NSNSMTPNGT EVKTEPMSSS 
Macaca_mul MEMQDLTSPH SRLSGSSESP SGPKLDNSHI NSNSMTPNGT EVKTEPMSSS 
Mus_muscul MEMQDLTSPH SRLSGSSESP SGPKLDSSHI NSTSMTPNGT EVKTEPMSSS 
Danio_reri ---------- ---------- ---------- ---------M SWILMWSLLS 
Ciona_inte ---------- ---------- ---------- ------MLFS VYIVMMIVTS 

           ETASTTADGS LNNFSGSAIG SSSFSPRPTH QFSPPQIYPS NRPYPHILPT 
           ETASTTADGS LDNFSGSAIG SSNFSPRPTH QS-PPQIYAS NRPYPHILPT 
           EIASTAADGS LDSFSGSALG SSSFSPRPAH PFSPPQIYPS -KSYPHILPT 
           ACAPQIHSAS AQDSSNLLST EEPITPQPYN RSQYCQWPCK CPKTPPMCPP 
           QFYLSMATPN FDLRRSNQST EGDFYPARS- EARECQD-CT CPDTPGTCPP 

           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTEG 
           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTEG 
           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTES 
           GVSLLMDG-- -----CDCCR ACAKQVREAC NEKENCDHHR GLYCDYSADK 
           GVSRIMDG-- -----CDCCK MCAKQLNEPC DVRMRCDHHK GLYCDMNT-- 

           GLSQSQSPGQ TGFLSYGTSF STPQPGQAPY SYQMQGSSFT TSSGIYTGNN 
           GLSQSQSPGQ TGFLSYGTSF STPQPGQAPY SYQMQGLSFT TSSGLYTGNN 
           GLSQSQSPGQ TGFLSYGTSF GTPQPGQAPY SYQMQGSSFT TSSGLYSGNN 
           ---------- ---------- ---------P RYEKGVCAFL PGTGCEHNGV 
           ---------- ---------- ---------- ----GLCKAS PGVACYVGGS

I need to read the file into a list such that each element holds one column of the alignment. So far I have only managed to get the rows of the file, but I need the sequences extracted column by column. For example, for the file above, the first element of the list should be the first column:

M
M
M
-
-

Python • 3.3k views

ADD COMMENT • link updated 6.6 years ago by finswimmer 16k • written 6.6 years ago by mdsiddra ▴ 30

0

Try a PHYLIP-to-FASTA converter (such as BBMap (standalone) or AlignIO (Biopython library)). Then you can access the sequences by index.

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Actually, I need to work with the PHYLIP format.

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

0

You can try logic along these lines.

I have only tried it with the alignment you provided in your question.

Add conditional statements to handle other things, such as blank lines and spaces.

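# fh is assumed to be an already opened file handle for the PHYLIP file
listofInputData = [line.rstrip("\n") for line in fh]
finalList = list()
# Use the longest line as the width, not the header line
width = max(len(line) for line in listofInputData)
for i in range(width):
     tempStr = ""
     for line in listofInputData:
          # Skip lines that are too short (e.g. blank lines between blocks)
          if len(line) <= i:
              continue
          tempStr += line[i]
     finalList.append(tempStr)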
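
Indexing raw lines mixes the header, the names and the spacer blanks into the columns. A minimal sketch of separating them first, assuming a strict interleaved PHYLIP layout (10-character names in the first block, blank lines between blocks); the function name `read_interleaved_phylip` and the tiny example alignment are made up for illustration:

```python
def read_interleaved_phylip(lines):
    """Return (names, sequences) from the lines of an interleaved PHYLIP file."""
    lines = [line.rstrip("\n") for line in lines]
    n_taxa = int(lines[0].split()[0])           # header: "<taxa> <sites>"
    body = [line for line in lines[1:] if line.strip()]  # drop blank lines
    names = []
    seqs = []
    for k, line in enumerate(body):
        row = k % n_taxa                        # taxon this line belongs to
        if k < n_taxa:                          # only the first block has names
            names.append(line[:10].strip())     # assumption: 10-character names
            seqs.append("")
            line = line[10:]
        seqs[row] += line.replace(" ", "")      # remove the 10-residue spacers
    return names, seqs


# Made-up two-taxon alignment, 12 sites, in two blocks
example = """\
     2    12
Homo_sapie MEMQDL TSPH
Danio_reri ------ --MS

           SR
           WI
""".splitlines()

names, seqs = read_interleaved_phylip(example)
columns = ["".join(col) for col in zip(*seqs)]  # one string per alignment column
print(names)        # ['Homo_sapie', 'Danio_reri']
print(columns[0])   # M-
```

For a real file, the lines of the opened file can be passed in place of `example`.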

ADD REPLY • link 6.6 years ago by Nitin Narwade ★ 1.6k

0

Thanks, I have tried using this logic and I ended up with this error:

    tempStr += line[i]
IndexError: string index out of range

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

0

Hello mdsiddra,

Have you checked indentation for that line?

ADD REPLY • link 6.6 years ago by Nitin Narwade ★ 1.6k

0

Yes, I have checked. It still gives the same error. I need to access each column of the sequence file as one element of the list so that I can use each element for further calculation.

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

0

Here we have to handle the length of each line when indexing. We can solve it by adding an if statement that skips lines shorter than the current index:

if len(line) <= i:
     continue
else:
     tempStr += line[i]

ADD REPLY • link 6.6 years ago by Nitin Narwade ★ 1.6k

0

Thank you for the help!

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

1

6.6 years ago

cpad0112 21k

Something like this? @OP

from Bio import AlignIO as AO
alignments = AO.read(open("test.phy"), "phylip")
for record in alignments:
    print(record.seq[0])

M
M
M
-
-

Input is from OP. The posted excerpt has dimensions 5x200. Each column can be accessed by its index (record.seq[0], record.seq[1], ...); for all the columns: record.seq[0:200].

ADD COMMENT • link 6.6 years ago by cpad0112 21k

0

OP wants to access the entire column as a single record. Based on previous questions OP is trying to build a substitution matrix.

ADD REPLY • link 6.6 years ago by GenoMax 149k

0

Thank you for the help.

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

score 3 · Accepted Answer · 2018-07-12

3

6.6 years ago

finswimmer 16k

Based on the description in your edited post, this should work with a minor change. As your desired output no longer looks like a valid, strictly formatted PHYLIP file, I guess you could also omit the first line (the one with "5 592").

But again: I don't know if this all is necessary/useful to solve your underlying task.
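
A minimal sketch of the complete-deletion step described in the edited question, not necessarily the approach this answer had in mind. It assumes the names and the full sequences have already been extracted as plain strings; the function name `drop_gapped_columns` and the short sequences are made up for illustration:

```python
def drop_gapped_columns(seqs, gap="-"):
    """Remove every alignment column in which at least one sequence has a gap."""
    kept = [col for col in zip(*seqs) if gap not in col]
    if not kept:                      # every column contained a gap
        return ["" for _ in seqs]
    # Transpose back so that each element is one row (sequence) again
    return ["".join(row) for row in zip(*kept)]


names = ["Homo_sapie", "Danio_reri", "Ciona_inte"]
seqs = ["MEMQDLEVKT", "------SWIL", "---MLFVYIV"]   # made-up aligned rows

# Names keep their original order; the header line is omitted
for name, seq in zip(names, drop_gapped_columns(seqs)):
    print(name, seq)
```

Because the rows are rebuilt in the same order as they were read, the names and the filtered sequences stay paired.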

ADD COMMENT • link 6.6 years ago by finswimmer 16k

0

You're right that the file will no longer be in PHYLIP format. I intended to convert the text file to PHYLIP format later and I am working on it. Do you have any suggestions?

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

score 2 · Accepted Answer · 2018-07-11

2

6.6 years ago

finswimmer 16k

Hello mdsiddra,

if you don't want to reinvent the wheel, I recommend using the Biopython module. It can parse your PHYLIP file easily, like this:

from Bio import AlignIO


alignment = AlignIO.read(open("input.txt"), "phylip")

for record in alignment:
    print(record.seq[0])

You can make a function out of this, to get a list of characters in a given column:

from Bio import AlignIO


def column(alignment, index):
    c = []

    for record in alignment:
        c.append(record.seq[index])

    return c

alignment = AlignIO.read(open("input.txt"), "phylip")

print("\n".join(column(alignment, 0)))

fin swimmer

ADD COMMENT • link 6.6 years ago by finswimmer 16k

0

Thank you, finswimmer, it worked! But it gives me one column at a time. I want to run a loop for accessing all the columns one by one, for example with each column's data at one index.

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

1

I want to run a loop for accessing all the columns one by one

Then just do so, by incrementing the index value you pass to the function until the end of the alignment is reached. :)

Or build a generator like this:

from Bio import AlignIO


def column(alignment):
    n = len(alignment[0].seq)

    for i in range(n):
        c = []

        for record in alignment:
            c.append(record.seq[i])

        yield c


alignment = AlignIO.read(open("input.txt"), "phylip")

for c in column(alignment):
    print(c)

If you don't want to loop over each column but want a list of columns instead, just replace the last two lines with:

c = list(column(alignment))
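
The same list of columns can also be built with Python's built-in zip, assuming each sequence is available as a plain string of equal length (for example str(record.seq) for every record). A small sketch with made-up rows:

```python
rows = ["MEMQ", "MEMQ", "---M"]   # made-up aligned sequences, one string per taxon

# zip(*rows) yields one tuple per alignment column
columns = [list(col) for col in zip(*rows)]

print(columns[0])   # ['M', 'M', '-']
print(len(columns)) # 4
```

Note that zip stops at the shortest string, so rows of unequal length would be silently truncated.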

If this is not what you need, we need more information about your goal.

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Yes, it is exactly what I wanted. Thank you so much for all the help.

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

0

Hello mdsiddra,

glad it helped.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Please do the same for your previous posts as well.

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Using the above code, I am trying to read these columns so that if some specific character exists in a column, the column is written to a separate list and the rest of the columns go to another list. For example, if the following is the file whose columns I am using and I am checking for the presence of '-' in a column (as in the first column), then I want to remove that column from the existing list and append it to a new list, and do this for all columns.

     5    592
Homo_sapie MEMQDLTSPH SRLSGSSESP SGPKLGNSHI NSNSMTPNGT EVKTEPMSSS 
Macaca_mul MEMQDLTSPH SRLSGSSESP SGPKLDNSHI NSNSMTPNGT EVKTEPMSSS 
Mus_muscul MEMQDLTSPH SRLSGSSESP SGPKLDSSHI NSTSMTPNGT EVKTEPMSSS 
Danio_reri ---------- ---------- ---------- ---------M SWILMWSLLS 
Ciona_inte ---------- ---------- ---------- ------MLFS VYIVMMIVTS

Can you please help, finswimmer?

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

0

In Python you can check whether a certain value is in a list like this:

if "-" in my_list:
    do_something()

Do this in this loop:

for c in column(alignment):
    if "-" in c:
        do_something()

Could you please also explain what your final goal is? I just want to make sure that you are not asking an XY question.

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Yes, I want complete deletion of the columns which have any missing sequence character, so that I may use this file either in some tests of phylogeny or in calculating the probabilistic values.

For the above code:

if "-" in c:

I used this kind of logic earlier but I retrieved only the instance and not the entire column containing the instance. So I had to seek help once again. (Moreover, as I am a new user, I can only post a limited number of times, which is why I have to wait 6 hours before asking anything new.)

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

0

I used this kind of logic earlier but I retrieved only the instance and not the entire column containing the instance.

I don't understand this sentence.

The if statement doesn't return anything. If the condition is true, the code block is executed; otherwise it is not. Your column data is in c. So just append it to the list you like if a - appears in it. Also extend the code with an else branch where you append c to the other list if there is no - in it.

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

I meant the following:

for c in column(alignment):  ### Loop over each column
    if "-" in c:
        list2.append(c)
    else:
        list3.append(c)
    print (c)

If I use this if-else block within the for loop, it works well, but the last print statement is unnecessary for me. Instead I tried writing to a new file, like this:

alignment = AlignIO.read(open("FBXO7_clustal_algn.phy"), "phylip")
g = open("New_Columns.txt", "w")

list2 = []
list3 = []
for c in column(alignment):  ### Loop over each column
    if "-" in c:
        list2.append(c)
    else:
        list3.append(c)
    g.write(c)
g.close()

But this is not working; it gives this error:

    g.write(c)
TypeError: write() argument must be str, not list

Also I tried this way:

c = list(column(alignment))   ### getting a list of columns
if "-" in c:
    list2.append(c)
else:
    list3.append(c)

print (list2)
print (list3)

This is better, but when I try to write these lists to a file, it raises an error:

    g.write(c)
TypeError: write() argument must be str, not list

Also, what if, instead of writing this to another file, I remove the columns from the original file? Can this be done as well? The resulting file would contain the desired columns only.

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

1

Hello again,

Writing to a file is a new task you haven't mentioned before, so you should show an example of what this file should look like. It is much easier to write each column as a row in the new file than to keep it as a column. I'm not aware of an easy method to write a file column-wise.

Some more comments on your code:

for c in column(alignment):  ### Loop over each column
    if "-" in c:
        list2.append(c)
    else:
        list3.append(c)
    g.write(c)

Here you are trying to write the column to a new file regardless of whether it contains - or not. Is this what you want to do?

As the error says, the data you pass to write() must be a string, but you are passing a list. You can join all elements of the list into a string like this: "".join(c)

c = list(column(alignment))   ### getting a list of columns
if "-" in c:
    list2.append(c)
else:
    list3.append(c)

Here c is a list of columns. So you are not checking for the presence of - in one column. You have to loop over the elements in c.
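
Both points can be sketched together: check each column individually, and join a column into a string before writing it. The column data below is made up, and io.StringIO stands in for the real output file so the example runs without touching the disk:

```python
import io

columns = [["M", "M", "-"], ["E", "E", "S"], ["Q", "-", "W"]]  # made-up columns

with_gap = []      # columns containing at least one "-"
without_gap = []   # columns with no "-"
for c in columns:                 # check one column at a time
    if "-" in c:
        with_gap.append(c)
    else:
        without_gap.append(c)

out = io.StringIO()               # stand-in for open("New_Columns.txt", "w")
for c in without_gap:
    out.write("".join(c) + "\n")  # write() needs a str, so join the list first

print(out.getvalue())             # EES
```

Each gap-free column ends up as one row of the output, which is the row-wise layout suggested above.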

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

I have reached my limit for adding new posts, so I am explaining my task by editing the previous post :|

My input file looks like this (a sequence file in PHYLIP format):

     5    592
Homo_sapie MEMQDLTSPH SRLSGSSESP SGPKLGNSHI NSNSMTPNGT EVKTEPMSSS 
Macaca_mul MEMQDLTSPH SRLSGSSESP SGPKLDNSHI NSNSMTPNGT EVKTEPMSSS 
Mus_muscul MEMQDLTSPH SRLSGSSESP SGPKLDSSHI NSTSMTPNGT EVKTEPMSSS 
Danio_reri ---------- ---------- ---------- ---------M SWILMWSLLS 
Ciona_inte ---------- ---------- ---------- ------MLFS VYIVMMIVTS 

           ETASTTADGS LNNFSGSAIG SSSFSPRPTH QFSPPQIYPS NRPYPHILPT 
           ETASTTADGS LDNFSGSAIG SSNFSPRPTH QS-PPQIYAS NRPYPHILPT 
           EIASTAADGS LDSFSGSALG SSSFSPRPAH PFSPPQIYPS -KSYPHILPT 
           ACAPQIHSAS AQDSSNLLST EEPITPQPYN RSQYCQWPCK CPKTPPMCPP 
           QFYLSMATPN FDLRRSNQST EGDFYPARS- EARECQD-CT CPDTPGTCPP 

           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTEG 
           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTEG 
           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTES 
           GVSLLMDG-- -----CDCCR ACAKQVREAC NEKENCDHHR GLYCDYSADK 
           GVSRIMDG-- -----CDCCK MCAKQLNEPC DVRMRCDHHK GLYCDMNT--

Now I want to divide the file into 2 parts. The 1st part will contain only the first line of the file and the names of the sequences, like this:

     5    592
Homo_sapie
Macaca_mul
Mus_muscul 
Danio_reri
Ciona_inte

The 2nd part will be left with the sequences only, like this:

           MEMQDLTSPH SRLSGSSESP SGPKLGNSHI NSNSMTPNGT EVKTEPMSSS 
           MEMQDLTSPH SRLSGSSESP SGPKLDNSHI NSNSMTPNGT EVKTEPMSSS 
           MEMQDLTSPH SRLSGSSESP SGPKLDSSHI NSTSMTPNGT EVKTEPMSSS 
           ---------- ---------- ---------- ---------M SWILMWSLLS 
           ---------- ---------- ---------- ------MLFS VYIVMMIVTS 

           ETASTTADGS LNNFSGSAIG SSSFSPRPTH QFSPPQIYPS NRPYPHILPT 
           ETASTTADGS LDNFSGSAIG SSNFSPRPTH QS-PPQIYAS NRPYPHILPT 
           EIASTAADGS LDSFSGSALG SSSFSPRPAH PFSPPQIYPS -KSYPHILPT 
           ACAPQIHSAS AQDSSNLLST EEPITPQPYN RSQYCQWPCK CPKTPPMCPP 
           QFYLSMATPN FDLRRSNQST EGDFYPARS- EARECQD-CT CPDTPGTCPP 

           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTEG 
           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTEG 
           PSSQTMAAYG QTQFTTGMQQ ATAYATYPQP GQPYGISSYG ALWAGIKTES 
           GVSLLMDG-- -----CDCCR ACAKQVREAC NEKENCDHHR GLYCDYSADK 
           GVSRIMDG-- -----CDCCK MCAKQLNEPC DVRMRCDHHK GLYCDMNT--

Now, the file will be checked for columns containing "-" and also for empty columns. The resulting file will contain only sequence characters (no empty column and no column with any "-"), like this:

EVKTEPMSSSETASTTADGSLNNFSGSAIGSSSFSPRPTQFPPQIPSRPYPHILPT
EVKTEPMSSSETASTTADGSLDNFSGSAIGSSNFSPRPTQSPPQIASRPYPHILPT
EVKTEPMSSSEIASTAADGSLDSFSGSALGSSSFSPRPAPFPPQIPSKSYPHILPT
SWILMWSLLSACAPQIHSASAQDSSNLLSTEEPITPQPYRSYCQWCKPKTPPMCPP
VYIVMMIVTSQFYLSMATPNFDLRRSNQSTEGDFYPARSEAECQDCTPDTPGTCPP

PSSQTMAATGMQQATAYATYPQPGQPYGISSYGALWAGIKT
PSSQTMAATGMQQATAYATYPQPGQPYGISSYGALWAGIKT
PSSQTMAATGMQQATAYATYPQPGQPYGISSYGALWAGIKT
GVSLLMDGCDCCRACAKQVREACNEKENCDHHRGLYCDYSA
GVSRIMDGCDCCKMCAKQLNEPCDVRMRCDHHKGLYCDMNT

Now, the 1st chunk containing the names and the final filtered sequences will be combined so that the naming order (as it was in the original file) is not changed.

Did I explain my problem understandably?

ADD REPLY • link 6.6 years ago by mdsiddra ▴ 30

1

This is what I meant by XY-Question :)

Before continuing with this task, you should take a closer look at what your input data must look like to achieve this:

so that I may use this file either in some tests of phylogeny or in calculating the probabilistic values.

ADD REPLY • link 6.6 years ago by finswimmer 16k