Question

Compare two sequences in python

2.6 years ago

Debut ▴ 20

Hello, I have started writing code that reads two columns from a CSV file: one holds the protein sequence and the other holds the sequence with mutations (MUT). I would like to get the sequences, split each one into a list of characters, and then compare the two lists of characters. With my code below, when I print caraMut, each printed line corresponds to a single letter of the sequence, whereas I wanted one list for the whole mutation sequence (for example: caraMut = ['A', 'M', 'T', 'N', ...]) so that I can compare the two sequences. Here is my Python code:

with open('data2.csv', 'r') as myFile:
    lignes = myFile.readlines()

for ligne in lignes:
    split_tableau = ligne.split(",")
    seq = split_tableau[4]
    mut = split_tableau[6]

    for caraS in seq:
        caraSeq = caraS.split()
    for caraM in mut:
        caraMut = caraM.split()
        print(caraMut)

for i in seq:
    for j in mut:
        if (i != j) and (j != " "):
            box = print(i, ">", "j")

python pandas dataframe csv • 2.5k views

ADD COMMENT • link 2.6 years ago by Debut ▴ 20

This?

for ligne in lignes :
    split_tableau= ligne.split(",")
    seq= list(split_tableau[4])
    mut= list(split_tableau[6])

You have tags like pandas and dataframe but you are not using any pandas. Do you want it in pandas or pure python?

In pandas there are many options, but for you something like this would be the easiest (not working code, just for inspiration):

import pandas as pd

def compare(row):
    seq = list(row["seq_col"])
    mut = list(row["mut_col"])
    # compare code goes here; it should assign its result to `comparison`
    return comparison

df = pd.read_csv('data2.csv')
df["comparison"] = df.apply(compare, axis=1)
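
One possible way to fill in the `compare` placeholder is sketched below. It is only an illustration: the column names `seq_col` and `mut_col` are the hypothetical ones from the snippet above, and it assumes that a space in the mutation string means "no mutation at this position". A plain dict stands in for a DataFrame row so the sketch runs without pandas.

```python
def compare(row):
    """Return the 0-based indices where the mutation string differs from the sequence."""
    seq = list(row["seq_col"])  # hypothetical column name
    mut = list(row["mut_col"])  # hypothetical column name
    # Assumption: a space in mut means "unchanged at this position".
    comparison = [
        i
        for i, (s, m) in enumerate(zip(seq, mut))
        if m != " " and m != s
    ]
    return comparison


# Made-up example row; a dict is indexed by key just like a pandas row.
example_row = {"seq_col": "GEDAPEEMN", "mut_col": "   T  L  "}
print(compare(example_row))  # [3, 6]
```

With a real DataFrame, the same function could be passed to `df.apply(compare, axis=1)` as in the snippet above.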

ADD REPLY • link 2.6 years ago by gb ★ 2.2k

Thank you for your reply. Either pandas or pure Python is fine. The goal is to have two lists holding the sequences so that I can access and manipulate, for example, the mutation positions. I have to add a column to my CSV file that highlights the mutations, for example (L 152>M), i.e. at position 152 the amino acid L has been changed to M. With the code you showed me, I don't think we can get the mutation positions.
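
A minimal sketch of how labels such as `L152>M` might be produced in pure Python is given below. The function name `mutation_labels` and the sample strings are made up for illustration, and it assumes that positions are counted from 1, that the letter in the sequence is the original residue, and that the letter in the mutation string is the replacement.

```python
def mutation_labels(seq, mut):
    """Return labels like 'A4>T' for each position where mut holds a different letter."""
    labels = []
    # enumerate(..., start=1) gives 1-based positions, as in "L 152>M".
    for position, (original, mutated) in enumerate(zip(seq, mut), start=1):
        # Assumption: a space in mut means the residue is unchanged.
        if mutated != " " and mutated != original:
            labels.append(f"{original}{position}>{mutated}")
    return labels


# Made-up example: T replaces A at position 4, L replaces E at position 7.
print(mutation_labels("GEDAPEEMN", "   T  L  "))  # ['A4>T', 'E7>L']
```

Because strings are already iterable character by character, converting them to lists first is optional for this kind of comparison.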

ADD REPLY • link 2.6 years ago by Debut ▴ 20

I didn't understand the line "return comparison". Where does comparison come from?

ADD REPLY • link 2.6 years ago by Debut ▴ 20

It's not very clear to me from your post what you're trying to achieve, why pandas/dataframes are relevant tags, or even why your data is a table.

Am I correct in understanding that you have 2 sequences represented as columns of a table - thus each 'row' is a new sequence character? e.g.

Seq1, Seq2
A, A
C, C
T, C
G, G

And you want to identify the grid reference where they differ?

You haven't told us whether the sequences are aligned. If they aren't, this won't work (or will at least be basically meaningless).

If I am correct, then the below will work as a general approach but you'll need to modify it for your data input type: How to display mismatched sequences from alignment when using Biopython
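
As a small illustration of the alignment concern, a position-by-position comparison could at least be guarded by a length check. The helper name `check_same_length` is hypothetical, and equal length is only a necessary condition: it does not prove that the two strings are actually aligned.

```python
def check_same_length(seq, mut):
    """Raise ValueError if the two strings cannot be compared position by position."""
    if len(seq) != len(mut):
        raise ValueError(
            f"length mismatch: sequence has {len(seq)} characters, "
            f"mutation string has {len(mut)}"
        )


check_same_length("GEDAPEEMN", "   T  L  ")  # same length, passes silently
```

Without such a check, `zip` stops at the shorter string without raising, so trailing positions would be ignored.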

ADD REPLY • link 2.6 years ago by Joe 21k

Thank you for your answer. No, that is not the case. I have a CSV file with two columns of interest: column 4 holds a sequence and column 6 holds the sequence with the mutations. This CSV table has thousands of lines. Each line has a complete protein sequence in column 4 and a sequence of mutations in column 6. I would like to turn the two sequences into two lists in order to compare them. When the sequence of mutations has a letter at a given position, there is a mutation at that position between the reference sequence and the corresponding sequence; a space in the sequence of mutations means there is a match.
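
The description above could be turned into a per-row annotation along the following lines. This is a sketch under several assumptions: the indices 4 and 6 are the 0-based ones used in the question's code, the file has no header row, the rows shown are made up, and `annotate_rows` is a hypothetical helper. It uses the standard `csv` module, which also keeps the trailing newline out of the last field.

```python
import csv
import io

SEQ_INDEX = 4  # 0-based index of the sequence field, as in the question's code
MUT_INDEX = 6  # 0-based index of the mutation field, as in the question's code


def annotate_rows(rows):
    """Yield each row with one extra field listing its mutations, e.g. 'A4>T;E7>L'."""
    for row in rows:
        seq, mut = row[SEQ_INDEX], row[MUT_INDEX]
        labels = [
            f"{s}{pos}>{m}"
            for pos, (s, m) in enumerate(zip(seq, mut), start=1)
            if m != " " and m != s  # assumption: a space means a match
        ]
        yield row + [";".join(labels)]


# Made-up two-row sample standing in for data2.csv.
sample = io.StringIO(
    "id1,a,b,c,GEDAPEEMN,d,   T  L  \n"
    "id2,a,b,c,GEDAPEEMN,d,         \n"
)
output = io.StringIO()
csv.writer(output).writerows(annotate_rows(csv.reader(sample)))
print(output.getvalue())
```

For a real file, the `io.StringIO` objects would be replaced by files opened with `open(..., newline='')`, as the `csv` module documentation recommends.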

ADD REPLY • link 2.6 years ago by Debut ▴ 20

Please provide some minimal working example data and some expected output.

ADD REPLY • link 2.6 years ago by Joe 21k

Seq           GEDAPEEMN
Mut                   LM

output seq     ['G', 'E', 'D', 'A', 'P', 'E', 'E', 'M', 'N'] (list)
output mut [' L  ', 'M ', '    ', '', '',' ', ' ', ' ', ' ']

ADD REPLY • link updated 2.6 years ago by Joe 21k • written 2.6 years ago by Debut ▴ 20

score 0 · Answer 1 · 2022-05-20

2.6 years ago

Debut ▴ 20

I have tried it and it seems to work:

for ligne in lignes:
    split_tableau = ligne.split(",")
    seq = split_tableau[4]
    Sequ = list(seq)
    print(Sequ)
    mut = split_tableau[6]
    Mut = list(mut)
    print(Mut)

ADD COMMENT • link 2.6 years ago by Debut ▴ 20