Hello, I have started writing code that reads two columns from a CSV file: one holds the protein sequence and the other the sequence with mutations (MUT). I would like to get the sequences, split each one into a list of characters, and then compare the two lists. Here is my code, but when I run print(caraM), each line contains a single letter of the sequence, whereas I wanted one list for the whole mutation sequence (for example: caraMut = ['A', 'M', 'T', 'N', ...]) so that I can compare the two sequences. Here is my Python code:
with open('data2.csv', 'r') as myFile:
    lignes = myFile.readlines()
    for ligne in lignes:
        split_tableau = ligne.split(",")
        seq = split_tableau[4]
        mut = split_tableau[6]
        for caraS in seq:
            caraSeq = caraS.split()
        for caraM in mut:
            caraMut = caraM.split()
            print(caraMut)
for i in seq:
    for j in mut:
        if (i != j) and (j != " "):
            box = print(i, ">", "j")
This?
You have tags like pandas and dataframe but you are not using any pandas. Do you want it in pandas or pure python?
In pandas there are many options, but for you something like this would be the easiest (not working, just inspiration)
Thank you for your reply. Either pandas or pure Python is fine. The goal is to have two lists holding the sequences so that I can access and manipulate, for example, the mutation positions. I have to add a column to my CSV file that highlights the mutations, for example (L 152>M), i.e. at position 152 the amino acid L has been changed to M. So with the code you showed me, I don't think we can get the mutation positions.
I didn't understand the line "return comparison". Where does comparison come from?
It's not very clear to me from your post what you're trying to achieve, why pandas/dataframes are relevant tags, or why your data is even a table.
Am I correct in understanding that you have 2 sequences represented as columns of a table - thus each 'row' is a new sequence character? e.g.
And you want to identify the grid reference where they differ?
You haven't told us whether the sequences are aligned. If they aren't, this won't work (or will at least be basically meaningless).
If I am correct, then the below will work as a general approach but you'll need to modify it for your data input type: How to display mismatched sequences from alignment when using Biopython
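As a rough illustration of that general approach (this is not the code from the linked post), here is a minimal sketch of a position-by-position comparison in plain Python. It assumes the two sequences are already aligned and therefore have the same length; the function name `mismatches` and the example strings are made up for illustration.

```python
def mismatches(ref, other):
    """Return (position, ref_char, other_char) for every differing position.

    Positions are 1-based. Both sequences are assumed to be aligned already,
    so they must have the same length.
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(ref, other), start=1)
        if a != b
    ]


# Made-up example sequences, not taken from the asker's data.
print(mismatches("MKTLLV", "MKSLLI"))  # [(3, 'T', 'S'), (6, 'V', 'I')]
```

Raising an error on unequal lengths is a deliberate choice here: for unaligned sequences the reported positions would be meaningless.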
Thank you for your answer. No, it's not. I start with a CSV file with two columns of interest: column 4 holds a sequence and column 6 the sequence with the mutations. This CSV table has thousands of lines. Each line has a complete protein sequence in column 4 and a sequence of mutations in column 6. I would like to turn the two sequences into two lists so that I can compare them. Wherever the sequence of mutations has a letter, there is a mutation at that position between the reference sequence and the corresponding sequence; wherever the sequence of mutations has a space, there is a match.
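Given that description, a minimal sketch in plain Python could look like the following. Note that `list(seq)` already gives a list of characters such as `['M', 'K', 'T']`, whereas calling `.split()` on a single character only returns a one-element list. The sketch makes several assumptions: the two columns sit at zero-based indices 4 and 6 (as in `split_tableau[4]` and `split_tableau[6]`), the file has no header row, and the mutation string lines up with the sequence position by position. The helper names, the `;` separator, and the demo row are hypothetical.

```python
import csv
import io

SEQ_COL = 4  # zero-based index, taken from split_tableau[4] in the question
MUT_COL = 6  # zero-based index, taken from split_tableau[6] in the question


def mutation_labels(seq, mut):
    """Return labels like 'L152>M' for each non-space character in mut.

    Positions are 1-based. zip() stops at the shorter string, so positions
    missing from the end of mut are treated as matches.
    """
    labels = []
    for pos, (ref, alt) in enumerate(zip(seq, mut), start=1):
        if alt != " ":
            labels.append(f"{ref}{pos}>{alt}")
    return labels


def add_mutation_column(rows):
    """Yield each row with one extra trailing cell listing its mutations."""
    for row in rows:
        labels = mutation_labels(row[SEQ_COL], row[MUT_COL])
        yield row + [";".join(labels)]


# Made-up demo row standing in for data2.csv.
demo = io.StringIO("a,b,c,d,MKTLLV,e,  S  I\n")
out = io.StringIO()
csv.writer(out).writerows(add_mutation_column(csv.reader(demo)))
print(out.getvalue())
```

For a real file, the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`. Using `csv.reader` instead of `ligne.split(",")` also avoids the trailing newline ending up in the last field and copes with quoted fields that contain commas.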
Please provide some minimal working example data and some expected output.