Question

Bioinformatics Q: How to align and compare two elements (sequence) in a list using python

10.4 years ago

Jason Lin • 0

Here is my question:

I've got a file which looks like this:

103L Sequence: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL Disorder: ----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX

Each record contains a name (103L in this case), the protein sequence after the "Sequence:" label, and the disorder annotation after the "Disorder:" label. In the annotation, "-" means that the position is ordered and "X" means that the position is disordered. For example, the final "XX" means that the last two positions of the protein sequence, "NL", are disordered. After I use the split method, the line looks like this:

['>103L', 'Sequence:', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'Disorder:', '----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX']

I want to use Python to find the disordered residues and their positions. The final file should look something like this:

Name Sequence: 'real sequence'
Disorder: position (Posi) residue-name (R)

Taking 103L as an example:

103L Sequence: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

Disorder: Posi R
               34    K
               35    S
               36    P
               37    S
               38    L
               39    N
               65    N
               66    L

I am new to Python and really hope someone can help me. Thank you so much!

python protein-sequence alignment • 3.0k views

updated 3.1 years ago by Ram 44k • written 10.4 years ago by Jason Lin • 0

Ram · Answer 1 · 2014-07-08

10.4 years ago

Asaf 10k

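# 'data' is one line of the input file (name, "Sequence:", sequence, "Disorder:", mask)
sp_line = data.split()
print("Disorder: Posi R")
for i, x in enumerate(sp_line[4]):
    if x == 'X':
        # i + 1 reports 1-based positions; use i for 0-based positions
        print('%d\t%s' % (i + 1, sp_line[2][i]))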
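To apply the same idea to every record in a file, the loop can be wrapped in a function that builds the whole report for one line. This is only a sketch under assumptions: `format_record` and its `one_based` switch are names made up for illustration, and each non-empty line is assumed to hold exactly the five whitespace-separated fields shown in the question.

```python
def format_record(line, one_based=True):
    """Return the report for one record line as a string.

    Assumes five whitespace-separated fields: name, 'Sequence:',
    sequence, 'Disorder:' and the disorder mask.
    """
    name, _, sequence, _, mask = line.split()
    offset = 1 if one_based else 0
    rows = ["%s Sequence: %s" % (name.lstrip(">"), sequence),
            "Disorder: Posi R"]
    for i, flag in enumerate(mask):
        if flag == "X":
            rows.append("%d\t%s" % (i + offset, sequence[i]))
    return "\n".join(rows)


# Stand-in for the input; with a real file, iterate over an open
# file handle instead of this list.
lines = [
    ">103L Sequence: MNIFEMLRID Disorder: ---XX----X",
    "",
]
for line in lines:
    if line.strip():  # skip blank lines
        print(format_record(line))
```

Passing `one_based=False` gives positions that start at 0, which appears to be the numbering used in the expected output of the question.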

updated 5.5 years ago by Ram 44k • written 10.4 years ago by Asaf 10k

Ram · Answer 2 · 2014-07-08

For each input line, this function will give you a dictionary that you can use later to format your output as desired:

def getDisorderedPositions(sequence, matches):
    """For positions in 'matches' where char is not '-', return the corresponding char
    in 'sequence' in the form of a dictionary.
    """
    if len(sequence) != len(matches):
        return None
    disorder = {}
    for i in range(len(sequence)):
        if matches[i] != '-':
            disorder[i] = sequence[i]
    return disorder

Example:

region= ['>103L', 'Sequence', 'MNIFEMLRID', 'Disorder', '---XX----X']
sequence= region[2]
matches= region[4]

dis= getDisorderedPositions(sequence, matches)
for k in sorted(dis.keys()):
    print(k, dis[k])
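
Output (positions are 0-based; shown as printed by Python 2, while Python 3 prints "3 F" and so on):

(3, 'F')
(4, 'E')
(9, 'D')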
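As one possible way to turn that dictionary into the table asked for in the question, the sketch below formats it with a header row. `get_disordered_positions` repeats the logic of the function above so the block runs on its own, and `format_disorder` is a hypothetical helper whose column widths are arbitrary.

```python
def get_disordered_positions(sequence, matches):
    # Same logic as getDisorderedPositions, repeated so this block runs alone.
    if len(sequence) != len(matches):
        return None
    return {i: res for i, res in enumerate(sequence) if matches[i] != '-'}


def format_disorder(disorder):
    """Format a {position: residue} dictionary as a small table."""
    if disorder is None:
        raise ValueError("sequence and disorder strings differ in length")
    rows = ["Disorder: Posi R"]
    for pos in sorted(disorder):
        # Positions are the 0-based keys of the dictionary.
        rows.append("%4d  %s" % (pos, disorder[pos]))
    return "\n".join(rows)


dis = get_disordered_positions('MNIFEMLRID', '---XX----X')
print(format_disorder(dis))
```

Checking for `None` before formatting keeps a length mismatch between the two strings from passing silently.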

Ram · Answer 3 · 2014-07-08

10.4 years ago

Matt Shirley 10k

I like Asaf's answer, but I also like the idea of iterating over the data more and indexing less:
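One way to iterate with less indexing is to walk the sequence and the disorder string together with `zip`. The sketch below is an assumption about what that could look like, not necessarily the code this answer originally contained; `disordered_residues` and its `start` parameter are hypothetical names.

```python
def disordered_residues(sequence, mask, start=1):
    """Yield (position, residue) for every 'X' in the disorder mask."""
    # zip pairs each residue with its flag; enumerate supplies the position.
    for position, (residue, flag) in enumerate(zip(sequence, mask), start=start):
        if flag == "X":
            yield position, residue


fields = ">103L Sequence: MNIFEMLRID Disorder: ---XX----X".split()
print("Disorder: Posi R")
for position, residue in disordered_residues(fields[2], fields[4]):
    print("%d\t%s" % (position, residue))
```

Setting `start=0` switches to 0-based positions. Note that `zip` stops at the shorter of the two strings, so a length mismatch would go unnoticed unless it is checked separately.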

updated 3.1 years ago by Ram 44k • written 10.4 years ago by Matt Shirley 10k