Bioinformatics Q: How to align and compare two elements (sequence) in a list using python

0

11.1 years ago

Jason Lin • 0

Here is my question.

I have a file that looks like this:

103L Sequence: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL Disorder: ----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX

Each line contains a name (here 103L), the protein sequence after the "Sequence:" label, and the disorder annotation after the "Disorder:" label. A "-" means that the position is ordered, and an "X" means that the position is disordered. For example, the final "XX" in the disorder string means that the last two positions of the protein sequence, "NL", are disordered. After I call the split method on the line, it looks like this:

['>103L', 'Sequence:', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'Disorder:', '----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX']

I want to use Python to find the disordered residues and their positions. The final file should look something like this:

Name Sequence: 'real sequence'
Disorder: position (Posi) residue name (R)

Taking 103L as an example:

103L Sequence: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

Disorder: Posi R
               34    K
               35    S
               36    P
               37    S
               38    L
               39    N
              165    N
              166    L

I am new to Python and really hope someone can help me. Thank you so much!

python protein-sequence alignment • 3.3k views

updated 3.7 years ago by Ram 45k • written 11.1 years ago by Jason Lin • 0

2

11.1 years ago

Asaf 10k

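# data holds one line of the input file
sp_line = data.split()
print("Disorder: Posi R")
for i, x in enumerate(sp_line[4]):
    if x == 'X':
        # i is 0-based; i + 1 gives the 1-based residue position
        print('%d\t%s' % (i + 1, sp_line[2][i]))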
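
The same idea can be wrapped in a small function so the numbering is explicit. This is only a sketch: the function name `disordered_residues` and its `start` parameter are made up for illustration, and the input line is a shortened example rather than the full 103L record. With `start=1` the first residue is position 1; with `start=0` the positions match the 0-based numbering used in the question's table.

```python
def disordered_residues(sequence, disorder, start=1):
    """Return (position, residue) pairs wherever `disorder` has an 'X'.

    `start` is the number given to the first residue: 1 for the usual
    protein numbering, 0 for Python-style indices.
    """
    if len(sequence) != len(disorder):
        raise ValueError("sequence and disorder strings differ in length")
    return [(pos, res)
            for pos, (res, flag) in enumerate(zip(sequence, disorder), start)
            if flag == "X"]


# Shortened example line in the same layout as the question's file.
line = ">103L Sequence: MNIFEMLRID Disorder: ---XX----X"
name, _, seq, _, dis = line.split()

print("Disorder: Posi R")
for pos, res in disordered_residues(seq, dis):
    print("%d\t%s" % (pos, res))
```

Raising an error on a length mismatch is a design choice here; it makes a truncated record fail loudly instead of silently reporting wrong residues.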

updated 6.1 years ago by Ram 45k • written 11.1 years ago by Asaf 10k

1

11.1 years ago

dariober 15k

For each input line, this function will give you a dictionary that you can use later to format your output as desired:

def getDisorderedPositions(sequence, matches):
    """For positions in 'matches' where char is not '-', return the corresponding char
    in 'sequence' in the form of a dictionary keyed by 0-based position.
    Return None if the two strings differ in length.
    """
    if len(sequence) != len(matches):
        return None
    disorder = {}
    for i in range(len(sequence)):
        if matches[i] != '-':
            disorder[i] = sequence[i]
    return disorder

Example:

region= ['>103L', 'Sequence', 'MNIFEMLRID', 'Disorder', '---XX----X']
sequence= region[2]
matches= region[4]

dis= getDisorderedPositions(sequence, matches)
for k in sorted(dis.keys()):
    print(k, dis[k])
# Prints (positions are 0-based):
# 3 F
# 4 E
# 9 D
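
To turn such a dictionary into the layout asked for in the question, something like the following might do. It is a rough sketch: `format_disorder` is a hypothetical helper, and the column widths are a guess at the spacing in the question's example, so adjust them as needed.

```python
def format_disorder(name, sequence, disorder):
    """Build the report text from a {position: residue} dictionary."""
    lines = ["%s Sequence: %s" % (name, sequence), "", "Disorder: Posi R"]
    for pos in sorted(disorder):
        # Right-align the position so the residues line up in one column.
        lines.append("%15d    %s" % (pos, disorder[pos]))
    return "\n".join(lines)


# Dictionary for the shortened example sequence 'MNIFEMLRID'.
dis = {3: "F", 4: "E", 9: "D"}
print(format_disorder("103L", "MNIFEMLRID", dis))
```

Returning a string instead of printing directly makes it easy to write the same text to a file later.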

updated 6.1 years ago by Ram 45k • written 11.1 years ago by dariober 15k

1

11.1 years ago

Matt Shirley 10k

I like Asaf's answer, but I prefer iterating over the data directly, with less indexing:

with open('input.sequencefile') as fh:
    for line in fh:
        name, seqid, seq, disid, dis = line.split()
        print(' '.join([name, seqid]))
        print(seq)
        print(disid)
        print('Pos R')
        for i, (s, x) in enumerate(zip(seq, dis)):
            if x == 'X':
                # str() is needed because join() only accepts strings
                print(' '.join([str(i + 1), s]))
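
Since the question asks for a final file rather than printed output, the loop could also write to a second file handle. The sketch below assumes one record per line with exactly five whitespace-separated fields; `convert` is a made-up name, and `io.StringIO` objects stand in for real files so the example runs on its own.

```python
import io


def convert(in_handle, out_handle):
    """Write a Posi/R table for every record read from in_handle."""
    for line in in_handle:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        name, _, seq, _, dis = fields
        if len(seq) != len(dis):
            raise ValueError("length mismatch in record %s" % name)
        # Drop the leading '>' from the name, if there is one.
        out_handle.write("%s Sequence: %s\n" % (name.lstrip(">"), seq))
        out_handle.write("Disorder: Posi R\n")
        # Positions are 1-based here.
        for pos, (res, flag) in enumerate(zip(seq, dis), 1):
            if flag == "X":
                out_handle.write("%d\t%s\n" % (pos, res))


# In-memory stand-ins for real input and output files.
source = io.StringIO(">103L Sequence: MNIFEMLRID Disorder: ---XX----X\n")
result = io.StringIO()
convert(source, result)
print(result.getvalue(), end="")
```

With real files, the two handles would come from `open('input.sequencefile')` and an output file opened with mode `'w'`; the output file name is up to you.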

updated 3.7 years ago by Ram 45k • written 11.1 years ago by Matt Shirley 10k