Question

How to create a dataset using sequence file in python

10.8 years ago

Jason Lin • 0

I have a protein sequence file that looks like this:

>102L:A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL       -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX

The first field is the name of the sequence, the second is the actual protein sequence, and the third is an indicator that shows whether any residues are missing coordinates. In this case, notice that there are two "X" characters at the end. That means the last two residues of the sequence, "NL" in this case, are missing coordinates.

Using Python, I would like to generate a table with the following columns:

1. name of the sequence
2. total number of missing coordinates (which is the number of X)
3. the range of these missing coordinates (which is the range of the positions of those X)
4. the length of the sequence
5. the actual sequence

So the final result should look like this:

>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

And my code looks like this so far:

total_seq = []
with open('sample.txt') as lines:
    for l in lines:
        split_list = l.split()

        # Assign the list number
        header = split_list[0]                                # 1
        seq = split_list[1]                                   # 5
        disorder = split_list[2]

        # count sequence length and total residue of missing coordinates
        sequence_length = len(seq)                            # 4

        # initialise the counter once, before the loop, so it is not reset
        counts = 0
        for x in disorder:
            if x == 'X':
                counts = counts + 1

        total_seq.append([header, seq, str(counts)])   # obviously I haven't finished coding 2 & 3

with open('new_sample.txt', 'a') as f:
    for lol in total_seq:
        f.write('\n'.join(lol))

I'm new to Python. Could anyone help, please? Thank you so much!

Have any of the answers/comments to your previous questions (links above) provided any assistance?

written 10.8 years ago by kavin.pl

They helped. But I still don't understand how to solve numbers 2 and 3 in my goal, which are the total number of missing coordinates and the range of those missing coordinates.

written 10.8 years ago by Jason Lin

Ram · Answer 1 · 2014-07-11

For question 2):

disorder = '---XX--XXX--'
print(disorder.count('X'))  # prints 5

This uses the string's count() method, which returns the number of times 'X' occurs in the string.
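Applied to one record in the format from the question, this could replace the counting loop. The following is only a minimal sketch: the record is a shortened, made-up example, and the variable names simply follow the question's code.

```python
# Hypothetical shortened record: header, sequence, disorder string
line = ">102L:A MNIFEMLR ------XX"
header, seq, disorder = line.split()

missing = disorder.count('X')   # number of residues without coordinates
length = len(seq)               # total sequence length
print(header, missing, length)
```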

For question 3):

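from itertools import groupby, count

# 1-based positions of every 'X', to match the 163-164 style in the question
indices = [i for i, x in enumerate(disorder, 1) if x == 'X']

def as_range(iterable): # not sure how to do this part elegantly
    l = list(iterable)
    if len(l) > 1:
        return '{0}-{1}'.format(l[0], l[-1])
    else:
        return '{0}'.format(l[0])

# n - next(c) is constant within a run of consecutive positions,
# so groupby yields one group per run
print(','.join(as_range(g) for _, g in groupby(indices, key=lambda n, c=count(): n - next(c))))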
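The groupby call relies on one observation: within a run of consecutive positions, each position minus its rank in the list is the same number, and that number changes whenever there is a gap. The sketch below spells this out with enumerate instead of the count() default argument and then assembles the full output row. The function names missing_ranges and summarize are hypothetical, positions are assumed to be 1-based (to match the 163-164 in the question), and the fields are joined with single spaces as in the question's example.

```python
from itertools import groupby

def missing_ranges(disorder):
    """Return 1-based ranges of 'X' runs, e.g. '4-5,8-10' ('' if there are none)."""
    # 1-based positions of every 'X'
    positions = [i for i, ch in enumerate(disorder, start=1) if ch == 'X']
    ranges = []
    # position - rank is constant inside a run of consecutive positions,
    # so groupby starts a new group exactly where a gap occurs
    for _, group in groupby(enumerate(positions), key=lambda pair: pair[1] - pair[0]):
        run = [pos for _, pos in group]
        if len(run) == 1:
            ranges.append(str(run[0]))
        else:
            ranges.append('{0}-{1}'.format(run[0], run[-1]))
    return ','.join(ranges)

def summarize(line):
    """Build one output row: name, X count, X ranges, length, sequence."""
    header, seq, disorder = line.split()
    fields = [header, str(disorder.count('X')), missing_ranges(disorder),
              str(len(seq)), seq]
    return ' '.join(fields)

print(missing_ranges('---XX--XXX--'))  # 4-5,8-10
```

Calling summarize on each line of the input file and writing the result followed by a newline would give one row per sequence. A record with no X at all yields an empty range field, which you may want to replace with a placeholder.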

This is more complicated. You may want to read these: