Question

Compiling a UniProt txt file into a table of AC (accession number) and MOD_RES

9.0 years ago

ahmedakhokhar ▴ 150

Dear all,

I am not very familiar with Python and am trying to retrieve data from a UniProt text file (test1) that looks like this:

ID   YSH1_YEAST              Reviewed;         779 AA.
AC   Q06224; D6VYS4;
DT   10-JAN-2006, integrated into UniProtKB/Swiss-Prot
DT   01-NOV-1996, sequence version 1.
..
FT   METAL       184    184       Zinc 1. {ECO:0000250}.
FT   METAL       184    184       Zinc 2. {ECO:0000250}.
FT   METAL       430    430       Zinc 2. {ECO:0000250}.
FT   MOD_RES     517    517       Phosphoserine; by ATM or ATR.
FT                                {ECO:0000244|PubMed:18407956}.
FT   MUTAGEN      37     37       D->N: Loss of endonuclease activity.
.
.

So far I am able to retrieve the MOD_RES and AC lines separately, using these code snippets:

import re

# Capture the first (primary) accession on each AC line
ac_regex = re.compile(r'^AC\s+(\w+)')
with open('test1') as test:
    for line in test:
        match = ac_regex.match(line)
        if match:
            print(match.group(1))

Q06224
P16521

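import re

# Match MOD_RES feature lines; MOD_RES is matched as literal text
modres_regex = re.compile(r'^FT\s+MOD_RES\s+\w+\s+\w+\s+\w.+')
with open('test1') as testfile:
    for line in testfile:
        match = modres_regex.match(line)
        if match:
            # Fixed-width slice: end position plus the start of the description
            print(match.group(0)[23:48])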

517       Phosphoserine;
2       N-acetylserine
187       N6,N6,N6-trime
196       N6,N6,N6-trime

The goal is to get each AC and its modified residues (MOD_RES) into a tab-separated table. If more than one MOD_RES appears for a particular AC, the AC should be repeated on each row, so the table looks like this:

AC  MOD_RES
Q06224  517    517       Phosphoserine
P04524  75    75       Phosphoserine
Q06224  57    57       Phosphoserine
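
One way to get such a table is to read the file once, remember the accession of the current entry, and emit a row for every MOD_RES line that follows it. The following is only a minimal sketch under assumptions drawn from the sample above: each entry starts with an ID line, the first accession on the first AC line is the one wanted, and the description ends at the first ';' or evidence tag. The names modres_rows and write_table, the column headers, and the output file name are hypothetical.

```python
import re

# Assumed patterns, based only on the sample lines in the question.
AC_RE = re.compile(r'^AC\s+(\w+)')
MOD_RE = re.compile(r'^FT\s+MOD_RES\s+(\d+)\s+(\d+)\s+(.+)')


def modres_rows(lines):
    """Yield (accession, start, end, description) for every MOD_RES line."""
    accession = None
    for line in lines:
        if line.startswith('ID '):
            accession = None  # a new entry begins
        ac = AC_RE.match(line)
        if ac and accession is None:
            accession = ac.group(1)  # keep only the primary accession
            continue
        mod = MOD_RE.match(line)
        if mod and accession is not None:
            start, end, text = mod.groups()
            # Cut the description at the first ';' or '{ECO:...}' tag,
            # then drop surrounding spaces and a trailing '.'
            description = text.split(';')[0].split('{')[0].strip().rstrip('.')
            yield accession, start, end, description


def write_table(in_path, out_path):
    """Write a tab-separated AC / MOD_RES table (hypothetical helper)."""
    with open(in_path) as src, open(out_path, 'w') as out:
        out.write('AC\tSTART\tEND\tMOD_RES\n')
        for row in modres_rows(src):
            out.write('\t'.join(row) + '\n')
```

Calling write_table('test1', 'modres.tsv') would then write one row per modification, repeating the AC whenever an entry has several MOD_RES lines. Continuation lines (FT lines without a feature key) are ignored here, so a description that wraps onto the next line would be cut short.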

python uniprot • 1.8k views

updated 2.3 years ago by Ram 44k • written 9.0 years ago by ahmedakhokhar ▴ 150

Step 1: use Bio.SwissProt to parse the input. If you don't, this is just text processing, and not many people would help you out with such generic code.

9.0 years ago by Ram 44k