9.0 years ago
ahmedakhokhar
Dear all,
I am not very familiar with Python and am trying to retrieve data from a UniProt text file (test1) that looks like this:
ID YSH1_YEAST Reviewed; 779 AA.
AC Q06224; D6VYS4;
DT 10-JAN-2006, integrated into UniProtKB/Swiss-Prot
DT 01-NOV-1996, sequence version 1.
..
FT METAL 184 184 Zinc 1. {ECO:0000250}.
FT METAL 184 184 Zinc 2. {ECO:0000250}.
FT METAL 430 430 Zinc 2. {ECO:0000250}.
FT MOD_RES 517 517 Phosphoserine; by ATM or ATR.
FT {ECO:0000244|PubMed:18407956}.
FT MUTAGEN 37 37 D->N: Loss of endonuclease activity.
.
.
So far I am able to retrieve the MOD_RES and AC lines separately, using these snippets:
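import re

# Match the start of each AC (accession) line
regex2 = re.compile(r'^AC\s+\w+')
with open('test1', 'r') as test:
    for line in test:
        for a in regex2.findall(line):
            # The slice drops the "AC" tag and the padding before the accession
            print(a[5:12])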
Q06224
P16521
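import re

# Match MOD_RES feature lines ("\M" is not a valid regex escape, so the backslash is dropped)
regex = re.compile(r'^FT\s+MOD_RES\s+\w+\s+\w+\s+\w.+')
with open('test1') as testfile:
    for line in testfile:
        for p in regex.findall(line):
            # The slice keeps the end position and the start of the description
            print(p[23:48])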
517 Phosphoserine;
2 N-acetylserine
187 N6,N6,N6-trime
196 N6,N6,N6-trime
The goal is to get each AC and its relevant modified residues (MOD_RES) into a tab-separated format. Also, if more than one MOD_RES appears in the data for a particular AC, that AC should be duplicated so that the table looks like this:
AC MOD_RES
Q06224 517 517 Phosphoserine
P04524 75 75 Phosphoserine
Q06224 57 57 Phosphoserine
Step 1: Use Bio.SwissProt (from Biopython) to parse the input. If you don't, this is just generic text processing, and not many people will help you out with that kind of code.
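For comparison, the plain text-processing route can be sketched with the standard library alone. This is only a rough sketch, not the Bio.SwissProt approach recommended above. It assumes the older flat-file layout shown in the question (start, end and description all on the MOD_RES line), that records end with a "//" line, and that only the first accession on the first AC line is wanted. The name mod_res_rows is made up for illustration.

```python
import re

# Assumed layout: "AC   Q06224; D6VYS4;" and
# "FT   MOD_RES   517   517   Phosphoserine; by ATM or ATR."
AC_RE = re.compile(r'^AC\s+(\w+)')
MOD_RE = re.compile(r'^FT\s+MOD_RES\s+(\d+)\s+(\d+)\s+(.*)')


def mod_res_rows(lines):
    """Yield (accession, start, end, description) for every MOD_RES line.

    The accession is repeated for each MOD_RES found in the same record.
    """
    accession = None
    for line in lines:
        if line.startswith('//'):
            # End of record: forget the accession of the previous entry
            accession = None
            continue
        ac = AC_RE.match(line)
        if ac and accession is None:
            # Keep only the first (primary) accession of the record
            accession = ac.group(1)
            continue
        mod = MOD_RE.match(line)
        if mod and accession is not None:
            start, end, text = mod.groups()
            # Cut the description at the first ';' or evidence tag '{'
            desc = re.split(r'[;{]', text)[0].strip().rstrip('.')
            yield accession, start, end, desc


# Lines taken from the example in the question (column padding is a guess)
sample = """\
ID   YSH1_YEAST   Reviewed;   779 AA.
AC   Q06224; D6VYS4;
FT   METAL   184   184   Zinc 1. {ECO:0000250}.
FT   MOD_RES   517   517   Phosphoserine; by ATM or ATR.
FT   {ECO:0000244|PubMed:18407956}.
//
""".splitlines()

rows = list(mod_res_rows(sample))
print('AC\tMOD_RES')
for row in rows:
    print('\t'.join(row))
```

To run it on the real file, pass the open file handle instead of the sample lines, for example mod_res_rows(open('test1')). Newer UniProt releases put the description on a separate /note= line, so the MOD_RES pattern would need adjusting there.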