Translate long names of proteins to short names
5.3 years ago
muk.smita • 0

I have a list of proteins which are identified as follows:

sp|P20930|FILA_HUMAN Filaggrin OS=Homo sapiens OX=9606 GN=FLG PE=1 SV=3
sp|Q5D862|FILA2_HUMAN Filaggrin-2 OS=Homo sapiens OX=9606 GN=FLG2 PE=1 SV=1
sp|P29508|SPB3_HUMAN Serpin B3 OS=Homo sapiens OX=9606 GN=SERPINB3 PE=1 SV=2
sp|Q08188|TGM3_HUMAN Protein-glutamine gamma-glutamyltransferase E OS=Homo sapiens OX=9606 GN=TGM3 PE=1 SV=4
sp|P31025|LCN1_HUMAN Lipocalin-1 OS=Homo sapiens OX=9606 GN=LCN1 PE=1 SV=1
sp|P62805|H4_HUMAN Histone H4 OS=Homo sapiens OX=9606 GN=HIST1H4A PE=1 SV=2

Can I translate these identifiers into a more manageable form?

sequence • 1.1k views

What do you mean by 'more manageable'? Would that be the UniProt accession (e.g., P20930), the full name (e.g., Filaggrin), or the gene symbol (e.g., FLG)? Please be more specific.


I would like to know how to shorten each protein identifier to its full name and its gene symbol.

Thank you


You can use awk to extract these. Set the field separator in a BEGIN block so that it already applies to the first line.

For the full name, something like this will work:

awk 'BEGIN { FS="HUMAN " } { print $2 }' file.txt | awk 'BEGIN { FS=" OS=" } { print $1 }'

For the gene symbols, something like this:

awk 'BEGIN { FS="GN=" } { print $2 }' file.txt | awk 'BEGIN { FS=" PE=" } { print $1 }'
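Both values can also be pulled out in a single pass. The following is only a minimal sketch, assuming every line has the layout shown in the question (one leading "sp|ACCESSION|ENTRY" token, then the full name, then " OS=..." and optionally "GN=..."). The function name uniprot_short is hypothetical, not an existing tool.

```shell
# Hypothetical helper: print "full name<TAB>gene symbol" for each header line.
# Reads the files given as arguments, or standard input if there are none.
uniprot_short() {
  awk '{
    name = $0
    sub(/^[^ ]+ /, "", name)   # drop the leading "sp|ACCESSION|ENTRY " token
    sub(/ OS=.*$/, "", name)   # drop everything from " OS=" to the end of the line
    gene = ""                  # stays empty when the line has no GN= field
    if (match($0, /GN=[^ ]+/))
      gene = substr($0, RSTART + 3, RLENGTH - 3)   # text after "GN="
    print name "\t" gene
  }' "$@"
}
```

It would be called as uniprot_short file.txt. Because it does not rely on the "HUMAN " suffix, it should also work for entries from other species, as long as the header layout is the same.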
5.3 years ago

Use cut to extract the second field of each line, i.e. the UniProt accession (P20930, Q5D862, etc.):

cut -d "|" -f 2 file.txt

-d defines the field separator, here "|".

-f selects which field to print, here the 2nd one.
