Format UniProt FASTA headers
Hi,
I have a multi-FASTA file with headers in the following format:
>sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2
I would like to reformat the headers so that only the UniProt accession or the entry name remains, giving either:
>Q9Y5Q8
or
>TF3C5_HUMAN
I think sed can do it, but I don't know the exact regular expression.
Thanks
awk '{if ($0 ~ /^>/) {split($0 ,a,"|"); print ">"a[2]} else { print;}}' your_file > new_file
If you want the TF* names instead, then:
awk '{if ($0 ~ /^>/) {split($0 ,a,"|"); split(a[3],b," "); print ">"b[1]} else { print;}}' your_file > new_file
awk -F '|' '/^>/ {printf(">%s\n",$2 );next;} {print;}' input.fasta
Save the script below as script.py and run it as
python script.py file.fasta
The result is written to file_extracted.fasta and looks like this:
>Q9Y5Q8
LASJDQSMLASKDNAL
import sys

usage = '''
------------------------------------------------------------------------------------
Usage: python script.py file.fasta
------------------------------------------------------------------------------------
'''

if len(sys.argv) != 2:
    print(usage)
    sys.exit(1)

seqs = {}    # accession -> full sequence
name = None
seq = ""
prefix = sys.argv[1].split('.')[0]

with open(sys.argv[1]) as fasta_seqs, open(prefix + '_extracted.fasta', 'w') as outfile:
    for line in fasta_seqs:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Store the previous record before starting a new one.
            if name is not None:
                seqs[name] = seq
            seq = ""
            name = line.split('|')[1]    # second field is the accession
        else:
            seq += line
    # Store the last record (the accumulated sequence, not just its final line).
    if name is not None:
        seqs[name] = seq
    for key, value in seqs.items():
        outfile.write('>' + key + '\n' + value + '\n')
Feel free to modify it as you need
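If the file is large, the headers can also be rewritten line by line without keeping the sequences in memory. The following is only a minimal sketch of that idea, assuming headers of the form db|accession|entry-name; the function name rewrite_headers and its field parameter are made up for illustration.

```python
import io


def rewrite_headers(lines, field="accession"):
    """Yield FASTA lines with UniProt-style headers reduced to one identifier.

    field="accession" keeps the second pipe-delimited field (e.g. Q9Y5Q8);
    any other value keeps the first word of the third field (e.g. TF3C5_HUMAN).
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line.split("|")
            if len(parts) < 3:
                # Not a db|accession|entry header: pass it through unchanged.
                yield line
            elif field == "accession":
                yield ">" + parts[1]
            else:
                yield ">" + parts[2].split(" ", 1)[0]
        else:
            # Sequence lines are passed through untouched.
            yield line


# Small in-memory example; a real run would iterate over an open file instead.
example = io.StringIO(
    ">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 "
    "OS=Homo sapiens GN=GTF3C5 PE=1 SV=2\n"
    "LASJDQSMLASKDNAL\n"
)
for out in rewrite_headers(example, field="entry"):
    print(out)
```

Because it is a generator, the records come out in the same order as in the input file and line wrapping of the sequences is preserved.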
If you only want the unique identifiers and not the sequences:
awk -F '|' '/^>/ {printf(">%s\n",$2 );}' proteome.fasta | cut -c 2- > identifiers.txt
Example input:
>sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476 OS=Aquifex aeolicus (strain VF5) OX=224324 GN=aq_1476 PE=4 SV=1
MLKSLTMENVKVVTGEIEKLRERIEKVKETLDLIPKEIEELERELERVRQEIAKKEDEL
AVAREIRHKEHEFTEVKQKIAYHRKYLERADSPREYERLLQERQKLIERAYKLSEEIYE
RRKYEALREEEEKLHQKEDEIEEKIHKLKKEYRALLNELKGLIEELNRKAREIIEKYGL
>tr|A0A384D5E1|A0A384D5E1_URSMA Prokineticin-1 OS=Ursus maritimus OX=29073 GN=PROK1 PE=3 SV=1
MRGAMRVSIMFLLVTVSDCAVITGACERDVQCGAGTCCAISLWLRGLRMCTPLGREGEEC
HPGSHKVPFFRRRQHHTCPCLPSLLCSRCLDGRYRCSTDLKNINF
Example output:
O67453
A0A384D5E1
What you need is
cut -d '|'
sed -e 's/^>.\|//' -e 's/ .//' file
thanks but this approach gives me p|Q9Y5Q8|TF3C5_HUMANeneral transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2
Sorry, my bad, I forgot the wildcards.
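For anyone who wants the regular expression itself, the sed attempt above depends on wildcards to discard everything around the identifier. The same idea can be sketched with Python's re module. The pattern below is an assumption about the header layout (database tag, accession, entry name, then free text), and the names HEADER, accession and entry_name are illustrative only.

```python
import re

# Assumed layout: '>' + db tag, '|', accession, '|', entry name, then the description.
# [^|]* and [^|]+ stop at the next pipe, so each group captures exactly one field.
HEADER = re.compile(r"^>[^|]*\|([^|]+)\|(\S+).*$")


def accession(line):
    """Return '>ACCESSION' for a matching header, or the line unchanged."""
    return HEADER.sub(r">\1", line)


def entry_name(line):
    """Return '>ENTRY_NAME' for a matching header, or the line unchanged."""
    return HEADER.sub(r">\2", line)


header = (">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C "
          "polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2")
print(accession(header))
print(entry_name(header))
```

Sequence lines do not start with '>', so the pattern does not match them and they are returned as they are.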