Format UniProt FASTA headers
Hi,
I have a multi-fasta file with a header in the following format:
>sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2
I would like to reformat the headers so that they contain only the UniProt accession (ACC) or the entry name, giving one of the following:
>Q9Y5Q8
or
>TF3C5_HUMAN
I think sed can do it, but I don't know the exact regular expression to use.
Thanks
sequence, fasta-header
written 7.6 years ago by jfertaj, updated 2.6 years ago by GenoMax
awk '{if ($0 ~ /^>/) {split($0,a,"|"); print ">"a[2]} else { print;}}' your_file > new_file
If you want the TF3C5_HUMAN-style entry names instead, then:
awk '{if ($0 ~ /^>/) {split($0,a,"|"); split(a[3],b," "); print ">"b[1]} else { print;}}' your_file > new_file
awk -F '|' '/^>/ {printf(">%s\n",$2);next;} {print;}' input.fasta
Save the script below as script.py and run it as
python script.py file.fasta
It writes file_extracted.fasta, which will contain records like this:
>Q9Y5Q8
LASJDQSMLASKDNAL
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import sys

syntax = '''
------------------------------------------------------------------------------------
Usage: python script.py file.fasta
------------------------------------------------------------------------------------
'''

if len(sys.argv) != 2:
    print(syntax)
    sys.exit(1)

# Map accession -> full sequence (dicts keep insertion order in Python 3.7+)
records = {}
name = None
seq = ""
prefix = sys.argv[1].split('.')[0]

with open(sys.argv[1], 'r') as fasta_seqs:
    for line in fasta_seqs:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # A new header closes the previous record
            if name is not None:
                records[name] = seq
            seq = ""
            # Header looks like >sp|Q9Y5Q8|TF3C5_HUMAN ...; field 2 is the accession
            name = line.split('|')[1]
        else:
            seq = seq + line

# Store the last record (the whole sequence, not just its final line)
if name is not None:
    records[name] = seq

with open(prefix + '_' + 'extracted.fasta', 'w') as outfile:
    for key, value in records.items():
        outfile.write('>' + key + '\n' + value + '\n')
Feel free to modify it as you need
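One possible modification, as a rough sketch: a small helper that returns either the accession or the entry name for a header line. The function name rename_header and its field parameter are made up for illustration, and it assumes headers follow the >db|ACCESSION|ENTRY_NAME description layout shown in the question, with the trailing newline already stripped.

```python
def rename_header(line, field="accession"):
    """Reduce a UniProt-style FASTA header to a single identifier.

    Sequence lines and headers that do not look like
    >db|ACCESSION|ENTRY_NAME ... are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    parts = line[1:].split("|")
    if len(parts) < 3 or not parts[2].split():
        # Not a UniProt-style header, so leave it alone
        return line
    if field == "accession":
        return ">" + parts[1]
    # The entry name is the first whitespace-delimited token of field 3
    return ">" + parts[2].split()[0]


header = (">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C "
          "polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2")
print(rename_header(header))                      # >Q9Y5Q8
print(rename_header(header, field="entry_name"))  # >TF3C5_HUMAN
```

Because sequence lines pass through untouched, the helper can be applied to every line of a file in turn, which avoids holding all sequences in memory.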
If you only want the unique identifiers and not the sequences:
awk -F '|' '/^>/ {print $2}' proteome.fasta > identifiers.txt
Example input:
>sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476 OS=Aquifex aeolicus (strain VF5) OX=224324 GN=aq_1476 PE=4 SV=1
MLKSLTMENVKVVTGEIEKLRERIEKVKETLDLIPKEIEELERELERVRQEIAKKEDEL
AVAREIRHKEHEFTEVKQKIAYHRKYLERADSPREYERLLQERQKLIERAYKLSEEIYE
RRKYEALREEEEKLHQKEDEIEEKIHKLKKEYRALLNELKGLIEELNRKAREIIEKYGL
>tr|A0A384D5E1|A0A384D5E1_URSMA Prokineticin-1 OS=Ursus maritimus OX=29073 GN=PROK1 PE=3 SV=1
MRGAMRVSIMFLLVTVSDCAVITGACERDVQCGAGTCCAISLWLRGLRMCTPLGREGEEC
HPGSHKVPFFRRRQHHTCPCLPSLLCSRCLDGRYRCSTDLKNINF
Example output:
O67453
A0A384D5E1
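For anyone who prefers Python, the same identifier extraction could be sketched roughly as follows. The function name uniprot_accessions is hypothetical, and the sequences in the example are shortened; it assumes the accession is always the second '|'-separated field of the header, as in the sp and tr entries above.

```python
def uniprot_accessions(lines):
    """Yield the accession (second '|'-separated field) of each header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line.rstrip("\n").split("|")
            if len(fields) >= 2:
                yield fields[1]


example = [
    ">sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476 "
    "OS=Aquifex aeolicus (strain VF5) OX=224324 GN=aq_1476 PE=4 SV=1\n",
    "MLKSLTMENVKVVTGEIEKLRERIEKVKETLDLIPKEIEELERELERVRQEIAKKEDEL\n",
    ">tr|A0A384D5E1|A0A384D5E1_URSMA Prokineticin-1 "
    "OS=Ursus maritimus OX=29073 GN=PROK1 PE=3 SV=1\n",
    "MRGAMRVSIMFLLVTVSDCAVITGACERDVQCGAGTCCAISLWLRGLRMCTPLGREGEEC\n",
]
print(list(uniprot_accessions(example)))  # ['O67453', 'A0A384D5E1']
```

An open file handle can be passed in place of the list, since the function only iterates over lines.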
What you need is cut -d '|'
sed -e 's/^>.\|//' -e 's/ .//' file
Thanks, but this approach gives me:
p|Q9Y5Q8|TF3C5_HUMANeneral transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2
Sorry, my bad, I forgot the wildcard.
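A side note on the regular expression itself: as far as I can tell, in GNU sed's basic syntax \| means alternation rather than a literal pipe, which would explain the truncated output above and may still bite after the wildcard is added. A pattern built on a negated character class sidesteps the issue. Below is a minimal sketch of that idea in Python, where the escaping is the other way round (\| is the literal pipe); HEADER_RE and accession_header are illustrative names, not part of any answer in this thread.

```python
import re

# '>' + database tag (e.g. sp or tr) + '|' + accession + '|' + rest of line.
# [^|] stops each field at the next pipe, so no greedy wildcard is needed.
HEADER_RE = re.compile(r"^>[^|]*\|([^|]+)\|.*$")


def accession_header(line):
    """Rewrite a UniProt-style header as '>ACCESSION'; other lines pass through."""
    return HEADER_RE.sub(r">\1", line)


header = (">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C "
          "polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2")
print(accession_header(header))              # >Q9Y5Q8
print(accession_header("LASJDQSMLASKDNAL"))  # LASJDQSMLASKDNAL
```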