Format UniProt FASTA headers
7.6 years ago
jfertaj ▴ 110

Hi,

I have a multi-fasta file with a header in the following format:

>sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2

I would like to reformat the headers to keep only the UniProt accession (ACC) or the entry name, to get the following:

>Q9Y5Q8

or

>TF3C5_HUMAN

I think sed can do it, but I don't know the exact regular expression.

Thanks

sequence fasta-header • 4.6k views

What you need is cut -d '|'


sed -e 's/^>.\|//' -e 's/ .//' file


Thanks, but this approach gives me p|Q9Y5Q8|TF3C5_HUMANeneral transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2


Sorry, my bad, I forgot the wildcard:

sed -e 's/^>.*|/>/' -e 's/ .*//' file
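
For anyone who wants to try the regular expression outside sed, here is a minimal Python sketch of the same idea. The function name shorten_header and its field parameter are made up for this illustration, and the pattern assumes the ">db|ACCESSION|ENTRY_NAME description" layout shown in the question.

```python
import re

# Assumes headers shaped like ">sp|Q9Y5Q8|TF3C5_HUMAN description ..."
# Group 1 is the accession, group 2 is the entry name (up to the first space).
HEADER_RE = re.compile(r"^>[^|]*\|([^|]*)\|(\S*)")


def shorten_header(line, field="accession"):
    """Return a shortened header; lines that do not match are returned unchanged."""
    match = HEADER_RE.match(line)
    if match is None:
        return line
    accession, entry_name = match.groups()
    return ">" + (accession if field == "accession" else entry_name)


header = (">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C "
          "polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2")
print(shorten_header(header))                # >Q9Y5Q8
print(shorten_header(header, "entry_name"))  # >TF3C5_HUMAN
```

Using [^|]* instead of .* keeps each part of the match inside a single field, so the result does not depend on how greedy the wildcard is.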
7.6 years ago
GenoMax 147k
awk '{if ($0 ~ /^>/)  {split($0,a,"|"); print ">"a[2]} else { print;}}' your_file > new_file

If you want the entry names (TF3C5_HUMAN and the like) instead:

awk '{if ($0 ~ /^>/)  {split($0,a,"|"); split(a[3],b," "); print ">"b[1]} else { print;}}' your_file > new_file

Thanks a lot @genomax2. If you post your comment as an answer, I will accept it.

7.6 years ago
awk -F '|' '/^>/ {printf(">%s\n",$2);next;} {print;}' input.fasta

Thanks @Pierre, even more concise!! +1. Could you please explain what the first part of the awk command, right after the field separator option, does?


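/^>/ is a pattern: if the line begins with >, awk runs the action in the braces that follow. The next statement then skips the remaining rule, so header lines are not printed a second time.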
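
To make the pattern-action logic concrete, here is a rough Python equivalent of the awk one-liner. The function name rewrite_fasta is hypothetical, and the sketch assumes every header contains at least two '|'-separated fields.

```python
def rewrite_fasta(lines):
    """Yield FASTA lines with each header reduced to its second '|' field."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):      # awk pattern: /^>/
            fields = line.split("|")  # awk option: -F '|'
            yield ">" + fields[1]     # awk action: printf(">%s\n", $2)
            continue                  # awk statement: next
        yield line                    # awk second rule: {print;}


example = [
    ">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5\n",
    "LASJDQSMLASKDNAL\n",
]
for out in rewrite_fasta(example):
    print(out)
```

Each comment names the part of the awk command that the Python line stands in for.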

7.6 years ago
Buffo ★ 2.4k

Save the script as script.py and run it as

python script.py file.fasta

The result is written to file_extracted.fasta and looks like this:
>Q9Y5Q8
LASJDQSMLASKDNAL

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import sys


##########################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python script.py file.fasta
------------------------------------------------------------------------------------
'''
##########################################################################################

if len(sys.argv) != 2:
    print(syntax)
    sys.exit(1)

##########################################################################################

records = {}  # accession -> sequence; dicts keep insertion order in Python 3.7+
name = None
seq = ""
prefix = sys.argv[1].split('.')[0]

with open(sys.argv[1], 'r') as fasta_seqs:
    for line in fasta_seqs:

        line = line.rstrip('\n')

        if line.startswith('>'):
            # New header: store the record collected so far
            if name is not None:
                records[name] = seq
                seq = ""
            # The accession is the second '|'-separated field
            name = line.split('|')[1]

        else:
            seq = seq + line

# Store the last record (the accumulated sequence, not the last line read)
if name is not None:
    records[name] = seq

with open(prefix + '_' + 'extracted.fasta', 'w') as outfile:
    for key, value in records.items():
        outfile.write('>' + key + '\n' + value + '\n')

Feel free to modify it as you need


In this case, I would make it a little easier for the user with the Biopython module:

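from Bio import SeqIO

for seq_record in SeqIO.parse('sample.fasta', 'fasta'):
    # seq_record.id is the header up to the first space, e.g. sp|Q9Y5Q8|TF3C5_HUMAN
    header = seq_record.id
    UniprotID = '>' + header.split('|')[1]
    # Print ProteinName instead of UniprotID to get the entry name
    ProteinName = '>' + header.split('|')[-1].split(' ')[0]
    seqs = str(seq_record.seq)
    print(UniprotID)
    print(seqs)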

Yes, I know, but personally I don't like to use Biopython, and even less to use print for FASTA files. I think there is a function called write_fasta or something like that in the SeqIO module, isn't there?


Yes, SeqIO.write() exists. Alternatively, you can use print with the format() function to produce properly formatted output.
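
As an illustration of the print-with-format() option, the sketch below builds one FASTA record as text using plain Python only. The helper name format_fasta and the 60-column line width are choices made for this example, not something taken from Biopython.

```python
def format_fasta(identifier, sequence, width=60):
    """Return one FASTA record as a string, wrapping the sequence at `width`."""
    # Slice the sequence into fixed-width chunks, one per output line
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">{}\n{}".format(identifier, "\n".join(chunks))


print(format_fasta("Q9Y5Q8", "LASJDQSMLASKDNAL"))
```

Wrapping is optional in FASTA, but fixed-width lines keep long sequences readable in a text editor.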

2.6 years ago

If you only want the unique identifiers and not the sequences:

awk -F '|' '/^>/ {printf(">%s\n",$2);}' proteome.fasta | cut -c 2- > identifiers.txt

Example input:

>sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476 OS=Aquifex aeolicus (strain VF5) OX=224324 GN=aq_1476 PE=4 SV=1
MLKSLTMENVKVVTGEIEKLRERIEKVKETLDLIPKEIEELERELERVRQEIAKKEDEL
AVAREIRHKEHEFTEVKQKIAYHRKYLERADSPREYERLLQERQKLIERAYKLSEEIYE
RRKYEALREEEEKLHQKEDEIEEKIHKLKKEYRALLNELKGLIEELNRKAREIIEKYGL
>tr|A0A384D5E1|A0A384D5E1_URSMA Prokineticin-1 OS=Ursus maritimus OX=29073 GN=PROK1 PE=3 SV=1
MRGAMRVSIMFLLVTVSDCAVITGACERDVQCGAGTCCAISLWLRGLRMCTPLGREGEEC
HPGSHKVPFFRRRQHHTCPCLPSLLCSRCLDGRYRCSTDLKNINF

Example output:

O67453
A0A384D5E1
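
For comparison, the same extraction can be sketched in Python. The function name iter_accessions is hypothetical, the sequences in the example are shortened, and the code assumes every header has at least two '|'-separated fields.

```python
def iter_accessions(lines):
    """Yield the second '|'-separated field of every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            yield line.split("|")[1]


example = """\
>sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476 OS=Aquifex aeolicus (strain VF5) OX=224324 GN=aq_1476 PE=4 SV=1
MLKSLTMENVKVVTGEIEKLRERIEKVKETLDLIPKEIEELERELERVRQEIAKKEDEL
>tr|A0A384D5E1|A0A384D5E1_URSMA Prokineticin-1 OS=Ursus maritimus OX=29073 GN=PROK1 PE=3 SV=1
MRGAMRVSIMFLLVTVSDCAVITGACERDVQCGAGTCCAISLWLRGLRMCTPLGREGEEC
""".splitlines()

for accession in iter_accessions(example):
    print(accession)
```

To run it on a real file, pass an open file handle instead of the example list, since iterating over a file yields its lines.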
