How to extract Gene symbol from protein fasta file header?
5
2
10.4 years ago
Abdul Rawoof ▴ 60

Hello,

I have a file containing multiple FASTA headers, and I want to extract only the Gene_Symbol field from each header into a separate file.

>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas

Expected Result:

Gene_Symbol=NOTCH Kinase NotchRas

I have tried the following Perl script:

chomp($fname=<STDIN>);

open(IN,$fname) or die "Not correct file!!";
@cont=<IN>; close IN;

open(OUT,">IPIGenes.txt") or die "Can't open it !!";
$size=@cont;

for($i=0;$i<=$size;$i++)
{ 
    chomp($cont[$i]); 
    @data=split('\|',$cont[$i]); 
    { 
      if($data[$i]=~/^Gene_Symbol/)
          {print OUT"$data[$i]\n";}
      else{skip;}
     }
}

But I am not getting any output.

Thanks in advance

GENENAME PERL FASTA • 5.2k views
ADD COMMENT
0

@Abdul Rawoof,

To be honest, I am not that good at writing such scripts. I have a similar problem with sorting the gene names from FASTA headers. How did you manage to sort the gene codes so easily using Excel?

Thanks in advance,
Shewit

ADD REPLY
0

Dear skalayout, first of all I extracted all the FASTA headers into a separate text file using a small Perl script.

You will get protein FASTA headers like the following:

>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas

Then I replaced "Gene_Symbol" with "#Gene_Symbol" using the Find and Replace option in TextPad and saved the changes. After that I opened the file in Excel using the Text Import Wizard: select Delimited > Next > tick Tab, enter the # symbol under Other, and click Finish. You will get Gene_Symbol=gene name in a separate column.
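The same find-and-replace then split-on-# idea can be sketched on the command line. This is only an illustration: the temporary file stands in for the extracted header file, the variable names are made up, and it assumes no header already contains a # character.

```shell
# Hypothetical stand-in for the text file of extracted headers.
headers=$(mktemp)
printf '%s\n' '>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas' > "$headers"

# Step 1: put a '#' in front of Gene_Symbol (the TextPad find and replace).
# Step 2: split each line on '#' and keep the second column (the Excel import).
column=$(sed 's/Gene_Symbol/#Gene_Symbol/' "$headers" | cut -d '#' -f 2)

printf '%s\n' "$column"

rm -f "$headers"
```

Note that cut prints a line unchanged when it contains no delimiter, so headers without Gene_Symbol would pass through whole unless -s is added to cut.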

Best,
Abdul Rawoof

ADD REPLY
0

Thanks Abdul. Amazingly, your suggestion is still helpful, even after two years:)

Thanks again.

ADD REPLY
3
10.4 years ago
Neilfws 49k

Assuming that all the headers are of that form (i.e. gene symbols are preceded by Gene_Symbol= and the symbol text is the last entry before the end of the line), this is very easy using only grep:

grep -ioP "Gene_Symbol=(.*?)$" inFile.fasta >IPIGenes.txt
ADD COMMENT
1

I suspect that case-insensitive matching is not required here (so lose -i), and the slower Perl regular expressions are probably not necessary either, so the following would likely do:

grep -o 'Gene_Symbol=.*$' inFile.fasta >IPIGenes.out

If there are problems with "Gene_Symbol=" occurring elsewhere in the fasta headers, then the following would be a bit more robust:

grep -o ' Gene_Symbol=.*$' inFile.fasta > IPIGenes.out

Or if you prefer Perl regular expressions, the word boundary can be used instead:

grep -oP '\bGene_Symbol=.*$' inFile.fasta > IPIGenes.out
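To see how the plain and the space-anchored patterns differ in practice, here is a minimal sketch that runs both on the header from the question. The temporary file, the sequence line, and the variable names are invented for illustration, and it assumes a grep that supports -o (GNU and BSD grep do).

```shell
# Small test FASTA in a temporary file (the sequence line is made up).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas
MKTAYIAKQR
EOF

# Plain pattern: prints from "Gene_Symbol=" to the end of the line.
plain=$(grep -o 'Gene_Symbol=.*$' "$fasta")

# Space-anchored pattern: the space is part of the match, so it is printed too.
anchored=$(grep -o ' Gene_Symbol=.*$' "$fasta")

printf '%s\n' "$plain"
printf '[%s]\n' "$anchored"   # brackets make the leading space visible

rm -f "$fasta"
```

The space-anchored variant keeps its leading space in the output, so a trailing sed 's/^ //' may be wanted if the result is fed to another tool.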
ADD REPLY
1
10.4 years ago

You messed up the format when you posted your header, so I edited it a little. If that is the correct format, you can try:

grep "^>" protein.fasta | awk '{split($0,a,"9806"); print a[2]}'

grep "^>" prints only the header lines, i.e. the lines that start with ">".

awk then takes each of those lines, splits it using "9806" as the delimiter, and prints the second element.

If there is a "|" between 9806 and Gene_Symbol, you can use that as the delimiter instead.

In that case, pipe the grep output into this: awk '{split($0,a,"|"); print a[7]}'

ADD COMMENT
0

Thanks for your suggestion, but the Tax_Id differs between my entries. What should I do in that case?

ADD REPLY
0

Do you have a | character between your Tax_Id and Gene_Symbol, as I mentioned above? If so, you can use awk '{split($0,a,"|"); print a[7]}' as I suggested. Otherwise you can use = as the delimiter and print the third element, e.g. grep "^>" protein.fasta | awk '{split($0,a,"="); print a[3]}', to get the gene symbol.
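As a rough check that splitting on = does not depend on the Tax_Id value, here is a small sketch with two headers carrying different Tax_Ids. The second header, the shortened accession lists, the sequence lines, and the variable names are all invented for illustration, and it assumes the only = signs in a header are those in Tax_Id= and Gene_Symbol=.

```shell
# Two-record test FASTA with different Tax_Id values (hypothetical data).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>IPI:IPI00000875.1|SWISSPROT:P01141 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas
MKTAYIAKQR
>IPI:IPI00000001.1|SWISSPROT:P00001 Tax_Id=10090 Gene_Symbol=GENE2
MSLLTEVETP
EOF

# Keep header lines, split each on "=", and print the third piece,
# which is everything after "Gene_Symbol=".
symbols=$(grep "^>" "$fasta" | awk '{split($0,a,"="); print a[3]}')

printf '%s\n' "$symbols"

rm -f "$fasta"
```

If a header contained an extra = before Gene_Symbol, the third piece would no longer be the gene symbol, so matching on the literal text Gene_Symbol= is the safer route for messier files.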

ADD REPLY
0

Thanks Ashutosh, your suggestions are helpful. I have also done this with Excel, and it was easy. Thanks again.

ADD REPLY
1
10.4 years ago
Phil S. ▴ 700

If your header looks like the one mentioned above you can go with:

grep "^>" foo.fasta | awk '{split($0,a," "); split(a[3],b,"="); print b[2]}'

This will output "NOTCH" (only the text up to the first space after Gene_Symbol=).

If you want to have something like:

Gene_Symbol=NOTCH

do this:

grep "^>" foo.fasta | awk '{split($0,a," "); print a[3]}'
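Because splitting on spaces cuts the symbol text at its first space, a variant that keeps everything up to the end of the line may be useful. The sketch below compares the two; it is only one possible approach, using awk's standard match() function with RSTART and RLENGTH, and the temporary file, sequence line, and variable names are invented for illustration.

```shell
# Test FASTA with the header from the question (the sequence line is made up).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas
MKTAYIAKQR
EOF

# Third whitespace-separated field: stops at the first space in the symbol text.
third_field=$(grep "^>" "$fasta" | awk '{split($0,a," "); print a[3]}')

# match() locates "Gene_Symbol=" plus the rest of the line; substr() prints it.
full_field=$(awk '/^>/ && match($0, /Gene_Symbol=.*/) { print substr($0, RSTART, RLENGTH) }' "$fasta")

printf '%s\n' "$third_field"
printf '%s\n' "$full_field"

rm -f "$fasta"
```

The second command also skips the separate grep, since the /^>/ condition restricts awk to header lines.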
ADD COMMENT
1
10.4 years ago
Kenosis ★ 1.3k

Here are a couple of Perl options:

use strict;
use warnings;

while (<>) {
    print "$1\n" if /(Gene_Symbol.+)/;
}

Command-line usage: perl script.pl inFile.fasta >IPIGenes.txt

As a one-liner:

perl -lne 'print $1 if /(Gene_Symbol.+)/' inFile.fasta >IPIGenes.txt

Output from both on your dataset:

Gene_Symbol=NOTCH Kinase NotchRas

There's no need for the usual check for a fasta header, viz., /^>/, since Gene_Symbol will only appear in the header.

Hope this helps!

ADD COMMENT
0

Hi, this is a great help. I was trying to do something like that but failed, and in the end I did it easily with Excel.

Thanks,

ADD REPLY
1
9.7 years ago
Prakki Rama ★ 2.7k

My way(:)

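use strict;
use warnings;

my $fname = "input.fasta";

open(my $in, '<', $fname) or die "Cannot open $fname: $!";

while (<$in>)
{
    # Header lines start with ">"; capture from Gene_Symbol= to the end of the line.
    if (/^>.+(Gene_Symbol=.+)$/)
    {
        print "$1\n";
    }
}

close($in);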
ADD COMMENT
