Question

How to extract Gene symbol from protein fasta file header?

2

10.8 years ago

Abdul Rawoof ▴ 60

Hello,

I have multiple FASTA headers in a file, and I want to extract only the Gene_Symbol entry from every header into a separate file.

>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas

Expected Result:

Gene_Symbol=NOTCH Kinase NotchRas

I have tried the following Perl script:

chomp($fname=<STDIN>);

open(IN,$fname) or die "Not correct file!!";
@cont=<IN>; close IN;

open(OUT,">IPIGenes.txt") or die "Can't open it !!";
$size=@cont;

for($i=0;$i<=$size;$i++)
{ 
    chomp($cont[$i]); 
    @data=split('\|',$cont[$i]); 
    { 
      if($data[$i]=~/^Gene_Symbol/)
          {print OUT"$data[$i]\n";}
      else{skip;}
     }
}

But I am not getting any output.

Thanks in advance

GENENAME PERL FASTA • 5.6k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Abdul Rawoof ▴ 60

0

@Abdul Rawoof,

To be honest, I am not that good at writing such scripts. I have a similar problem with sorting the gene names from FASTA headers. How did you manage to sort the gene codes so easily using Excel?

Thanks in advance,
Shewit

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.4 years ago by skalayout • 0

0

Dear skalayout, first I extracted all the FASTA headers into a separate text file using a small Perl script.

You will get protein FASTA headers like the following:

>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas

Then I replaced "Gene_Symbol" with "#Gene_Symbol" using the find-and-replace option in TextPad and saved the changes. After that, I opened the file in Excel using the Text Import Wizard: select Delimited > Next > tick Tab and, under Other, enter the # symbol > Finish. You will get Gene_Symbol=gene name in a separate column.
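The same replace-then-split idea can be sketched on the command line, without TextPad or Excel. This is only a minimal sketch, assuming that "#" does not otherwise occur in the headers; the variable names are illustrative and not from the original post.

```shell
# Header taken from the question; the variable names are hypothetical.
header='>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas'

# Step 1 (sed): mark the start of the gene symbol with "#", like the find-and-replace step.
# Step 2 (cut): split on "#" and keep the second column, like the Excel import step.
gene_column=$(printf '%s\n' "$header" | sed 's/Gene_Symbol/#Gene_Symbol/' | cut -d '#' -f 2)

printf '%s\n' "$gene_column"
```

For a whole file, the same sed and cut pipeline could be fed from grep '^>' on the FASTA file instead of printf.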

Best,
Abdul Rawoof

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.4 years ago by Abdul Rawoof ▴ 60

0

Thanks Abdul. Amazingly, your suggestion is still helpful, even after two years:)

Thanks again.

ADD REPLY • link 8.2 years ago by skalayout • 0

Ram · Answer 1 · 2014-07-17

3

10.8 years ago

Neilfws 49k

Assuming that all the headers are of that form (i.e. gene symbols are preceded by Gene_Symbol= and the symbol text is the last entry before the end of the line), this is very easy using only grep:

grep -ioP "Gene_Symbol=(.*?)$" inFile.fasta >IPIGenes.txt

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Neilfws 49k

1

I suspect that case-insensitive matching is not required in this case (so lose -i), and the slower Perl regular expressions are probably not necessary either, so the following would likely do:

grep -o 'Gene_Symbol=.*$' inFile.fasta >IPIGenes.out

If there are problems with "Gene_Symbol=" occurring elsewhere in the fasta headers, then the following would be a bit more robust:

grep -o ' Gene_Symbol=.*$' inFile.fasta > IPIGenes.out

Or if you prefer Perl regular expressions, the word boundary can be used instead:

grep -oP '\bGene_Symbol=.*$' inFile.fasta > IPIGenes.out
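To illustrate the difference between the plain and the space-anchored pattern, here is a small sketch. The header below is hypothetical: it is a shortened version of the one in the question with an invented "Old_Gene_Symbol=FOO" entry added, purely to show what happens when "Gene_Symbol=" occurs more than once. The variable names are also illustrative.

```shell
# Hypothetical test file: "Old_Gene_Symbol=FOO" is invented for this demonstration.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>IPI:IPI00000875.1|SWISSPROT:P01141 Tax_Id=9806 Old_Gene_Symbol=FOO Gene_Symbol=NOTCH Kinase NotchRas
MKVLAAGIVGLLLAQ
EOF

# The plain pattern starts at the first "Gene_Symbol=" it finds, even inside "Old_Gene_Symbol=".
plain=$(grep -o 'Gene_Symbol=.*$' "$fasta")

# Requiring a leading space skips "Old_Gene_Symbol="; sed then strips that leading space.
anchored=$(grep -o ' Gene_Symbol=.*$' "$fasta" | sed 's/^ //')

printf 'plain:    %s\n' "$plain"
printf 'anchored: %s\n' "$anchored"

rm -f "$fasta"
```

Note that the space-anchored pattern keeps the leading space in its match, which is why the sketch pipes the result through sed.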

ADD REPLY • link 10.8 years ago by hpmcwill ★ 1.2k

Ram · Answer 2 · 2014-07-10

1

10.8 years ago

Ashutosh Pandey 12k

You messed up the format when you posted your header, so I edited it a little. If it is the correct format, then you can try:

grep "^>" protein.fasta | awk '{split($0,a,"9806"); print a[2]}'

grep "^>" prints only the header lines of the file, i.e. those that start with ">".

awk then takes each line, splits it using "9806" as the delimiter, and prints the second element.

If there is a "|" between 9806 and Gene_Symbol, you can use that as the delimiter too.

In that case, pipe the grep output into this instead: awk '{split($0,a,"|"); print a[7]}'

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

0

Thanks for your suggestion, but I have a different Tax_Id for each entry. What should I do in that case?

ADD REPLY • link 10.8 years ago by Abdul Rawoof ▴ 60

0

Do you have a | character between your Tax_Id and Gene_Symbol, as I mentioned above? In that case, you can use awk '{split($0,a,"|"); print a[7]}' as I suggested. Otherwise, you can use = as the delimiter and print the third element, like grep "^>" protein.fasta | awk '{split($0,a,"="); print a[3]}' to get the gene symbol.
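As a quick check that splitting on "=" does not depend on the Tax_Id value, here is a minimal sketch with two headers. The second header and both sequence lines are invented placeholders for illustration only, and the sketch assumes each header contains exactly two "=" signs, as in the question.

```shell
# Hypothetical input: the second record and the "MSEQ" sequence lines are made up.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas
MSEQ
>IPI:IPI00000000.1|SWISSPROT:P00000 Tax_Id=10090 Gene_Symbol=EXAMPLE Hypothetical protein
MSEQ
EOF

# Keep header lines only, split each on "=", and print the text after the second "=".
symbols=$(grep '^>' "$fasta" | awk '{split($0, a, "="); print a[3]}')

printf '%s\n' "$symbols"

rm -f "$fasta"
```

If a description could itself contain "=", the third element would be cut short, so this only holds under the stated assumption.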

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

0

Thanks Ashutosh, your suggestions are helpful for me, but I have also done this with Excel, and I got it easily. Thanks again.

ADD REPLY • link 10.8 years ago by Abdul Rawoof ▴ 60

Ram · Answer 3 · 2014-07-10

1

10.8 years ago

Phil S. ▴ 700

If your header looks like the one mentioned above, you can go with:

grep "^>" foo.fasta | awk '{split($0,a," "); split(a[3],b,"="); print b[2]}'

This will output "NOTCH".

If you want to have something like:

Gene_Symbol=NOTCH

do this:

grep "^>" foo.fasta | awk '{split($0,a," "); print a[3]}'
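Both commands above rely on Gene_Symbol being the third space-separated field. A possible variant, shown here only as a sketch, scans every field for the Gene_Symbol= prefix instead, so the position does not matter. Like the commands above, it returns only the first word of the symbol; the variable names are illustrative.

```shell
# Header taken from the question; the variable names are hypothetical.
header='>IPI:IPI00000875.1|SWISSPROT:P01141|TREMBL:Q5U081|ENSEMBL:ENSP00000357648|REFSEQ:NP_002255|VEGA:OTTHUMP00000012879 Tax_Id=9806 Gene_Symbol=NOTCH Kinase NotchRas'

# On header lines, look at each whitespace-separated field and print the
# value of the one that starts with "Gene_Symbol=", with the prefix removed.
gene_symbol=$(printf '%s\n' "$header" | awk '
    /^>/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^Gene_Symbol=/) {
                sub(/^Gene_Symbol=/, "", $i)
                print $i
            }
        }
    }')

printf '%s\n' "$gene_symbol"
```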

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Phil S. ▴ 700

Ram · Answer 4 · 2014-07-17

1

10.8 years ago

Kenosis ★ 1.3k

Here are a couple of Perl options:

use strict;
use warnings;

while (<>) {
    print "$1\n" if /(Gene_Symbol.+)/;
}

Command-line usage: perl script.pl inFile.fasta >IPIGenes.txt

As a one-liner:

perl -lne 'print $1 if /(Gene_Symbol.+)/' inFile.fasta >IPIGenes.txt

Output from both on your dataset:

Gene_Symbol=NOTCH Kinase NotchRas

There's no need for the usual check for a fasta header, viz., /^>/, since Gene_Symbol will only appear in the header.

Hope this helps!

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Kenosis ★ 1.3k

0

Hi, this is a great help for me. I was also trying to do something like that but failed, and then I did it easily with Excel.

Thanks,

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Abdul Rawoof ▴ 60

Ram · Answer 5 · 2015-04-02

1

10.1 years ago

Prakki Rama ★ 2.7k

My way :)

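use strict;
use warnings;

my $fname = "input.fasta";

# Three-argument open with a lexical filehandle; report the system error on failure
open(my $in, '<', $fname) or die "Cannot open $fname: $!";

while (<$in>)
{
    # Header lines start with ">"; capture from Gene_Symbol= to the end of the line
    if (/^>.+(Gene_Symbol=.+)$/)
    {
        print "$1\n";
    }
}

close($in);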

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.1 years ago by Prakki Rama ★ 2.7k