Question

How to match a FASTA header for extraction using Perl?

6.8 years ago

Mimmi Ahlmén ▴ 30

Hi!

I have a FASTA file of sequences and want to replace the old FASTA headers with new ones. The first step is to match the header names, that is, the part after the '>'. How do I do this? All sequences have headers somewhat like this:

>Halobacterium_salinarum

This is the part of the code where I find the headers:

     while (my $line = <$IN>) {
         if ($line =~ /^>/) {
             my $x =           # Here I want to match with "Halobacterium_salinarum"
                               # and all the other different species names

I have tried for hours to find the right match characters. Is it "any word character", \w? I also want to save the old species name in a hash, so should I capture it with (\w+) and finish with \s, because that's where the name ends?

Perl • 3.2k views

updated 6.7 years ago by JC 13k • written 6.8 years ago by Mimmi Ahlmén ▴ 30

Try the script from the following article.

https://www.perlmonks.org/?node_id=975419

Reply • 6.8 years ago by Arup Ghosh 3.3k

So, people still use Perl for Bioinformatics!

Reply • 6.7 years ago by Santosh Anand 5.8k

Using BioPerl will probably make your life easier:

use strict;
use warnings;
use Bio::SeqIO;

my $file  = shift @ARGV;    # path to the FASTA file
my $fasta = Bio::SeqIO->new(-file => $file, -format => 'Fasta');
while ( my $seq = $fasta->next_seq() ) {
  # id() returns the first word of the header, already without the leading '>'
  my $header = $seq->id;
  print "My species name = $header\n";
}

Reply • 6.7 years ago by Juke34 9.3k

score 1 · Answer 1 · 2018-11-23

6.7 years ago

Juke34 9.3k

while (my $line = <$IN>) {
  # Match header lines and capture everything after the '>'
  if ($line =~ m/^>(.+)/) {
     print "My species name = $1\n";
  }
}
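
To see why the choice of capture group matters, here is a small sketch comparing (.+) and (\w+) on a header that carries a description after the name. The sample header and the helper name capture_header are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a pattern to a header line and return the capture,
# or undef when the line does not match.
sub capture_header {
    my ($line, $pattern) = @_;
    chomp $line;
    return $line =~ $pattern ? $1 : undef;
}

# Made-up example header with a description after the species name.
my $line = ">Halobacterium_salinarum strain NRC-1\n";

my $whole = capture_header($line, qr/^>(.+)/);   # everything after '>'
my $word  = capture_header($line, qr/^>(\w+)/);  # stops at the first non-word character

print "(.+)  captured: $whole\n";
print "(\\w+) captured: $word\n";
```

With headers like the one in the question, which contain only the name, both patterns capture the same text. They differ only when something follows the name on the header line.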

score 0 · Answer 2 · 2018-11-23

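In Perl, \w matches any alphanumeric character or the underscore, so (\w+) captures a word and stops at the first non-word character (such as a space or a newline). You therefore do not need a trailing \s. If your names can contain characters such as '-' or '.', (\S+) captures everything up to the first whitespace instead. If you want to save the names in a hash: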

#!/usr/bin/perl

use strict;
use warnings;

my %species = ();
while (<>) {
    # Capture the species name that follows '>' on header lines
    if ( m/^>(\w+)/ ) {
        $species{$1}++;
    }
}

print "Species\tCount\n";
while (my ($sp, $cnt) = each %species) {
    print "$sp\t$cnt\n";
}
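
Since the end goal in the question is to replace old headers with new ones, the same match can drive a lookup in a hash of old-to-new names. This is only a sketch: the %new_name mapping, the new name 'Hsal_001', the sample lines, and the rename_header helper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical mapping from old header names to new ones.
my %new_name = (
    Halobacterium_salinarum => 'Hsal_001',
);

# Return the line with its header name replaced when a mapping exists;
# sequence lines and unmapped headers are returned unchanged.
sub rename_header {
    my ($line, $map) = @_;
    if ( $line =~ m/^>(\w+)/ && exists $map->{$1} ) {
        my $new = $map->{$1};
        $line =~ s/^>\w+/>$new/;
    }
    return $line;
}

# Demonstrate on a few made-up lines instead of a real file.
my @lines = (
    ">Halobacterium_salinarum\n",
    "ATGACCGATCGT\n",
    ">Unknown_species\n",
);
print rename_header($_, \%new_name) for @lines;
```

To apply this to a real file, you could call rename_header on each line read from your filehandle and print the result to an output file. Leaving unmapped headers unchanged is a design choice here; you may prefer to warn about them instead.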