How to match a FASTA header for extraction using Perl?
6.1 years ago

Hi!

I have a FASTA file containing sequences, and I want to replace the old FASTA headers with new ones. The first step is to match the header names, that is, the text after the '>'. How do I do this? All sequences have headers like this:

>Halobacterium_salinarum

This is the part of the code where I find the headers:

     while (my $line = <$IN>) {  if ($line =~ /^>/) {
     my $x =           # Here I want to match with "Halobacterium_salinarum" 
                       # and all the other different species names

I have spent hours trying to work out the right match characters. Is it "any word character", \w? I also want to save the old species name in a hash. Should I capture it with (\w+) and finish with \s, because that is where the name ends?


Try the script from the following article.

https://www.perlmonks.org/?node_id=975419


So, people still use Perl for Bioinformatics!


Using BioPerl will probably make your life easier:

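use strict;
use warnings;
use Bio::SeqIO;

# Path to the FASTA file, taken from the command line
my $file = shift @ARGV or die "Usage: $0 file.fasta\n";

my $fasta = Bio::SeqIO->new(-file => $file, -format => 'Fasta');
while ( my $seq = $fasta->next_seq() ) {
  # id() returns the first word of the header, without the leading '>'
  my $header = $seq->id;
  print "My species name = $header\n";
}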
6.0 years ago
Juke34 8.9k
while (my $line = <$IN>) {
  # Capture everything after '>' on header lines
  if ($line =~ m/^>(.+)/) {
     print "My species name = $1\n";
  }
}
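
To see how the captured text depends on the pattern, here is a minimal sketch run on a single made-up header line. The sample header (with a "strain R1" description) and the variable names are assumptions for illustration only, not something from your file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical header line with a description after the species name.
my $line = ">Halobacterium_salinarum strain R1\n";

# (.+) captures everything after '>' up to, but not including, the newline.
my ($whole) = $line =~ /^>(.+)/;

# (\w+) stops at the first non-word character (here the space).
my ($word) = $line =~ /^>(\w+)/;

# (\S+) stops at the first whitespace, so it would also keep '.', '-' or '|'.
my ($token) = $line =~ /^>(\S+)/;

print "whole = $whole\n";
print "word  = $word\n";
print "token = $token\n";
```

If your headers never contain anything but the species name, all three patterns give the same result; they only differ once a description or punctuation follows the name.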
6.0 years ago
JC 13k

In Perl, \w matches any alphanumeric character or the underscore, so (\w+) captures the name and stops at the first non-word character (such as a space or newline). You do not need a trailing \s for that. If you want to save the names in a hash:

#!/usr/bin/perl

use strict;
use warnings;

my %species = ();
while (<>) {
    # Count each species name found on a header line
    if ( m/^>(\w+)/ ) {
        $species{$1}++;
    }
}

print "Species\tCount\n";
while (my ($sp, $cnt) = each %species) {
    print "$sp\t$cnt\n";
}
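
Since the end goal is to replace the old headers, the same match can drive the renaming step. The following is only a sketch under assumptions: the %new_name mapping, the new name 'Hsal_01', and the rename_header helper are all hypothetical and would need to be filled in with your own old-to-new names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mapping from old header names to new ones.
my %new_name = (
    Halobacterium_salinarum => 'Hsal_01',
);

# Return the line with its header name replaced if the name is in the map;
# sequence lines and unknown headers are returned unchanged.
sub rename_header {
    my ($line, $map) = @_;
    if ( $line =~ /^>(\w+)/ && exists $map->{$1} ) {
        my $old = $1;
        $line =~ s/^>\Q$old\E/>$map->{$old}/;
    }
    return $line;
}

# Demo on in-memory lines; for a real file, loop over the filehandle instead.
my @lines = (">Halobacterium_salinarum\n", "ATGCGT\n", ">Unknown_species\n", "GGCC\n");
for my $l (@lines) {
    print rename_header($l, \%new_name);
}
```

Headers that are not in the mapping are passed through untouched, which makes it easy to spot names you forgot to add.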