Check If A File Is In Fasta Format
3
0
12.6 years ago

I am trying to write a program that asks for a file name (if an invalid file name is given, it asks again, up to 5 times in total), and then checks whether the file is in FASTA format.

How do I code that? I have the following code so far.

#!/usr/bin/perl
# A program that asks for a file, opens it if the file exists, and checks
# if the file is in FASTA format
use strict;
use warnings;

# get the name of an existing file from the user
my $filename = openfile();

# subroutines
sub openfile {
    # ask for a file name up to five times
    for my $x (0 .. 4) {
        print "\n\nPlease enter file name: ";
        my $filename = <STDIN>;
        last unless defined $filename;    # stop at end of input
        chomp $filename;

        if (-e $filename) {
            print "File found!\n\n";
            return $filename;             # hand the name back instead of exiting
        }
        elsif ($x < 4) {
            print "Invalid file name!\n\n";
        }
        else {
            print "Five tries were unsuccessful! Please check and try again!\n\n";
        }
    }
    return;
}
fasta perl • 12k views

12
12.6 years ago
Neilfws 49k

First, there is no need to reinvent the wheel. Use the Bio::SeqIO module from BioPerl:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "myfile.fa", -format => "fasta");
while(my $seq = $seqio->next_seq) {
  # do stuff with sequences...
}

If the fasta file is invalid, this code will throw an exception, for example:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')

Second, don't waste time checking for multiple incorrect attempts. Once is enough :)
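
One way to read the "once is enough" advice, as a minimal sketch: take the file name from the command line and stop straight away if it is missing or unreadable. The helper name input_file_or_die and the messages are made up for illustration; only core Perl is used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): expects exactly one argument,
# the path of the file to validate, and dies with a clear message otherwise.
sub input_file_or_die {
    my @args = @_;
    die "Usage: $0 <fasta-file>\n" unless @args == 1;
    my ($path) = @args;
    # -r _ reuses the stat information gathered by the -f test
    die "Cannot read '$path': not a readable regular file\n"
        unless -f $path && -r _;
    return $path;
}

# In a real script you would call input_file_or_die(@ARGV) directly and let
# it die. The eval is here only so that this sketch reports the problem
# and carries on instead of exiting.
my $path = eval { input_file_or_die(@ARGV) };
print defined $path ? "Would validate $path\n" : "Error: $@";
```

The user then simply reruns the command with a corrected name, so no retry loop is needed.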

0

Thanks!!! I will try it out!

2
12.6 years ago

You could create a simple grammar for FASTA using GNU Bison:

%{
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

int yylex();
int yyerror( char* message);
%}
%error-verbose
%token LT OTHER SYMBOL CR
%start input
%%

input:   input  sequence | optspaces sequence;
sequence: head body optspaces;
head: LT anylist CR | LT CR;
anylist: anylist any | any;
any: LT | OTHER | SYMBOL;
body: symbols CR | body symbols CR ;
symbols: symbols symbol | symbol ;
symbol: SYMBOL;
optspaces: | crlist;
crlist: crlist CR | CR;

%%
int yyerror( char* message)
    {
    fprintf(stderr,"NOT A FASTA %s\n",message);
    exit(EXIT_FAILURE);
    return -1;
    }
int yylex()
    {
    int c=fgetc(stdin);
    switch(c)
        {
        case EOF: return c;
        case '>' : return LT;
        case '\n' : return CR;
        default: return isalpha(c)?SYMBOL:OTHER;
        }
    }

int main(int argc, char** argv)
    {
    return yyparse();
    }

and use it to test if a file is a fasta file:

#compile
bison fasta.y
gcc -Wall -O3 fasta.tab.c

#test
$ ./a.out < ~/file.xml
NOT A FASTA syntax error, unexpected OTHER, expecting LT

$ ./a.out < ~/rotavirus.fasta
$

2

Coming up with a set of these for popular formats would actually be a nice addition to any Makefile or similar pipeline: "BioValidators by Pierre".

1
12.6 years ago
ngsgene ▴ 380

Unless I am reading this too straightforwardly, you simply need to add an if to test whether a file is in FASTA format, with the condition if ($l =~ />/)

If the line ($l) contains >, you're good to go.

while (my $l = <DAT>) {
    chomp $l;
    if ($l =~ />/) {
        # line contains '>': do this
    }
    elsif ($l !~ />/) {
        # line does not contain '>': do this
    }
}

1

This is a little simplistic. You should at least check whether > is at the start of the first line, using /^>/. Also, there should be a check for no space after >. And then there is the problem of valid sequence lines.
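
The checks listed above could be sketched in plain Perl roughly like this. The helper name fasta_problem, the allowed sequence characters (letters plus * and -), and the tolerance for blank lines are assumptions made for illustration; they are not taken from BioPerl or from any formal FASTA specification.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns undef if the input looks like FASTA,
# otherwise a short description of the first problem found.
sub fasta_problem {
    my ($fh) = @_;
    my $seen_header = 0;    # true once any '>' line has been read
    my $need_seq    = 0;    # true after a header until a sequence line follows
    while (my $line = <$fh>) {
        my $n = $.;                      # current line number of $fh
        $line =~ s/\r?\n\z//;            # strip LF or CRLF line endings
        next if $line =~ /^\s*$/;        # assumption: blank lines are tolerated
        if ($line =~ /^>/) {
            return "line $n: previous header has no sequence lines" if $need_seq;
            return "line $n: space or nothing after '>'" unless $line =~ /^>\S/;
            $seen_header = 1;
            $need_seq    = 1;
        }
        else {
            return "line $n: sequence data before the first '>' header"
                unless $seen_header;
            # assumption: letters, '*' (stop) and '-' (gap) only, no spaces
            return "line $n: unexpected character in sequence line"
                unless $line =~ /^[A-Za-z*-]+$/;
            $need_seq = 0;
        }
    }
    return "no '>' header found" unless $seen_header;
    return "last header has no sequence lines" if $need_seq;
    return;
}

# Small demonstration on an in-memory file, so nothing has to exist on disk.
my $demo = ">seq1 example\nACGTACGT\n";
open(my $demo_fh, '<', \$demo) or die "cannot open in-memory file: $!";
my $problem = fasta_problem($demo_fh);
print defined $problem ? "NOT FASTA: $problem\n" : "looks like FASTA\n";
```

For a real file you would open it with open(my $fh, '<', $filename) and pass $fh in the same way. Returning a message rather than a plain true/false makes it easier to tell the user which line is wrong.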

1

True, it's more useful for parsing a FASTA file when you already know it's a FASTA file.
