perl count fasta sequences
2
0
Entering edit mode
9.4 years ago
cabraham03 ▴ 30

I have the next code that extract the sequences from a FASTQ file and convert in FASTA file; and the end make a count of the sequences available in the fasta file; the problem is the count is not Correct, the result is always less than the sequences available in the fasta output file.

FIRST SCRIPT

#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use Term::ANSIColor;

# Variables
#--------------------------------------------------------------------------------------------------------------------------
my ($imput, $output, $line, $usage, $help_desciption, $help, $options, $op_desciption);


GetOptions (
            'i=s' => \$imput,
            'o=s' => \$output,
            'h'   => \$help,
            'op'  => \$options
            );


# help
#--------------------------------------------------------------------------------------------------------------------------
$help_desciption = (qq(
                    This Script extract the sequences from a FASTQ file and convert it to FASTA format
                    The sequences must be in the same file. It will generate a fasta file with the sequences
                    in the same order and name that are available in the imput file.

                    Example:
                    fastxQA -i file.fastq -o file.fasta
                        or  for more options see
                    fastxQA -op
                    ));

if ($help) {
    print "$help_desciption\n\n";
    exit;
}

# options
#--------------------------------------------------------------------------------------------------------------------------
$op_desciption = (qq(
           -i  = input  file in a .fastq format                     required
           -o  = output file in a .fa, .fas, .fasta or .txt format  required
           -h  = help                                               optional
           -op = options                                            optional
           ));


if ($options) {
    print "$op_desciption\n\n";
    exit;
}

# Error
#--------------------------------------------------------------------------------------------------------------------------

$usage = (qq(
          Error:
             Wrong Arguments

             Usage:
             fastxQA -i infile.fastq -o utfile.fasta

));

if (!$imput or !$output) {
    print color("red"), "$usage", color("reset"),"\n\n";
    exit;
}

# Open Data
#--------------------------------------------------------------------------------------------------------------------------
open FASTQIN, '<', "$imput" or die (color("red"), "\n\t\tCan't open $imput file\n\t\t\t or\n\t\t Does not exist such file", color("reset"),"\n\n");
open FASTAOUT, '>', "$output" or die (color("red"), "Can't genenate $output file", color("reset"),"\n\n");

# Processing Message
#--------------------------------------------------------------------------------------------------------------------------
print "\n";
$| = 1;
print colored("\t\t  This Process Will Take a While  ", "blue blink");
#sleep 2;
print "\r";


# Extract sequences
#--------------------------------------------------------------------------------------------------------------------------
while(
defined(my $head = <FASTQIN>) &&
defined(my $seq = <FASTQIN>) &&
defined(my $qhead = <FASTQIN>) &&    # te
defined(my $quality = <FASTQIN>)     # cuarta linea
){
  substr($head, 0, 1, '>'); #0 elimina el primer caracter (@) de la primer linea; el 1 remplaza la cantidad de caracteres, este caso solo 1 caracter


  print FASTAOUT $head, $seq;

}

# Count sequences "HAY UN ERROR DA 10 SEQUENCIAS MENOS "
#--------------------------------------------------------------------------------------------------------------------------
open COUNTERSEQ, '<', "$output" or die (color("red"), "Can't open $output file", color("reset"),"\n\n");
my $count = 0;

while( <COUNTERSEQ> ){
    chomp $_;
       if($_=~ /^>/g ) 
       {
             $count++; 
    }
}

# Close and Exit
#--------------------------------------------------------------------------------------------------------------------------
  close FASTQIN;
  close FASTAOUT;
  close COUNTERSEQ;
  print color("blue"),"                                      DONE                               \n\n", color("reset");
  print color("blue"),"              -  $count  sequences were converted to   $output File- \n\n", color("reset");
  exit;
  #--------------------------------------------------------------------------------------------------------------------------------------

However when I try the code for counting separately it works well:

SECOND SCRIPT

#!/usr/bin/perl -w
use strict;

my $infile =$ARGV[0];
open COUNTERSEQ, '<', $infile, or die;

my $count = 0;

while( <COUNTERSEQ> ){
    chomp $_;
       if($_=~ /^>/g )
       {
             $count++;
    }
}

print "$count\n";

Does anybody can help me to say what is the error in the connection in the first Script among the extracting section and the count section!?

Thanks so much!

fasta sequence perl • 4.5k views
ADD COMMENT
1
Entering edit mode

I didn't understand what you're trying to do, but if all you want is to count the number of sequences in a fasta file, you can just:

grep -c ">" input.fasta
ADD REPLY
0
Entering edit mode

The first script extract the sequences from FASTQ to FASTA and the the final message just print the total number of the sequences that are available in the FASTA file!

Thanks

ADD REPLY
2
Entering edit mode

You could just do it with:

#fastq to fasta
cat test.fastq | paste - - - - | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n'
#Count fasta
grep -c ">" input.fasta
ADD REPLY
0
Entering edit mode

I think

grep -A 1 "@" | sed 's/^@/>/g'

would more or less do the trick, no?

ADD REPLY
1
Entering edit mode

No, because "@" is 31 in Phred+33 scale (even "^@" wont work, there might be a base at the beginning with a "@" score)

ADD REPLY
3
Entering edit mode
9.4 years ago
thackl ★ 3.0k

Try closing FASTAOUT before opening COUNTERSEQ. This might be a buffering issue.

ADD COMMENT
0
Entering edit mode

Thanks So much; It works well CLOSING before !!!

THANKS TO ALL OF YOU !!

ADD REPLY
1
Entering edit mode
9.4 years ago

If this is a programming learning experience:

How To Parse Fasta Files In Perl

otherwise:

grep ">" your.fasta | wc -l
ADD COMMENT
1
Entering edit mode

Do not miss the quotations marks or you will be sad.

ADD REPLY
0
Entering edit mode

or even:

grep -c '>' your.fasta
ADD REPLY

Login before adding your answer.

Traffic: 1737 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6