I have the next code that extract the sequences from a FASTQ file and convert in FASTA file; and the end make a count of the sequences available in the fasta file; the problem is the count is not Correct, the result is always less than the sequences available in the fasta output file.
FIRST SCRIPT
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use Term::ANSIColor;
# Variables
#--------------------------------------------------------------------------------------------------------------------------
my ($imput, $output, $line, $usage, $help_desciption, $help, $options, $op_desciption);
GetOptions (
'i=s' => \$imput,
'o=s' => \$output,
'h' => \$help,
'op' => \$options
);
# help
#--------------------------------------------------------------------------------------------------------------------------
$help_desciption = (qq(
This Script extract the sequences from a FASTQ file and convert it to FASTA format
The sequences must be in the same file. It will generate a fasta file with the sequences
in the same order and name that are available in the imput file.
Example:
fastxQA -i file.fastq -o file.fasta
or for more options see
fastxQA -op
));
if ($help) {
print "$help_desciption\n\n";
exit;
}
# options
#--------------------------------------------------------------------------------------------------------------------------
$op_desciption = (qq(
-i = input file in a .fastq format required
-o = output file in a .fa, .fas, .fasta or .txt format required
-h = help optional
-op = options optional
));
if ($options) {
print "$op_desciption\n\n";
exit;
}
# Error
#--------------------------------------------------------------------------------------------------------------------------
$usage = (qq(
Error:
Wrong Arguments
Usage:
fastxQA -i infile.fastq -o utfile.fasta
));
if (!$imput or !$output) {
print color("red"), "$usage", color("reset"),"\n\n";
exit;
}
# Open Data
#--------------------------------------------------------------------------------------------------------------------------
open FASTQIN, '<', "$imput" or die (color("red"), "\n\t\tCan't open $imput file\n\t\t\t or\n\t\t Does not exist such file", color("reset"),"\n\n");
open FASTAOUT, '>', "$output" or die (color("red"), "Can't genenate $output file", color("reset"),"\n\n");
# Processing Message
#--------------------------------------------------------------------------------------------------------------------------
print "\n";
$| = 1;
print colored("\t\t This Process Will Take a While ", "blue blink");
#sleep 2;
print "\r";
# Extract sequences
#--------------------------------------------------------------------------------------------------------------------------
while(
defined(my $head = <FASTQIN>) &&
defined(my $seq = <FASTQIN>) &&
defined(my $qhead = <FASTQIN>) && # te
defined(my $quality = <FASTQIN>) # cuarta linea
){
substr($head, 0, 1, '>'); #0 elimina el primer caracter (@) de la primer linea; el 1 remplaza la cantidad de caracteres, este caso solo 1 caracter
print FASTAOUT $head, $seq;
}
# Count sequences "HAY UN ERROR DA 10 SEQUENCIAS MENOS "
#--------------------------------------------------------------------------------------------------------------------------
open COUNTERSEQ, '<', "$output" or die (color("red"), "Can't open $output file", color("reset"),"\n\n");
my $count = 0;
while( <COUNTERSEQ> ){
chomp $_;
if($_=~ /^>/g )
{
$count++;
}
}
# Close and Exit
#--------------------------------------------------------------------------------------------------------------------------
close FASTQIN;
close FASTAOUT;
close COUNTERSEQ;
print color("blue")," DONE \n\n", color("reset");
print color("blue")," - $count sequences were converted to $output File- \n\n", color("reset");
exit;
#--------------------------------------------------------------------------------------------------------------------------------------
However when I try the code for counting separately it works well:
SECOND SCRIPT
#!/usr/bin/perl -w
use strict;
my $infile =$ARGV[0];
open COUNTERSEQ, '<', $infile, or die;
my $count = 0;
while( <COUNTERSEQ> ){
chomp $_;
if($_=~ /^>/g )
{
$count++;
}
}
print "$count\n";
Does anybody can help me to say what is the error in the connection in the first Script among the extracting section and the count section!?
Thanks so much!
I didn't understand what you're trying to do, but if all you want is to count the number of sequences in a fasta file, you can just:
The first script extract the sequences from FASTQ to FASTA and the the final message just print the total number of the sequences that are available in the FASTA file!
Thanks
You could just do it with:
I think
would more or less do the trick, no?
No, because "@" is 31 in Phred+33 scale (even "^@" wont work, there might be a base at the beginning with a "@" score)