Question

Perl Program For Reading Fastq Format Files

0

Entering edit mode

14.0 years ago

Ssonia ▴ 70

I have to write a perl program which extract short sequences with its identification number and create different output files in FASTA format for each short sequence.

@DRR001110.1 GEKPHI202I3KU0 length=159
CTGCTGTATGTGTCCGTTGTCGTATATGTCCGTGTCGGTAGTATGTCCGTTGACGTGTAATCGTGGGGTCCTTAGGGTCGATAGTGTCCGTGTGGANGTNTCCGTTNAGGTCANTGGGCTGGNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNN
+DRR001110.1 GEKPHI202I3KU0 length=159
HHHHHHHFFFFFHGB:88?ABBBAA8844223252/--//,////599=11///////--//45//1/855--.,,,,.,..-.-289952221-.!33!21-.--!.--.3.!.111..--
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@DRR001110.2 GEKPHI202F5PBM length=117
ATTGACTCGCATATACGATTCCATGGTTCCGAGCTTTACCTAACTAGCTCAACTACGCTAGTCTACGCGTGGTCGGCGGTCTCTCAAGGCGCACAGGGGAGTAGGNNNNNNNNNNNN
+DRR001110.2 GEKPHI202F5PBM length=117
?==<<<<<<<4448<A?===>>??8888668:6766./..02623244:<>>??>8444<4343424224---.155../3996665513153--,,,,-,7811!!!!!!!!!!!!

fastq format conversion • 11k views

ADD COMMENT • link updated 14.0 years ago by lh3 33k • written 14.0 years ago by Ssonia ▴ 70

4

Entering edit mode

Hi @ssonia. It would be more respectful of the users of this forum if you took some time to ask and format your question properly. Write a short and descriptive titles. Put the question in the question box. Finally, do not hesitate to put in more context. What are you doing? Why? What have you tried so far? In this manner, you will get more useful answers and your question will in turn be much more useful to future users of the forum. Cheers

ADD REPLY • link 14.0 years ago by Eric Normandeau 11k

3

Entering edit mode

@ssonia - I don't want to delete if someone here can help you out, but I think you need to show what you have tried so far, and some idea of exactly what you want the script to do. Just asking someone to write code for you is always going to get pretty short shrift, but showing you have tried, but got stuck is more likely to lead to a useful answer, both for you and for other visitors to the site (which is the aim).

ADD REPLY • link 14.0 years ago by Simon Cockell 7.4k

2

Entering edit mode

as it currently stands, this is not a question, and I'm of a mind to close it. No value is being offered here to other users.

ADD REPLY • link 14.0 years ago by Simon Cockell 7.4k

0

Entering edit mode

well,Thanks for all comments if question seems still to be unclear and not useful kindly delete it. I tried to delete but looks no option for deleting a question only can edit it , iam sorry I cant get more clear it was just a simple question of bioperl program, anyway thanks all and kindly delete or closed it.

ADD REPLY • link 14.0 years ago by Ssonia ▴ 70

score 5 · Answer 1 · 2011-08-31

5

Entering edit mode

14.0 years ago

Nikolay Vyahhi ★ 1.3k

See fastqpl and BioPerl. Please, clarify the question, so we can help. It's not clear what's the question exactly is

ADD COMMENT • link 14.0 years ago by Nikolay Vyahhi ★ 1.3k

Ram · Answer 2 · 2011-09-01

I have seen more questions on parsing fastq and I am a little disappointed by the efficiency of the BioPerl and BioPython fastq parsers (see also my answer to this question). In the end, I decided to port my fasta/q parser to Perl/Python. This is readfq. It works for multi-line fastq and is by far faster than bioperl. The following is an example. To use readfq, you only need to copy-and-paste the `readfq()` function. The Perl version has not been commented or optimized.

sub readfq {
    my ($fh, $aux) = @_;
    @$aux = [undef, 0] if (!defined(@$aux));
    return if ($aux->[1]);
    if (!defined($aux->[0])) {
        while (<$fh>) {
            chomp;
            if (substr($_, 0, 1) eq '>' || substr($_, 0, 1) eq '@') {
                $aux->[0] = $_;
                last;
            }
        }
        if (!defined($aux->[0])) {
            $aux->[1] = 1;
            return;
        }
    }
    my $name = /^.(\S+)/? $1 : '';
    my $seq = '';
    my $c;
    $aux->[0] = undef;
    while (<$fh>) {
        chomp;
        $c = substr($_, 0, 1);
        last if ($c eq '>' || $c eq '@' || $c eq '+');
        $seq .= $_;
    }
    $aux->[0] = $_;
    $aux->[1] = 1 if (!defined($aux->[0]));
    return ($name, $seq) if ($c ne '+');
    my $qual = '';
    while (<$fh>) {
        chomp;
        $qual .= $_;
        if (length($qual) >= length($seq)) {
            $aux->[0] = undef;
            return ($name, $seq, $qual);
        }
    }
    $aux->[1] = 1;
    return ($name, $seq);
}

my @aux = undef;
my ($n, $slen, $qlen) = (0, 0, 0);
while (my ($name, $seq, $qual) = readfq(\*STDIN, \@aux)) {
    ++$n;
    $slen += length($seq);
    $qlen += length($qual) if ($qual);
}
print join("\t", $n, $slen, $qlen), "\n";

score 3 · Answer 3 · 2011-08-31

3

Entering edit mode

14.0 years ago

Michael 56k

How about fastq_to_fasta? I think that is exactly what you are asking for (except it's not a perl script).

ADD COMMENT • link 14.0 years ago by Michael 56k

score 2 · Answer 4 · 2011-08-31

I had a script close to what you want so I modified it a little bit and I think it now will give what you want; one fastq file for each sequence in the input fastq file.

One drawback of this script is that it expects each sequence to be only 4 lines long (i.e. no wrapping of the sequences). It's highly possible that your files really only contain 4 lines per sequence (as I understand from the sample you copy/pasted in the beginning) but you should check them again to be sure!

#!/bin/perl -w

use strict;

open ( IN, $ARGV[0] ) or die;

my $counter = 0;
while ( my $in_line = <IN> ) {

    # assuming only 4 lines per sequence!
    my $sequence = "";

    # add the sequence header
    $sequence .= $in_line;

    # add the sequence
    $in_line = <IN>;
    $sequence .= $in_line;

    # add the quality header
    $in_line = <IN>;
    $sequence .= $in_line;

    # add the quality
    $in_line = <IN>;
    $sequence .= $in_line;

    open ( OUT, ">sequence.$counter.fastq" ) or die;
    print OUT $sequence;
    close OUT;

    $counter++;
}

Simon Cockell · Answer 5 · 2011-08-31

0

Entering edit mode

14.0 years ago

Ssonia ▴ 70

I think my asking question was only wrong as nobody is getting what I actually wanted the script to do, sorry for these. I have a look around to other questions over here and almost similar type question was there asked with title - Split Fastq file into different files only comprising one chromosome each.

And answer was - seqretsplit -sformat fastq-sanger -osformat fastq test.fastq

which actually was splitting the sequences

So far I was trying it with Bioperl

#!/bin/perl -w

use Bio::SeqIO;

$in = Bio::SeqIO->new(-file=> "test.fastq", -format =>"fastq"

                          );

$out = Bio::SeqIO->new(-file=>">test", -format =>"fasta");

while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }

which was giving all sequences in one file not in separate.

ADD COMMENT • link updated 14.0 years ago by Simon Cockell 7.4k • written 14.0 years ago by Ssonia ▴ 70

0

Entering edit mode

Problem there is that you've only given $out a single filename. Try

while(my $seq = $in->nextseq()){$out = Bio::SeqIO->new(-file => $seq->id(), -format => "fasta"); $out->writeseq($seq) }

ADD REPLY • link 14.0 years ago by Russh ★ 1.2k

0

Entering edit mode

otherwise the code is perfect

ADD REPLY • link 14.0 years ago by Russh ★ 1.2k

0

Entering edit mode

Its showing an error -MSG: Could not open DRR001110.1: No such file or directory

ADD REPLY • link 14.0 years ago by Ssonia ▴ 70

0

Entering edit mode

yeah, sorry, you'd need new(-file => ">" . $seq->id(), -format => "fasta"); but from your code above, I can see you knew that though