How To Write A Perl Script To Parse Fastq Files
4
0
Entering edit mode
11.2 years ago
freddy ▴ 40

I am undergrad learning Perl programming for bioinformatics, and having problems writing a script. I wanted to know if any one has script that counts the number of sequences in a fastq but excludes everything else such as the line that begins with @ and + and the quality score. Thanks

perl script fastq • 9.0k views
ADD COMMENT
2
Entering edit mode

First try it yourself. If it doesnt work, try these posts. They are pretty relevant to what you are trying to do Parse fastq file - pad reads with N's

Fastq Quality Read and Score Length Check

ADD REPLY
0
Entering edit mode
  You can follow this approach:
  Step 1: use next if for skipping any line 
  Step 2: take a variable as string and concatenate every line to this strings then take the length of whole string.
  Always remember that perl always read file line by line
ADD REPLY
0
Entering edit mode
11.2 years ago

Due to the vagaries of the FastQ definition parsing it correctly is a little trickier than one might initially think. This is due to the fact that the symbol indicating the start of a new record @ is also a valid choice for quality measure. Thus one always needs to keep track of the lenght of the sequence and use that as guidance on how many quality measures to read in.

On the other hand most fastq files tend to be formatted with the entire sequence (and quality) on a single line. In these cases parsing is trivial as it turns into the problem of correctly identifying the 4 lines that form a fastq record. For example to find the sequences use a modulo division to find the remainder and print the line if the remainder is equal to 1 (assuming that the line numbering starts at 0).

All data produced by short read sequencers are in this latter format.

ADD COMMENT
0
Entering edit mode
11.2 years ago
Lee Katz ★ 3.2k

I made such a script here. You can see how I parsed it.

http://www.cbcb.umd.edu/software/PBcR/data/run_assembly_convertMultiFastqToStandard.pl

It converts multi-line fastq file entries to a four-line style.

ADD COMMENT
0
Entering edit mode
11.2 years ago
JacobS ▴ 990

@freddy Here is my super simple answer for a beginner using perl to count the number of reads in a fastq file. Of course there are perl and awk one-liners that get the job done, but the following script really spells things out for you. All you are doing is reading through every 4 lines and increasing the total read count by 1. The following code can be copy and pasted to make a complete perl script:

#!/usr/bin/perl -w ## This specifies the file as a perl script

$line_position = 0;

$count = 0;

open(INPUT,$ARGV[0]) || die("Can't open file"); ## This opens the first user-privded argument as the input file.

while(<INPUT>) ## This reads each line of the file in, one at a time

{

$line_position++;

if($line_position == 4) ## If the script has already seen 4 lines, then reset the line counter and add 1 to the read counter!

{

$line_position = 0;

$count++;

}

}

close(INPUT);

print"Number of reads: $count\n"; ## This pring out your read counts!

An easier method would be to just do a UNIX line count on the file and divide by 4. wc -l myFile.fastq. This is all assuming you have a standard .fastq that has 4 lines per read. Good luck!

ADD COMMENT

Login before adding your answer.

Traffic: 1965 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6