Question

How To Write A Perl Script To Parse Fastq Files

0

Entering edit mode

11.5 years ago

freddy ▴ 40

I am undergrad learning Perl programming for bioinformatics, and having problems writing a script. I wanted to know if any one has script that counts the number of sequences in a fastq but excludes everything else such as the line that begins with @ and + and the quality score. Thanks

perl script fastq • 9.2k views

ADD COMMENT • link updated 11.5 years ago by JacobS ▴ 990 • written 11.5 years ago by freddy ▴ 40

2

Entering edit mode

First try it yourself. If it doesnt work, try these posts. They are pretty relevant to what you are trying to do Parse fastq file - pad reads with N's

Fastq Quality Read and Score Length Check

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.5 years ago by Ashutosh Pandey 12k

0

Entering edit mode

  You can follow this approach:
  Step 1: use next if for skipping any line 
  Step 2: take a variable as string and concatenate every line to this strings then take the length of whole string.
  Always remember that perl always read file line by line

ADD REPLY • link 11.5 years ago by always_learning ★ 1.1k

score 0 · Answer 1 · 2013-10-11

Due to the vagaries of the FastQ definition parsing it correctly is a little trickier than one might initially think. This is due to the fact that the symbol indicating the start of a new record @ is also a valid choice for quality measure. Thus one always needs to keep track of the lenght of the sequence and use that as guidance on how many quality measures to read in.

On the other hand most fastq files tend to be formatted with the entire sequence (and quality) on a single line. In these cases parsing is trivial as it turns into the problem of correctly identifying the 4 lines that form a fastq record. For example to find the sequences use a modulo division to find the remainder and print the line if the remainder is equal to 1 (assuming that the line numbering starts at 0).

All data produced by short read sequencers are in this latter format.

score 0 · Answer 2 · 2013-10-12

0

Entering edit mode

11.5 years ago

Lee Katz ★ 3.2k

I made such a script here. You can see how I parsed it.

http://www.cbcb.umd.edu/software/PBcR/data/run_assembly_convertMultiFastqToStandard.pl

It converts multi-line fastq file entries to a four-line style.

ADD COMMENT • link 11.5 years ago by Lee Katz ★ 3.2k

score 0 · Answer 3 · 2013-10-13

@freddy Here is my super simple answer for a beginner using perl to count the number of reads in a fastq file. Of course there are perl and awk one-liners that get the job done, but the following script really spells things out for you. All you are doing is reading through every 4 lines and increasing the total read count by 1. The following code can be copy and pasted to make a complete perl script:

#!/usr/bin/perl -w ## This specifies the file as a perl script

$line_position = 0;

$count = 0;

open(INPUT,$ARGV[0]) || die("Can't open file"); ## This opens the first user-privded argument as the input file.

while(<INPUT>) ## This reads each line of the file in, one at a time

{

$line_position++;

if($line_position == 4) ## If the script has already seen 4 lines, then reset the line counter and add 1 to the read counter!

{

$line_position = 0;

$count++;

}

close(INPUT);

print"Number of reads: $count\n"; ## This pring out your read counts!

An easier method would be to just do a UNIX line count on the file and divide by 4. wc -l myFile.fastq. This is all assuming you have a standard .fastq that has 4 lines per read. Good luck!