Hi all,
Any script available for doing sequence length disrtibution of fastq file?
Thanks, Deepthi
Hi all,
Any script available for doing sequence length disrtibution of fastq file?
Thanks, Deepthi
Here is a solution using awk
:
awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' file.fastq
It reads like this: every second line in every group of 4 lines (the sequence line), measure the length of the sequence and increment the array cell corresponding to that length. When all lines have been read, loop over the array to print its content. In awk
, arrays and array cells are initialized when they are called for the first time, no need to initialize them before.
Another awk solution, very similar to Frédéric's:
cat reads.fastq | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > read_length.txt
And to quickly obtain a graph in R:
reads<-read.csv(file="read_length.txt", sep="", header=FALSE)
plot (reads$V2,reads$V1,type="l",xlab="read length",ylab="occurences",col="blue")
Another possibility, from BBMap:
readlength.sh in=reads.fq out=histogram.txt
The default is 10bp bins with a max of 80kbp, but those can be configured (run the shellscript with no arguments for details). It's very fast, and handles fastq/fasta/sam/bam; raw/gzip/bzip2.
Save it as fastq_lenght.py and run it :)
#!/usr/bin/env python
#-*- coding: UTF-8 -*-
import sys
######################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python fastq_lenght.py file.fastq
result: .txt file same name as input name plus "_lenghts.txt"
------------------------------------------------------------------------------------
'''
######################################################################################
if len(sys.argv) != 2:
print syntax
sys.exit()
######################################################################################
fastq_file = open(sys.argv[1], 'r')
prefix = sys.argv[1].split('.')[0]
outfile = open(prefix + '_' + 'lenghts.txt', 'w')
seq = ''
for line in fastq_file:
line = line.rstrip('\n')
if line.startswith('@'):
if seq:
outfile.write(name + '\t' + str(len(seq)) + '\n')
seq = ""
name = line
else:
seq = line
outfile.write(name + '\t' + str(len(seq)) + '\n')
fastq_file.close()
outfile.close()
print '\n' + '\t' + 'File: ' + prefix + '_' + 'lenghts.txt has been created...'
You can do this using fastx toolkit, especially, you might be interested in this program:
fastx_nucleotide_distribution_graph.sh
and maybe:
fastx_quality_stats
fastq_quality_boxplot_graph.sh
Use this post as a help Fastq Quality Read and Score Length Check and apply a few basic unix commands like sort and uniq to get the frequency.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hello,
I am trying to do something similar, were I want to determine both the read length and how many reads are in my fastq file. Here is my code:
But I keep getting the following error:
I'm not sure what I've done incorrectly? I also tried Frederic's code above and although I got that to run, its not exactly the output I'm seeking, I should be returning something like 2420797 100
Any help would be super appreciated!
Between print and length should be at least one space. If you want to compute the number of reads simultaneously, use something like Frèdèric's approach
Thanks! This was a winner.