Hi all, does someone know a way to count a many base are in a fastaQ file and how many read as well? (R1 and R2 fastQ files from illumina).
And if it is possible to do it with python?
Thanks for you help :)
Hi all, does someone know a way to count a many base are in a fastaQ file and how many read as well? (R1 and R2 fastQ files from illumina).
And if it is possible to do it with python?
Thanks for you help :)
Hello Toto26,
here is a naiv python implementation:
And here another one using unix standard commands:
$ zcat your.fastq.gz|paste - - - -|cut -f2|wc -c -l|awk -v OFS="\n" '{print "reads: "$1, "bases: "$2-$1}'
What the command is doing is:
zcat
decompress the gziped fastq filepaste
puts 4 lines in a row (delimited by tab) as a typically fastq file have 4 lines per read.cut
extract the second column as this column contains now the sequence of the readwc
counts the total number of characters (= total number of bases + line breaks) and the total number of lines (= total number of reads)awk
to subtract the number of lines from the number of characters calculated by wc
, as wc
counts the line breaks as a charactersEdit:
I knew that loops in python are slow, but it's amazing how slow. For 1.5Gb large file with 31434394 reads and 2355832933 bases the python solution above needs around 2 minutes to finish. Using zcat
etc. runs in 30 second. So for testing I modified the python code in that way, that it reads from stdin serving the result of zcat your.fastq.gz|paste - - - -
, split the line and calculate read length. Now even python runs in 30 seconds.
More speed is possible with using pigz
instead of zcat
. The runtime goes down to 23 seconds.
fin swimmer
This should give you what you are looking for:
#!/usr/bin/env python
import sys
fastq = sys.argv[1]
with open(fastq, 'r') as f:
reads = []
for line in f:
if line.startswith("@"):
reads.append(line)
print("The total number of reads is: ", len(reads))
totalReads = len(reads)
bases = []
with open(fastq, 'r') as f:
for line in f:
if line.startswith("A") or line.startswith("T") or line.startswith("C") or line.startswith("G"):
numBases = line.split()
bases.append(numBases)
flatList = [item for sublist in bases for item in sublist] # split a list of lists into a single list
total = []
for i in flatList:
totalBases = list(i)
count = 0
count += (len(totalBases))
total.append(count)
allBases = sum(total)
print("The total number of bases is:", allBases)
avgReadLength = allBases/totalReads
print("The avg read length is:", avgReadLength)
To run the code:
$ python fastqcount.py youfastqfile.fastq
The first block of "with open" code opens your fastq read only and checks to see if the line in the file starts with @ (every header line in fastq files starts with @. If it does, it appends it to a list called reads, then we just print the length of the list which gives the number of reads. We then initialize a new list called bases. We use the if statement to see if the line starts with a base (A,T,C,G) and if it does it adds the read sequence to a list called bases. But this list is currently a list of lists, with each list in the main list being a read sequence. We split this into 1 flat list, so we no longer have a list of lists, but rather a list of sequences. We then initialize a list called total, and iterate through each sequence of the list called flatList. To split the big list called flatList into a list of individual bases we use "totalBases = list(i)", then we get the length of each sequence. We do this by initializing a variable called count, and for every sequence in totalBases, we get the length and add it to count. We then put all of the different sequence lengths into the list of total with the "total.append(count)". Lastly, we define a variable called allBases and we just sum the list total. I stuck the average read length function on there as well in case you wanted that. I saw some answers above, but if you are asking this question you are probably pretty new to python. This may help you understand what each step above is doing without using too many built-in functions.
Hello drkennetz,
good work for taking the time to describe what your code is doing. It animate me to add some comments to my post. Keep on writing!
Please let me comment on two problems I see in your code:
fastq
file is not gz compressed. That is possible, of course. But most of the time a user have a compressed version of it.fin swimmer
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Many other ways described here: https://bioinformatics.stackexchange.com/questions/935/fast-way-to-count-number-of-reads-and-number-of-bases-in-a-fastq-file
This is about as easy as it can get: (
reformat.sh
from BBMap suite).