Counting number of reads and bases fastQ with python

1

Entering edit mode

6.9 years ago

Chvatil ▴ 140

Hi all, does someone know a way to count a many base are in a fastaQ file and how many read as well? (R1 and R2 fastQ files from illumina).

And if it is possible to do it with python?

Thanks for you help :)

fastq python biopython • 8.3k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 6.9 years ago by Chvatil ▴ 140

3

Entering edit mode

Many other ways described here: https://bioinformatics.stackexchange.com/questions/935/fast-way-to-count-number-of-reads-and-number-of-bases-in-a-fastq-file

This is about as easy as it can get: (reformat.sh from BBMap suite).

$ reformat.sh in=reads.fq
Input:                          4000000 reads           297852680 bases

Time:                           2.339 seconds.

ADD REPLY • link 6.9 years ago by GenoMax 150k

0

Entering edit mode

6.9 years ago

finswimmer 16k

Hello Toto26,

here is a naiv python implementation:

	import gzip

	reads = 0
	bases = 0

	with gzip.open('your.fastq.gz', 'rb') as read:
	for id in read:
	seq = next(read)
	reads += 1
	bases += len(seq.strip())
	next(read)
	next(read)

	print("reads", reads)
	print("bases", bases)

view raw count_reads.py hosted with ❤ by GitHub

And here another one using unix standard commands:

$ zcat your.fastq.gz|paste - - - -|cut -f2|wc -c -l|awk -v OFS="\n" '{print "reads: "$1, "bases: "$2-$1}'

What the command is doing is:

zcat decompress the gziped fastq file
paste puts 4 lines in a row (delimited by tab) as a typically fastq file have 4 lines per read.
cut extract the second column as this column contains now the sequence of the read
wc counts the total number of characters (= total number of bases + line breaks) and the total number of lines (= total number of reads)
we can use awk to subtract the number of lines from the number of characters calculated by wc, as wc counts the line breaks as a characters

Edit:

I knew that loops in python are slow, but it's amazing how slow. For 1.5Gb large file with 31434394 reads and 2355832933 bases the python solution above needs around 2 minutes to finish. Using zcat etc. runs in 30 second. So for testing I modified the python code in that way, that it reads from stdin serving the result of zcat your.fastq.gz|paste - - - -, split the line and calculate read length. Now even python runs in 30 seconds.

More speed is possible with using pigz instead of zcat. The runtime goes down to 23 seconds.

fin swimmer

ADD COMMENT • link 6.9 years ago by finswimmer 16k

0

Entering edit mode

6.9 years ago

drkennetz ▴ 560

This should give you what you are looking for:

#!/usr/bin/env python

import sys
fastq = sys.argv[1]
with open(fastq, 'r') as f:
    reads = []
    for line in f:
        if line.startswith("@"):
            reads.append(line)
print("The total number of reads is: ", len(reads))
totalReads = len(reads)
bases = []
with open(fastq, 'r') as f:
    for line in f:
        if line.startswith("A") or line.startswith("T") or line.startswith("C") or line.startswith("G"):
            numBases = line.split()
            bases.append(numBases)
flatList = [item for sublist in bases for item in sublist] # split a list of lists into a single list
total = []
for i in flatList:
    totalBases = list(i)
    count = 0
    count += (len(totalBases))
    total.append(count)
allBases = sum(total)
print("The total number of bases is:", allBases)
avgReadLength = allBases/totalReads
print("The avg read length is:", avgReadLength)

To run the code:

$ python fastqcount.py youfastqfile.fastq

The first block of "with open" code opens your fastq read only and checks to see if the line in the file starts with @ (every header line in fastq files starts with @. If it does, it appends it to a list called reads, then we just print the length of the list which gives the number of reads. We then initialize a new list called bases. We use the if statement to see if the line starts with a base (A,T,C,G) and if it does it adds the read sequence to a list called bases. But this list is currently a list of lists, with each list in the main list being a read sequence. We split this into 1 flat list, so we no longer have a list of lists, but rather a list of sequences. We then initialize a list called total, and iterate through each sequence of the list called flatList. To split the big list called flatList into a list of individual bases we use "totalBases = list(i)", then we get the length of each sequence. We do this by initializing a variable called count, and for every sequence in totalBases, we get the length and add it to count. We then put all of the different sequence lengths into the list of total with the "total.append(count)". Lastly, we define a variable called allBases and we just sum the list total. I stuck the average read length function on there as well in case you wanted that. I saw some answers above, but if you are asking this question you are probably pretty new to python. This may help you understand what each step above is doing without using too many built-in functions.

ADD COMMENT • link 6.9 years ago by drkennetz ▴ 560

0

Entering edit mode

Hello drkennetz,

good work for taking the time to describe what your code is doing. It animate me to add some comments to my post. Keep on writing!

Please let me comment on two problems I see in your code:

You assume that the fastq file is not gz compressed. That is possible, of course. But most of the time a user have a compressed version of it.
You are trying to identify the lines that hold sequences by looking whether the lines starts with A, T, C or G. But it is also possible that a sequence have an N. Also the quality encoding line have these characters. So it could happend that you count this line as well.

fin swimmer

ADD REPLY • link 6.9 years ago by finswimmer 16k

Login before adding your answer.