Revisiting the FastQC read duplication report

10.7 years ago

Istvan Albert 102k

With a new release of FastQC, the post titled "So What Does The Sequence Duplication Rate Really Mean In A Fastqc Report" has lost its relevance. This is a follow-up and a short discussion of the new plots and their interpretation.

The new plots now contain two different curves, and the meaning of the percentage has also changed. The explanations in the docs are a little bit lacking, so to make sure I got it right I wrote a Python implementation (see the end) that produces the same plots.

I found it helpful to use the term "distinct" sequences rather than unique sequences as this latter term seems to imply to some that those sequences are present only once in the data. So distinct sequences are defined as the largest subset of sequences where no two sequences are identical.

Thus distinct sequences = number of singletons (sequences that appear only once) + number of doubles (sequences that appear twice, each counted only once) + number of triplets (sequences that appear three times, each counted only once) ... and so on.

The percentage in the title is computed as distinct/total * 100.
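As a minimal sketch of the definition above (not FastQC's own code), the percentage can be computed from a list of read sequences with a counter. The function name `percent_remaining` and the toy reads are made up for illustration.

```python
from collections import Counter

def percent_remaining(reads):
    """Return (total, distinct, percent) for a list of read sequences."""
    counts = Counter(reads)          # sequence -> number of occurrences
    total = sum(counts.values())     # every read, duplicates included
    distinct = len(counts)           # each different sequence counted once
    return total, distinct, 100.0 * distinct / total

# Hypothetical toy data: "AAA" appears three times, "CCC" twice, "GGG" once.
reads = ["AAA", "AAA", "AAA", "CCC", "CCC", "GGG"]
total, distinct, percent = percent_remaining(reads)
print("Total:%d Distinct:%d Remaining:%.1f%%" % (total, distinct, percent))
```

Here 3 distinct sequences out of 6 reads give 50% remaining after deduplication.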

The blue line represents the counts of all the sequences that are duplicated at a given rate. The percentage is computed relative to the total number of reads.

The red line represents the number of distinct sequences that are duplicated at a given rate. The percentage is computed relative to the total number of distinct sequences in the data.

Let's take two examples where each contains 20 reads:

Case 1: 10 unique reads + 5 reads each present twice (duplicates)
Case 2: 10 unique reads + 1 read present 10 times

Case 1, shown in the upper plot, leads to 15 distinct reads, so 15/20 = 75% of the reads remain after deduplication. The singletons account for 1x10 = 10 reads and the doubles for 5x2 = 10 reads, so the blue line has a plateau at 10/20 = 50% across those two rates. The 15 distinct sequences are split into 10 singletons and 5 doubles, so the red line is at 10/15 = 67% for singletons and drops to 5/15 = 33% for doubles.

Case 2 produces 11 distinct reads, so 11/20 = 55% of the reads remain. Again the reads are split equally between the two duplication rates (10 singleton reads and 10 copies of one read), but this time the second peak of the blue line is at 10x, since the one read duplicated 10 times accounts for 10 sequences. There are 11 groups in total: 10/11 = 91% are singletons and 1/11 = 9% of the groups sit at a duplication rate of 10x.

Below is the Python code that was used to plot the above.
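The arithmetic for the two cases can be checked with a short sketch. It assumes the input is one count per distinct sequence and works on exact duplication levels rather than FastQC's binned levels; the name `duplication_curves` is made up for illustration.

```python
from collections import Counter

def duplication_curves(counts):
    """counts: one entry per distinct sequence, giving how often it occurs.

    Returns (percent_remaining, blue, red), where blue and red map a
    duplication level to a percentage.
    """
    total = sum(counts)
    distinct = len(counts)
    groups = Counter(counts)  # duplication level -> number of distinct sequences
    # Blue: reads at this level, relative to all reads.
    blue = {lvl: 100.0 * lvl * n / total for lvl, n in groups.items()}
    # Red: distinct sequences at this level, relative to all distinct sequences.
    red = {lvl: 100.0 * n / distinct for lvl, n in groups.items()}
    return 100.0 * distinct / total, blue, red

case1 = [1] * 10 + [2] * 5   # 10 unique reads + 5 reads each present twice
case2 = [1] * 10 + [10]      # 10 unique reads + 1 read present 10 times

for name, counts in [("Case 1", case1), ("Case 2", case2)]:
    remaining, blue, red = duplication_curves(counts)
    print(name, "remaining: %.0f%%" % remaining)
    for lvl in sorted(blue):
        print("  level %2d: blue %.0f%%, red %.0f%%" % (lvl, blue[lvl], red[lvl]))
```

In both cases the blue values are 50% at each of the two levels, while the red values are 67%/33% for Case 1 and 91%/9% for Case 2.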

	#
	# FastQC style de-duplication stats and plot
	#
	#
	# The input file for this program needs to be generated via the command line with
	# a command like so:
	#
	# cat data.fq | bioawk -c fastx '{ print substr($seq,1, 50) } ' | sort | uniq -c | sort -k1,1 -rn > data.uniq.txt
	#
	# see bioawk at: https://github.com/lh3/bioawk
	#

	import sys
	import matplotlib.pyplot as plt

	def get_count(line):
	    "Function to extract the count from a line in the file"
	    # Each line looks like "<count> <sequence>"; uniq -c pads the count with spaces.
	    return int(line.split()[0])

	fname = "data.uniq.txt"

	# Get the counts into a list of numbers, one per distinct sequence.
	with open(fname) as stream:
	    counts = [get_count(line) for line in stream if line.strip()]

	# This is the total number of reads.
	total = sum(counts)

	# This is the number of distinct sequences.
	distinct = len(counts)

	# This is the number of singletons
	single = sum(1 for x in counts if x == 1)

	print("Total:%s, Distinct:%s, Singleton:%s" % (total, distinct, single))

	# Generate the breaks
	lower = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100, 500, 1000, 5000, 10000]
	upper = lower[1:] + [sys.maxsize]

	y_sum, y_grp = [], []

	for lo, hi in zip(lower, upper):

	    # The counts that fall into the break lo <= x < hi
	    vals = [x for x in counts if lo <= x < hi]

	    # Collect both the sum of reads and the number of groups
	    y_sum.append(sum(vals))
	    y_grp.append(len(vals))

	x_val = list(range(len(lower)))
	x_tick = [str(x) for x in lower]

	# Blue is relative to all reads, red is relative to distinct sequences.
	y_sum = [100.0 * x / total for x in y_sum]
	y_grp = [100.0 * x / distinct for x in y_grp]

	p_sum, = plt.plot(x_val, y_sum, '-', lw=3, color='blue')
	p_count, = plt.plot(x_val, y_grp, '-', lw=3, color='red')
	plt.xticks(x_val, x_tick)
	plt.yticks(range(0, 101, 10))
	plt.grid(True)
	plt.title("Percent of seq remaining if deduplicated %4.2f%%" % (100.0 * distinct / total))
	plt.legend([p_sum, p_count], ["Total", "Distinct"])
	plt.show()
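If bioawk is not at hand, the counting step of the shell pipeline can be approximated in Python. This is only a sketch under stated assumptions: the input is an uncompressed FASTQ file with exactly four lines per record (no wrapped sequences), and the function name `count_sequences` is made up for illustration.

```python
from collections import Counter

def count_sequences(fastq_path, out_path, length=50):
    """Write '<count> <sequence>' lines, most frequent sequence first."""
    counts = Counter()
    with open(fastq_path) as stream:
        for index, line in enumerate(stream):
            # Assumption: four-line FASTQ records, so the sequence is
            # the second line of every record.
            if index % 4 == 1:
                # Keep only the first `length` bases, as substr() does above.
                counts[line.strip()[:length]] += 1
    with open(out_path, "w") as out:
        for seq, count in counts.most_common():
            out.write("%d %s\n" % (count, seq))
    return counts
```

Calling `count_sequences("data.fq", "data.uniq.txt")` should then produce a file that the plotting script can read, since only the leading count on each line is used.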

updated 3.4 years ago by Ram 45k • written 10.7 years ago by Istvan Albert 102k

After going through your post (which is very informative indeed) I went through the FastQC documentation: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html

It states that "The plot shows the proportion of the library which is made up of sequences in each of the different duplication level bins. There are two lines on the plot. The red line takes the full sequence set and shows how its duplication levels are distributed. In the blue plot the sequences are de-duplicated and the proportions shown are the proportions of the deduplicated set which come from different duplication levels in the original data."

I think they have exchanged the definitions of the red and blue lines or am I wrong?

updated 3.4 years ago by Ram 45k • written 10.1 years ago by shreygandhi1990 • 0

One of the reasons I went ahead and generated these plots (and the code for them) was that I did not understand the explanations in the help module. Note how the red line is also labeled "de-duplicated sequences" on the plot itself. I could not figure out what that meant.

updated 3.4 years ago by Ram 45k • written 10.1 years ago by Istvan Albert 102k

I contacted Simon Andrews on this subject, because I didn't understand the meaning of "de-duplicated sequences" and he gave me a link where there is a good explanation of that:

http://proteo.me.uk/2013/09/a-new-way-to-look-at-duplication-in-fastqc-v0-11/

updated 3.4 years ago by Ram 45k • written 10.0 years ago by a.kmg ▴ 70
