I am trying to understand a bit more deeply how duplications occur and how to deal with them in NGS analysis. First of all, I wanted to understand the FastQC read duplication report, for which Istvan Albert's tutorial (Revisiting the FastQC read duplication report) is really good.
Running FastQC on my FASTQ file produced this report:
The title shows the proportion of duplicated reads, which is (as far as I can understand) very high. I have run samtools rmdup and Picard MarkDuplicates on this file, and the proportion of duplicated reads detected and removed/marked is around 15%.
So my question is: shouldn't all duplicated reads be removed when applying a duplicate-removal tool?
My second question: for the simple simulation that Istvan Albert does in his post, I can understand what the red and blue lines are telling me. However, what are my red and blue lines telling me in a more realistic scenario like this one (e.g. why is there a peak between 9 and >10)?
Two comments:
Thanks for your answer. This is for a clinical analysis, and GATK best practices recommend removing duplicates.
I am still wondering whether having ~50% duplicated reads is normal, and why FastQC says I have 48% while rmdup and MarkDuplicates remove only 15%.
Indeed, for such an analysis it is advised to remove duplicates.
The duplication rate also depends on the biological complexity of your sample(s) combined with the sequencing depth.
Why one tool detects or removes more duplicates than another I don't know for sure, but it is quite possible that they use different definitions of duplication and/or have different levels of sensitivity for picking them up.
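To see how the definition alone can change the number, here is a small toy simulation (hypothetical data, not your library): one metric counts a read as a duplicate whenever a single coordinate repeats (roughly what a single-end, sequence-only view sees), the other only when both mates of a pair repeat together (roughly how MarkDuplicates treats proper pairs). Same fragments, very different percentages.

```python
from collections import Counter
import random

random.seed(0)

# Toy model: each fragment is a (start1, start2) pair of mate positions.
# Hypothetical numbers chosen only to make the contrast visible.
fragments = [(random.randrange(100), random.randrange(1000))
             for _ in range(10_000)]

# Definition 1: only read 1 is considered (single-end / sequence-style view).
# A fragment is a duplicate if its start1 was already seen.
c1 = Counter(f[0] for f in fragments)
dup_single = sum(n - 1 for n in c1.values()) / len(fragments)

# Definition 2: both mates must match (paired-end, position-based view,
# similar in spirit to how Picard MarkDuplicates handles proper pairs).
c2 = Counter(fragments)
dup_pair = sum(n - 1 for n in c2.values()) / len(fragments)

print(f"single-end style duplication rate: {dup_single:.1%}")
print(f"paired-end  style duplication rate: {dup_pair:.1%}")
```

The single-end style rate comes out far higher than the paired-end style rate on the same data, which is one plausible reason a read-level estimate (like FastQC's) and a pair-aware, alignment-based tool can disagree so much.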
What type of libraries do you have? Is this RNA-seq, shotgun whole genome, or exome? For RNA-seq and exome it would be normal to have a high duplication rate.
To get a true estimate of duplicates you can use clumpify.sh from the BBMap suite, which does a purely sequence-based analysis; no alignment is needed, unlike the other tools you mention. It can require perfect matches, can work on paired-end reads at the same time, and can also remove the duplicates. See: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files.
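A typical invocation might look like the sketch below (file names are hypothetical; `dedupe` turns on duplicate removal and `subs=0` requires perfect sequence matches):

```shell
# Sequence-based duplicate removal on a paired-end library with clumpify.sh
# from the BBMap suite. Input/output names here are placeholders.
clumpify.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
            out=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz \
            dedupe subs=0
```

Because this works on the raw FASTQ sequences, the percentage it reports can be compared directly with FastQC's estimate, independent of any alignment.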