Question

High Detection of Duplication in RNAseq dataset

0

Entering edit mode

4 months ago

karlensberg • 0

I have an RNAseq cohort of mouse tumors which were induced by virus infection. The issue I'm running into is I'm detecting SUPER high levels of sequence duplication and I'm not sure if I should assume that the duplication is biological or an artifact of PCR or something (I know there's no direct answer but I'm not sure how to move forward). For reference we used the NEBnext directional, paired end library prep kit with a ribosomal RNA depletion step. Unfortunately the kit we used did not incorporate UMIs, so we are unable to distinguish between artifacts that way. And also if it's informative we used the maximum amount of RNA as input (1ug) and did the minimum number of PCR cycles as recommended by the kit (which I think was 8). First off when I run FastQC on the samples I'm detecting high duplication levels in some levels and super high levels in others as depicted in the images below.

enter image description here

I was concerned about this but I was also aware that fastQC only takes one fastq file into consideration at a time (not both) so I carried on running cutadapt, followed by STAR. In an effort to get a better understanding of the levels of sequence duplication I ran the markdup command by samtools in one of our worst samples and the output was unfortunately in agreement with fastqc:

COMMAND: samtools markdup -f markdup_stats.txt -d 2500 -@ 5 - GRCm38Dedupe.Aligned.sortedByCoord.out.bam
READ: 268422239
WRITTEN: 268422239
EXCLUDED: 76459218
EXAMINED: 191963021
PAIRED: 191963008
SINGLE: 13
DUPLICATE PAIR: 164180024
DUPLICATE SINGLE: 13
DUPLICATE PAIR OPTICAL: 15364500
DUPLICATE SINGLE OPTICAL: 1
DUPLICATE NON PRIMARY: 0
DUPLICATE NON PRIMARY OPTICAL: 0
DUPLICATE PRIMARY TOTAL: 164180037
DUPLICATE TOTAL: 164180037
ESTIMATED_LIBRARY_SIZE: 13915916

I guess my question is do you have advice as to what I should do moving forward? Should I not do the deduplication because I'm losing so many reads, or do I have to? Is it possible these reads are biological and not just PCR artifacts?

Deduplication RNAseq • 1.0k views

ADD COMMENT • link updated 5 weeks ago by i.sudbery 21k • written 4 months ago by karlensberg • 0

score 4 · Answer 1 · 2024-09-11

4

Entering edit mode

4 months ago

i.sudbery 21k

The first sample I'd probably not worry about, but you've definately got a problem in that second sample (I'm assuming that is what the samtools output is from). Unfortunately, I don't think removing duplicates is a particularly good option here, just feels like that is a dud sample.

However, I would be slightly wary of diagnosing the problem as PCR duplication. With only 8 rounds of PCR, the maximum number of decendants of an original molecule is 256, yet in that sample you have >100K copies (some of these won't be true copies, as this is single end duplication), but even so, this is probably too high.

Two things I would check:

Your fastQC suggests you are failing adaptor content checks. What fraction of reads have adaptor? Is is possible that your duplicates are actaully adaptor dimers?
What is the ribosomal contamination? How many of these reads map to either rRNA genes, or ribosomal protein genes?

I'll just finally note that 130M read pairs for a buik sample, is LOT of reads.

ADD COMMENT • link 4 months ago by i.sudbery 21k

1

Entering edit mode

130M reads is fairly typical for a bulk total RNA prep really. You pick up a ton more transcripts without the polyA selection, and if you want to look at alternative splicing or identify transcripts de novo, the additional reads really help.

ADD REPLY • link 4 months ago by jared.andrews07 ★ 18k

0

Entering edit mode

Thanks for the response! The adaptor content ranges between ~10-30% on the tail end of the reads in the samples. Heres a representative sample that we had:

enter image description here

This is why I began incorporating cutadapt into my workflow for processing these samples. I'm fairly confident that it's not primer dimers as all of the samples were ran on bioanalyzer and I'm working under the impression that a pretty clear peak would show up if we were getting primer dimers. Also, even if there were primer dimers I'm assuming that they would be removed in the cutadapt step. If it helps heres the command I'm running for that:

cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-j 10 -o AC05trimmed.R1.fastq.gz -p AC05trimmed.R2.fastq.gz \
AC05_1.fastq.gz AC05_2.fastq.gz \
-m 15

and the output for one of the samples:

Total read pairs processed: 117,228,555
Read 1 with adapter: 43,501,641 (37.1%)
Read 2 with adapter: 42,090,827 (35.9%)
Pairs written (passing filters):  117,228,555 (100.0%)

Total basepairs processed: 35,403,023,610 bp
Read 1: 17,701,511,805 bp
Read 2: 17,701,511,805 bp
Total written (filtered): 33,299,771,975 bp (94.1%)
Read 1: 16,629,715,004 bp
Read 2: 16,670,056,971 bp

In terms of ribosomal contamination I tried to get this info but I'm not sure if the way I did it works. Basically I tried generating an rRNA bed file from the pre-existing gtf file I had by filtering for entries that contained 'gbkey "rRNA"' (I'm using the GRCm38 reference). Then I used bedtools intersect to find all of the reads in my bam file that overlapped with my bed file:

bedtools intersect -a GRCm38Aligned.sortedByCoord.out.bam -b B6.rRNA.GRCm38.bed -wa > GRCm38.rRNA.Aligned.sortedByCoord.out.bam

and then I ran idxstats from samtools to get a summary of the output. From this I only got around 2.8 M reads. If the way I got to this number is wrong let me know.

Assuming that it's not primer dimers or ribosomal contamination, does that just mean the samples are duds? Also does 130 M read pairs just mean that we sequenced at a very high depth? Sorry I still very new to all this, so I'm trying to get a better idea of how everything works. Thank you again for the help!!

ADD REPLY • link 4 months ago by karlensberg • 0

1

Entering edit mode

Have you looked at the bed file you generated? In my experience, GRCh39 for example does not annotate rRNAs so you might be missing a bunch. Silva is a place to start but doesn't look like it includes all of the 5S isoforms.

ADD REPLY • link 4 months ago by Trivas ★ 1.8k

0

Entering edit mode

Yeah I looked at the BED file, I see a lot of 5S rRNA isoforms, 2 mitochondria rRNAs, and one 28s rRNA annotation. I don't know if I should expect to also see 18s or if thats a precursor that's typically not annotated.

ADD REPLY • link 4 months ago by karlensberg • 0

0

Entering edit mode

You should absolutely find the annotation for 18S, I like using this figure to see rRNA processing: https://www.researchgate.net/figure/Repeat-unit-of-the-45S-rRNA-gene-tandemly-repeated-in-human-chr-13-14-15-21-and_fig11_38019447

You can also look at the mitochondrial genome and look for an annotation for MT-RNR1/2 corresponding to 18S and 16S respectively to verify

ADD REPLY • link 4 months ago by Trivas ★ 1.8k

0

Entering edit mode

Yeah I have the mt-rnr1 and mt-rnr2 genes in my bed file.

ADD REPLY • link 4 months ago by karlensberg • 0

0

Entering edit mode

However, I would be slightly wary of diagnosing the problem as PCR duplication. With only 8 rounds of PCR, the maximum number of decendants of an original molecule is 256, yet in that sample you have

100K copies (some of these won't be true copies, as this is single end duplication), but even so, this is probably too high.

Good analysis. Although he seems to just be guessing that it was 8.

Anyway, RNA-seq is rife with duplicates. To get rid of some of the artificial ones, you can use BBTools:

clumpify.sh in=reads.fq out=deduped.fq dedupe optical ddist=50 passes=6

The specific distance that you use for optical duplicate removal is platform-specific. 50 is fine for NovaSeqX, NextSeq, and HiSeq 2000, 1T, and 2500. Consult the shellscript documentation for other platforms, it's highly variable...

That's generally a safe operation to reduce Illumina-specific artifactual duplicates. You can always get rid of all duplicates with "dedupe" leaving out "optical" but that makes no sense for quantitative RNA-seq experiments.

Doing any kind of PCR amplification on quantitative RNA-seq experiments will bias your results, so it should be avoided if possible, but if you have to do it, you have to do it. You can't remove the artifacts with digital deduplication, though, without compromising your results.

ADD REPLY • link 4 months ago by Brian Bushnell 20k

score 0 · Answer 2 · 2024-12-10

The high sequence duplication you’re observing in your RNA-Seq data can be a challenge, especially without unique molecular identifiers (UMIs) to distinguish PCR artifacts from biological duplicates. Here are some steps you can take to evaluate and address the issue:

Assess Biological Duplication: Since your tumors are induced by viral infection, high biological duplication is plausible, particularly if the tumors contain clonal populations or high-abundance transcripts. You can check for biological duplication by visualizing the distribution of read counts across genes. If specific genes or viral transcripts dominate, this suggests biological duplication.

Deduplication Considerations: Given the high number of duplicates, deduplication might lead to significant data loss. If the duplicates are biological, removing them could negatively impact your results. Instead, consider analyzing the data with and without deduplication to compare outcomes.

Optimize rRNA Depletion: Residual ribosomal RNA (rRNA) can contribute to high duplication levels. To improve rRNA depletion in future experiments, consider using Zymo Research’s PureRec Duplex-Specific Nuclease (DSN). PureRec DSN efficiently removes rRNA and highly abundant transcripts, enhancing library complexity and reducing PCR biases. This can help mitigate duplication issues and provide more balanced, informative data.

Library Complexity Metrics: Review metrics like estimated library size (which you’ve calculated) and complexity curves to understand if your library is sufficiently diverse. If complexity is low, increasing input RNA or optimizing library prep conditions may help.

Biological Replicates: If possible, analyze multiple biological replicates to see if duplication patterns are consistent across samples. This can help differentiate experimental artifacts from true biological signals.

In summary, if the duplication appears biological (e.g., specific transcripts are overrepresented), you may want to proceed without deduplication. However, optimizing rRNA depletion with PureRec DSN in future experiments can improve library complexity and reduce duplication caused by PCR artifacts.

Hope this helps!