High Detection of Duplication in RNAseq dataset
2
0
Entering edit mode
3 months ago

I have an RNAseq cohort of mouse tumors which were induced by virus infection. The issue I'm running into is I'm detecting SUPER high levels of sequence duplication and I'm not sure if I should assume that the duplication is biological or an artifact of PCR or something (I know there's no direct answer but I'm not sure how to move forward). For reference we used the NEBnext directional, paired end library prep kit with a ribosomal RNA depletion step. Unfortunately the kit we used did not incorporate UMIs, so we are unable to distinguish between artifacts that way. And also if it's informative we used the maximum amount of RNA as input (1ug) and did the minimum number of PCR cycles as recommended by the kit (which I think was 8). First off when I run FastQC on the samples I'm detecting high duplication levels in some levels and super high levels in others as depicted in the images below.

enter image description here

enter image description here

I was concerned about this but I was also aware that fastQC only takes one fastq file into consideration at a time (not both) so I carried on running cutadapt, followed by STAR. In an effort to get a better understanding of the levels of sequence duplication I ran the markdup command by samtools in one of our worst samples and the output was unfortunately in agreement with fastqc:

COMMAND: samtools markdup -f markdup_stats.txt -d 2500 -@ 5 - GRCm38Dedupe.Aligned.sortedByCoord.out.bam
READ: 268422239
WRITTEN: 268422239
EXCLUDED: 76459218
EXAMINED: 191963021
PAIRED: 191963008
SINGLE: 13
DUPLICATE PAIR: 164180024
DUPLICATE SINGLE: 13
DUPLICATE PAIR OPTICAL: 15364500
DUPLICATE SINGLE OPTICAL: 1
DUPLICATE NON PRIMARY: 0
DUPLICATE NON PRIMARY OPTICAL: 0
DUPLICATE PRIMARY TOTAL: 164180037
DUPLICATE TOTAL: 164180037
ESTIMATED_LIBRARY_SIZE: 13915916

I guess my question is do you have advice as to what I should do moving forward? Should I not do the deduplication because I'm losing so many reads, or do I have to? Is it possible these reads are biological and not just PCR artifacts?

Deduplication RNAseq • 933 views
ADD COMMENT
4
Entering edit mode
3 months ago

The first sample I'd probably not worry about, but you've definately got a problem in that second sample (I'm assuming that is what the samtools output is from). Unfortunately, I don't think removing duplicates is a particularly good option here, just feels like that is a dud sample.

However, I would be slightly wary of diagnosing the problem as PCR duplication. With only 8 rounds of PCR, the maximum number of decendants of an original molecule is 256, yet in that sample you have >100K copies (some of these won't be true copies, as this is single end duplication), but even so, this is probably too high.

Two things I would check:

  1. Your fastQC suggests you are failing adaptor content checks. What fraction of reads have adaptor? Is is possible that your duplicates are actaully adaptor dimers?
  2. What is the ribosomal contamination? How many of these reads map to either rRNA genes, or ribosomal protein genes?

I'll just finally note that 130M read pairs for a buik sample, is LOT of reads.

ADD COMMENT
1
Entering edit mode

130M reads is fairly typical for a bulk total RNA prep really. You pick up a ton more transcripts without the polyA selection, and if you want to look at alternative splicing or identify transcripts de novo, the additional reads really help.

ADD REPLY
0
Entering edit mode

Thanks for the response! The adaptor content ranges between ~10-30% on the tail end of the reads in the samples. Heres a representative sample that we had:

enter image description here

This is why I began incorporating cutadapt into my workflow for processing these samples. I'm fairly confident that it's not primer dimers as all of the samples were ran on bioanalyzer and I'm working under the impression that a pretty clear peak would show up if we were getting primer dimers. Also, even if there were primer dimers I'm assuming that they would be removed in the cutadapt step. If it helps heres the command I'm running for that:

cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-j 10 -o AC05trimmed.R1.fastq.gz -p AC05trimmed.R2.fastq.gz \
AC05_1.fastq.gz AC05_2.fastq.gz \
-m 15 

and the output for one of the samples:

Total read pairs processed: 117,228,555
Read 1 with adapter: 43,501,641 (37.1%)
Read 2 with adapter: 42,090,827 (35.9%)
Pairs written (passing filters):  117,228,555 (100.0%)

Total basepairs processed: 35,403,023,610 bp
Read 1: 17,701,511,805 bp
Read 2: 17,701,511,805 bp
Total written (filtered): 33,299,771,975 bp (94.1%)
Read 1: 16,629,715,004 bp
Read 2: 16,670,056,971 bp

In terms of ribosomal contamination I tried to get this info but I'm not sure if the way I did it works. Basically I tried generating an rRNA bed file from the pre-existing gtf file I had by filtering for entries that contained 'gbkey "rRNA"' (I'm using the GRCm38 reference). Then I used bedtools intersect to find all of the reads in my bam file that overlapped with my bed file:

bedtools intersect -a GRCm38Aligned.sortedByCoord.out.bam -b B6.rRNA.GRCm38.bed -wa > GRCm38.rRNA.Aligned.sortedByCoord.out.bam

and then I ran idxstats from samtools to get a summary of the output. From this I only got around 2.8 M reads. If the way I got to this number is wrong let me know.

Assuming that it's not primer dimers or ribosomal contamination, does that just mean the samples are duds? Also does 130 M read pairs just mean that we sequenced at a very high depth? Sorry I still very new to all this, so I'm trying to get a better idea of how everything works. Thank you again for the help!!

ADD REPLY
1
Entering edit mode

Have you looked at the bed file you generated? In my experience, GRCh39 for example does not annotate rRNAs so you might be missing a bunch. Silva is a place to start but doesn't look like it includes all of the 5S isoforms.

ADD REPLY
0
Entering edit mode

Yeah I looked at the BED file, I see a lot of 5S rRNA isoforms, 2 mitochondria rRNAs, and one 28s rRNA annotation. I don't know if I should expect to also see 18s or if thats a precursor that's typically not annotated.

ADD REPLY
0
Entering edit mode

You should absolutely find the annotation for 18S, I like using this figure to see rRNA processing: https://www.researchgate.net/figure/Repeat-unit-of-the-45S-rRNA-gene-tandemly-repeated-in-human-chr-13-14-15-21-and_fig11_38019447

You can also look at the mitochondrial genome and look for an annotation for MT-RNR1/2 corresponding to 18S and 16S respectively to verify

ADD REPLY
0
Entering edit mode

Yeah I have the mt-rnr1 and mt-rnr2 genes in my bed file.

ADD REPLY
0
Entering edit mode

However, I would be slightly wary of diagnosing the problem as PCR duplication. With only 8 rounds of PCR, the maximum number of decendants of an original molecule is 256, yet in that sample you have

100K copies (some of these won't be true copies, as this is single end duplication), but even so, this is probably too high.

Good analysis. Although he seems to just be guessing that it was 8.

Anyway, RNA-seq is rife with duplicates. To get rid of some of the artificial ones, you can use BBTools:

clumpify.sh in=reads.fq out=deduped.fq dedupe optical ddist=50 passes=6

The specific distance that you use for optical duplicate removal is platform-specific. 50 is fine for NovaSeqX, NextSeq, and HiSeq 2000, 1T, and 2500. Consult the shellscript documentation for other platforms, it's highly variable...

That's generally a safe operation to reduce Illumina-specific artifactual duplicates. You can always get rid of all duplicates with "dedupe" leaving out "optical" but that makes no sense for quantitative RNA-seq experiments.

Doing any kind of PCR amplification on quantitative RNA-seq experiments will bias your results, so it should be avoided if possible, but if you have to do it, you have to do it. You can't remove the artifacts with digital deduplication, though, without compromising your results.

ADD REPLY
0
Entering edit mode
16 days ago
jnechacov • 0

The high sequence duplication you’re observing in your RNA-Seq data can be a challenge, especially without unique molecular identifiers (UMIs) to distinguish PCR artifacts from biological duplicates. Here are some steps you can take to evaluate and address the issue:

Assess Biological Duplication: Since your tumors are induced by viral infection, high biological duplication is plausible, particularly if the tumors contain clonal populations or high-abundance transcripts. You can check for biological duplication by visualizing the distribution of read counts across genes. If specific genes or viral transcripts dominate, this suggests biological duplication.

Deduplication Considerations: Given the high number of duplicates, deduplication might lead to significant data loss. If the duplicates are biological, removing them could negatively impact your results. Instead, consider analyzing the data with and without deduplication to compare outcomes.

Optimize rRNA Depletion: Residual ribosomal RNA (rRNA) can contribute to high duplication levels. To improve rRNA depletion in future experiments, consider using Zymo Research’s PureRec Duplex-Specific Nuclease (DSN). PureRec DSN efficiently removes rRNA and highly abundant transcripts, enhancing library complexity and reducing PCR biases. This can help mitigate duplication issues and provide more balanced, informative data.

Library Complexity Metrics: Review metrics like estimated library size (which you’ve calculated) and complexity curves to understand if your library is sufficiently diverse. If complexity is low, increasing input RNA or optimizing library prep conditions may help.

Biological Replicates: If possible, analyze multiple biological replicates to see if duplication patterns are consistent across samples. This can help differentiate experimental artifacts from true biological signals.

In summary, if the duplication appears biological (e.g., specific transcripts are overrepresented), you may want to proceed without deduplication. However, optimizing rRNA depletion with PureRec DSN in future experiments can improve library complexity and reduce duplication caused by PCR artifacts.

Hope this helps!

ADD COMMENT
0
Entering edit mode

Dear @jnechacov thank you for joining our community.

We note that all of the posts you have added in the 12 hours since you joined mention a particular product. Note that while there is no community ban on recommending products which you are associated with, associations must be excplicitly disclosed for posts not to be marked as spam.

Secondly, the use of generative AI in composing posts is again, not outside the rules of the community, but posts using generative AI must be specifically useful to the question or risk being deleted as off-topic. Your answer has a structure common to answers from many generative AI tools, such as chatGPT. Your answer here is superficially relevant, so I'll leave it up for now. But be aware that obviously GPT generated answers are not likely to engender confidence in your product recommendations.

ADD REPLY

Login before adding your answer.

Traffic: 1924 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6