I have an RNAseq cohort of mouse tumors which were induced by virus infection. The issue I'm running into is I'm detecting SUPER high levels of sequence duplication and I'm not sure if I should assume that the duplication is biological or an artifact of PCR or something (I know there's no direct answer but I'm not sure how to move forward). For reference we used the NEBnext directional, paired end library prep kit with a ribosomal RNA depletion step. Unfortunately the kit we used did not incorporate UMIs, so we are unable to distinguish between artifacts that way. And also if it's informative we used the maximum amount of RNA as input (1ug) and did the minimum number of PCR cycles as recommended by the kit (which I think was 8). First off when I run FastQC on the samples I'm detecting high duplication levels in some levels and super high levels in others as depicted in the images below.
I was concerned about this but I was also aware that fastQC only takes one fastq file into consideration at a time (not both) so I carried on running cutadapt, followed by STAR. In an effort to get a better understanding of the levels of sequence duplication I ran the markdup command by samtools in one of our worst samples and the output was unfortunately in agreement with fastqc:
COMMAND: samtools markdup -f markdup_stats.txt -d 2500 -@ 5 - GRCm38Dedupe.Aligned.sortedByCoord.out.bam
READ: 268422239
WRITTEN: 268422239
EXCLUDED: 76459218
EXAMINED: 191963021
PAIRED: 191963008
SINGLE: 13
DUPLICATE PAIR: 164180024
DUPLICATE SINGLE: 13
DUPLICATE PAIR OPTICAL: 15364500
DUPLICATE SINGLE OPTICAL: 1
DUPLICATE NON PRIMARY: 0
DUPLICATE NON PRIMARY OPTICAL: 0
DUPLICATE PRIMARY TOTAL: 164180037
DUPLICATE TOTAL: 164180037
ESTIMATED_LIBRARY_SIZE: 13915916
I guess my question is do you have advice as to what I should do moving forward? Should I not do the deduplication because I'm losing so many reads, or do I have to? Is it possible these reads are biological and not just PCR artifacts?
130M reads is fairly typical for a bulk total RNA prep really. You pick up a ton more transcripts without the polyA selection, and if you want to look at alternative splicing or identify transcripts de novo, the additional reads really help.
Thanks for the response! The adaptor content ranges between ~10-30% on the tail end of the reads in the samples. Heres a representative sample that we had:
This is why I began incorporating cutadapt into my workflow for processing these samples. I'm fairly confident that it's not primer dimers as all of the samples were ran on bioanalyzer and I'm working under the impression that a pretty clear peak would show up if we were getting primer dimers. Also, even if there were primer dimers I'm assuming that they would be removed in the cutadapt step. If it helps heres the command I'm running for that:
and the output for one of the samples:
In terms of ribosomal contamination I tried to get this info but I'm not sure if the way I did it works. Basically I tried generating an rRNA bed file from the pre-existing gtf file I had by filtering for entries that contained 'gbkey "rRNA"' (I'm using the GRCm38 reference). Then I used bedtools intersect to find all of the reads in my bam file that overlapped with my bed file:
and then I ran idxstats from samtools to get a summary of the output. From this I only got around 2.8 M reads. If the way I got to this number is wrong let me know.
Assuming that it's not primer dimers or ribosomal contamination, does that just mean the samples are duds? Also does 130 M read pairs just mean that we sequenced at a very high depth? Sorry I still very new to all this, so I'm trying to get a better idea of how everything works. Thank you again for the help!!
Have you looked at the bed file you generated? In my experience, GRCh39 for example does not annotate rRNAs so you might be missing a bunch. Silva is a place to start but doesn't look like it includes all of the 5S isoforms.
Yeah I looked at the BED file, I see a lot of 5S rRNA isoforms, 2 mitochondria rRNAs, and one 28s rRNA annotation. I don't know if I should expect to also see 18s or if thats a precursor that's typically not annotated.
You should absolutely find the annotation for 18S, I like using this figure to see rRNA processing: https://www.researchgate.net/figure/Repeat-unit-of-the-45S-rRNA-gene-tandemly-repeated-in-human-chr-13-14-15-21-and_fig11_38019447
You can also look at the mitochondrial genome and look for an annotation for MT-RNR1/2 corresponding to 18S and 16S respectively to verify
Yeah I have the mt-rnr1 and mt-rnr2 genes in my bed file.
Good analysis. Although he seems to just be guessing that it was 8.
Anyway, RNA-seq is rife with duplicates. To get rid of some of the artificial ones, you can use BBTools:
The specific distance that you use for optical duplicate removal is platform-specific. 50 is fine for NovaSeqX, NextSeq, and HiSeq 2000, 1T, and 2500. Consult the shellscript documentation for other platforms, it's highly variable...
That's generally a safe operation to reduce Illumina-specific artifactual duplicates. You can always get rid of all duplicates with "dedupe" leaving out "optical" but that makes no sense for quantitative RNA-seq experiments.
Doing any kind of PCR amplification on quantitative RNA-seq experiments will bias your results, so it should be avoided if possible, but if you have to do it, you have to do it. You can't remove the artifacts with digital deduplication, though, without compromising your results.