Hi,
I am reanalyzing some old RNA-seq data (single-end, 50 bp reads). I aligned the reads with STAR, removed duplicates with Picard MarkDuplicates, and used htseq-count to get the reads per gene.
I noticed that I am getting lower mitochondrial (0.15%-0.25%) and ribosomal (0.6%-0.9%) fractions than in another analysis of an almost identical cell line that used paired-end 150 bp reads (0.5%-1.5% mitochondrial; 3%-8% ribosomal). Both analyses use the same reference genome and annotation.
I was wondering whether these differences are a known effect of the sequencing technology used, or whether they reflect biological differences between the cell lines.
Thanks,
I would guess it reflects more of a difference in library prep rather than something biological or due to sequencing method.
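Longer reads can increase mapping to repetitive regions, and paired-end reads tend to have fewer duplicates removed, because both mates have to match for a pair to be flagged rather than just a single start position. Those factors might increase the percentage of mt and rRNA reads in the paired-end data.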
In general, I believe common practice is not to remove duplicates from RNA-seq data unless there is a specific reason to.
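To illustrate why position-based duplicate removal can hit highly expressed genes so hard, here is a rough toy model in Python. It assumes read starts are spread uniformly over a gene, which real data does not do, and the function name `expected_unique_reads` and all the numbers are made up for the example. The idea is only that a single-end read has far fewer distinct "signatures" (start positions) than a read pair (start position combined with fragment length), so deep coverage saturates much sooner.

```python
def expected_unique_reads(n_reads: int, n_signatures: int) -> float:
    """Expected number of reads kept after position-based deduplication.

    Toy model (an assumption, not how any specific tool scores duplicates):
    each read picks one of `n_signatures` possible signatures uniformly at
    random, and only one read per signature survives.
    """
    if n_reads < 0 or n_signatures <= 0:
        raise ValueError("n_reads must be >= 0 and n_signatures must be > 0")
    # Probability that a given signature is hit at least once, times the
    # number of signatures.
    return n_signatures * (1.0 - (1.0 - 1.0 / n_signatures) ** n_reads)


# Hypothetical highly expressed 1,000 bp gene with 100,000 reads.
n_reads = 100_000
single_end = expected_unique_reads(n_reads, 1_000)        # start position only
paired_end = expected_unique_reads(n_reads, 1_000 * 200)  # start x 200 fragment lengths

print(f"single-end kept: {single_end / n_reads:.1%}")
print(f"paired-end kept: {paired_end / n_reads:.1%}")
```

Under these made-up numbers almost all single-end reads on the gene are discarded as duplicates, while most paired-end reads survive. Short, very highly expressed genes (mitochondrial and ribosomal ones are typical candidates) would be the most affected.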
I think you are very much right. I removed duplicates on autopilot because I am used to paired-end sequencing with UMIs, which let you get rid of PCR duplicates.
Without removing duplicates from this data, I am getting 2.5%-7% mitochondrial and 2%-6.5% ribosomal fractions.
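In case it is useful to anyone comparing numbers, this is roughly how such fractions can be computed from an htseq-count output file. It is a minimal sketch with assumptions: the denominator is all reads assigned to a gene (the special `__` summary lines such as `__no_feature` are skipped), and the gene IDs and gene sets below are placeholders, so you would need to build the real sets from your own annotation. The function names are mine, not part of HTSeq.

```python
import io
from typing import Dict, Iterable, TextIO


def read_htseq_counts(handle: TextIO) -> Dict[str, int]:
    """Parse htseq-count output (gene_id<TAB>count), skipping '__' summary rows."""
    counts: Dict[str, int] = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_id, value = line.split("\t")
        if gene_id.startswith("__"):  # e.g. __no_feature, __ambiguous
            continue
        counts[gene_id] = int(value)
    return counts


def gene_set_fraction(counts: Dict[str, int], gene_ids: Iterable[str]) -> float:
    """Fraction of gene-assigned reads that fall in the given gene set."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts.get(gene_id, 0) for gene_id in gene_ids) / total


# Toy input with placeholder gene IDs; a real file would be opened with open(path).
toy = io.StringIO(
    "MT-CO1\t50\n"
    "RPL3\t150\n"
    "GAPDH\t800\n"
    "__no_feature\t500\n"
)
counts = read_htseq_counts(toy)
mt_fraction = gene_set_fraction(counts, {"MT-CO1"})
ribo_fraction = gene_set_fraction(counts, {"RPL3"})
print(f"mitochondrial: {mt_fraction:.2%}, ribosomal: {ribo_fraction:.2%}")
```

If the two analyses used different denominators (for example all mapped reads versus gene-assigned reads) or different gene sets for "ribosomal", the percentages would not be directly comparable, so it is worth checking that as well.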