Differential expression analysis with very high duplication rate
9 months ago
JMB ▴ 20

Hi,

I am doing a DE analysis on an RNA-seq dataset and have a question about PCR duplicates. The organism is a bacterium with a small genome and the libraries were over-sequenced, resulting in duplication levels >90% (from Picard; sequencing was paired-end). I know the general consensus is not to remove PCR duplicates for DE analysis. I have read a lot of posts about this, but can't really find comparable cases where the duplication rate is this high. I have concerns about the validity of the analysis if almost all the data come from duplicates. If I remove duplicates, I would still have plenty of data to do DE analysis. I am hoping to get some feedback on whether to proceed with removing duplicates, since the rate is so high, or whether it would still be better to leave them in. Thank you!

RNA-seq DE Picard PCR duplicates • 948 views

The organism is a bacterium with a small genome and the libraries were over-sequenced,

What kind of coverage do you have now? Is it in the hundreds of fold? If you over-sequenced the libraries, then the better option may be to randomly downsample the data so you end up with 25-30x coverage. Perhaps do two or three independently sampled sets to avoid any kind of bias. Then see what you end up with.

You can use reformat.sh from the BBMap suite or any other software of your choice.
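For example, here is a minimal sketch of driving reformat.sh from Python to draw a few independent subsamples. The file names, the 10% sampling rate, and the seeds are placeholders; double-check the option names against reformat.sh --help for your BBMap version.

    import subprocess

    # Draw three independent ~10% subsamples of a paired-end library with reformat.sh.
    # A different sampleseed= per run makes each subsample reproducible and independent.
    for seed in (11, 22, 33):
        subprocess.run(
            [
                "reformat.sh",
                "in=sample_R1.fastq.gz",      # placeholder input pair
                "in2=sample_R2.fastq.gz",
                f"out=sub{seed}_R1.fastq.gz",
                f"out2=sub{seed}_R2.fastq.gz",
                "samplerate=0.1",             # adjust so you land near 25-30x coverage
                f"sampleseed={seed}",
            ],
            check=True,
        )

Check the read counts and estimated coverage of each subsample before committing to one rate.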


^Yes, I second this actually.


Thanks so much for your suggestions. I don't have the data in front of me, but coverage is definitely very high (certainly in the hundreds). Regarding downsampling, I did think about that as an option. However, wouldn't this lead to the same problem? If I randomly sampled from a set of reads that are 90% duplicates, wouldn't the resulting sample also be expected to be 90% duplicates? On the other hand, if I removed duplicates this would selectively downsample, so I wouldn't have that problem. I will also run the analysis both ways, but welcome any other thoughts or suggestions. Thanks again.


I don't think the problem of 90% read duplicates would go away entirely, but there is some hope that the percentage will drop if you try random sampling.

I assume you don't have UMIs, so you can't really say with certainty that these are all PCR duplicates. If the problem is the result of over-amplification of low-input RNA, then no amount of informatic wrangling is going to fix the issue.

At this point you will only lose time, so perhaps try sampling and de-duplicating, as @dsull recommended.
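To see why the observed percentage can drop under random sampling, here is a toy simulation (pure Python, with a made-up copy-number distribution, not your data): a retained read only looks like a duplicate if another copy of the same fragment also survives the sampling, so the apparent duplicate fraction falls as you sample more aggressively.

    import random

    random.seed(0)

    # Toy library: 100,000 unique fragments, each amplified into a made-up number of
    # copies chosen so that roughly 90% of reads are duplicates before any sampling.
    copies = [random.choice([1, 5, 10, 20]) for _ in range(100_000)]

    def dup_fraction(copy_counts):
        # Duplicates = total reads minus the number of distinct fragments observed.
        reads = sum(copy_counts)
        uniques = sum(1 for c in copy_counts if c > 0)
        return (reads - uniques) / reads

    def downsample(copy_counts, p):
        # Keep each individual read independently with probability p.
        return [sum(random.random() < p for _ in range(c)) for c in copy_counts]

    print("before sampling:", round(dup_fraction(copies), 3))
    for p in (0.5, 0.2, 0.05):
        sub = downsample(copies, p)
        print(f"keep {p:.0%} of reads:", round(dup_fraction(sub), 3))

With these made-up numbers the apparent duplication drops well below the starting ~90% as the sampling rate decreases; the exact values depend entirely on the real copy-number distribution in your library.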

9 months ago
dsull ★ 6.9k

Why not try both? Your results may look like nonsense in one case but not in the other. I don't know how your library was prepared or what your data look like, but this is something I recommend (though it is almost never done): do both and report both in your paper. If you get a crappy result with one method, still report it so people know not to do it.

Also, I doubt that many reads came from the exact same position along each transcript, so deduplication should be fine. I've deduplicated RNA-seq reads before.
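If it helps, here is a minimal sketch of removing (rather than just marking) duplicates with the same Picard tool that produced your metrics, driven from Python over several samples. The sample names, BAM paths, and picard.jar location are placeholders, and it assumes coordinate-sorted BAMs.

    import subprocess

    # Remove duplicates from each coordinate-sorted BAM with Picard MarkDuplicates,
    # then count reads from the deduplicated BAMs as usual.
    samples = ["condA_rep1", "condA_rep2", "condB_rep1", "condB_rep2"]  # placeholder names
    for s in samples:
        subprocess.run(
            [
                "java", "-jar", "picard.jar", "MarkDuplicates",
                f"I={s}.sorted.bam",
                f"O={s}.dedup.bam",
                f"M={s}.dup_metrics.txt",
                "REMOVE_DUPLICATES=true",   # drop duplicates instead of only flagging them
            ],
            check=True,
        )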


Just as an update for anybody who is interested: I ran the analysis with duplicates included and with duplicates removed and got very similar results, despite duplication rates of 90-95%. I will probably go with the duplicates-removed analysis just because I feel better about it, but it is good to know that even with very high duplication rates, the results were not misleading.
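For anyone who wants to run the same kind of comparison, here is a small sketch of one way to do it with pandas. The file names and the log2FoldChange/padj column names are assumptions based on a DESeq2-style results table, not the exact files used here; adjust them for your DE tool's output.

    import pandas as pd

    # Load DE result tables from the two runs, indexed by gene ID (placeholder file names).
    with_dups = pd.read_csv("de_with_duplicates.csv", index_col=0)
    no_dups = pd.read_csv("de_duplicates_removed.csv", index_col=0)

    # Compare log fold changes gene-by-gene on the shared genes.
    shared = with_dups.index.intersection(no_dups.index)
    lfc_corr = with_dups.loc[shared, "log2FoldChange"].corr(
        no_dups.loc[shared, "log2FoldChange"], method="spearman"
    )

    # Overlap of significant genes at an adjusted-p cutoff of 0.05.
    sig_with = set(with_dups.index[with_dups["padj"] < 0.05])
    sig_without = set(no_dups.index[no_dups["padj"] < 0.05])
    union = sig_with | sig_without
    jaccard = len(sig_with & sig_without) / len(union) if union else float("nan")

    print(f"Spearman correlation of log2FC: {lfc_corr:.3f}")
    print(f"Jaccard overlap of significant genes: {jaccard:.3f}")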
