Hi everyone. I have an RNAseq dataset that I had done at a core facility (they did the library prep, sequencing, and initial analysis). At first, I tried to verify some of the interesting DEGs by qPCR but was unable to do so for most of them (could not verify 18 out of 20), either in the same samples I submitted for sequencing or in an additional cohort.
I went back to the raw data, which I did not have access to before, and found what looked like a lot of PCR artifacts from the library prep. This makes me think that many of my DEGs are not actually differentially expressed but only appear different because of the library prep.
My questions are then: 1) Is there any way to confirm PCR artifacts in RNAseq data? 2) How do I deal with this? Looking at the core's analysis, it seems like they performed a deduplication step. I've read that you should not deduplicate RNAseq data... should I redo this analysis?
Thanks for any advice.
Can you clarify "looked like a lot of PCR artifacts from the library prep"? Do you mean you have a lot of duplicates or is it something else? Is it just in one group and not the other?
It's a lot of identical reads. They seem to be in all my experimental groups, but they are not consistent across genes. I'm attaching an example. If I run a deduplication on the data, something like 25% of my reads are thrown out as duplicates.
Example: https://ibb.co/jQ598w
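For anyone wanting to put a number on "a lot of identical reads", here is a minimal sketch of how a duplicate fraction like the 25% above could be computed. It assumes the alignments have already been parsed into (chrom, start, strand) tuples, and it treats reads with identical coordinates as duplicates, which is a simplification of what real deduplication tools do. The function name and the toy reads are hypothetical.

```python
from collections import Counter

def duplicate_fraction(reads):
    """Fraction of reads that share (chrom, start, strand) with an earlier read.

    `reads` is an iterable of (chrom, start, strand) tuples, one per aligned
    read. Assumption: identical coordinates are treated as duplicates.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first at a given position counts as a duplicate.
    duplicates = total - len(counts)
    return duplicates / total

# Toy data (made up): 8 reads, 5 distinct positions.
toy_reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 100, "+"),
    ("chr1", 250, "+"),
    ("chr2", 40, "-"), ("chr2", 40, "-"),
    ("chr2", 90, "-"),
    ("chr3", 10, "+"),
]
print(duplicate_fraction(toy_reads))  # 3 of 8 reads are duplicates: 0.375
```

Note that a highly expressed short gene can produce many reads with identical coordinates without any PCR involvement, so a number like this is only a starting point.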
Are there specific artefacts you're concerned about, or do you think that there are biases in the PCR?
If the latter, there are a number of papers now that have mathematically modelled the biases PCR introduces, and you may be able to run your libraries through their analyses to correct for it.
Biases in the PCR. For example, some samples seem to have PCR duplicates (tons of identical reads) in a particular gene while others don't, so the sample with duplicates comes up as differentially expressed when it's really not.
I've read about deduplication algorithms, but most people seem to say that you should not deduplicate RNAseq data. I'm finding conflicting information on that, though.
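The pattern described above (duplicates piling up in a gene in some samples but not others) could be screened for with something like the sketch below. It assumes each read has already been assigned to a gene and is available as a (gene, start, strand) tuple. The function names, the `min_spread` threshold of 0.5, and the toy samples are all hypothetical, not taken from any particular tool.

```python
from collections import Counter, defaultdict

def per_gene_duplicate_fraction(reads):
    """Return {gene: fraction of reads that repeat an earlier position}."""
    by_gene = defaultdict(Counter)
    for gene, start, strand in reads:
        by_gene[gene][(start, strand)] += 1
    fractions = {}
    for gene, counts in by_gene.items():
        total = sum(counts.values())
        fractions[gene] = (total - len(counts)) / total
    return fractions

def flag_inconsistent_genes(samples, min_spread=0.5):
    """Flag genes whose duplicate fraction differs a lot between samples.

    `samples` maps sample name -> iterable of (gene, start, strand) tuples.
    `min_spread` is an arbitrary illustrative cutoff on (max - min).
    """
    fractions = {name: per_gene_duplicate_fraction(r) for name, r in samples.items()}
    genes = set().union(*(f.keys() for f in fractions.values()))
    flagged = {}
    for gene in genes:
        # A gene with no reads in a sample is treated as having no duplicates.
        values = [f.get(gene, 0.0) for f in fractions.values()]
        spread = max(values) - min(values)
        if spread >= min_spread:
            flagged[gene] = spread
    return flagged

# Toy data (made up): geneX is dominated by one repeated read in sample_a only.
toy_samples = {
    "sample_a": [("geneX", 100, "+")] * 4 + [("geneX", 200, "+")]
                + [("geneY", p, "+") for p in (10, 20, 30)],
    "sample_b": [("geneX", p, "+") for p in (100, 120, 140, 160, 200)]
                + [("geneY", p, "+") for p in (10, 20, 30)],
}
print(flag_inconsistent_genes(toy_samples))
```

Genes flagged this way that also appear in the DEG list would be natural candidates to check first against the qPCR results.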
What's your biological and technical replicate setup? Can you compare between samples that should theoretically be the same and see if the 'duplications' are present in all of them?
I guess it depends on what your definition of a biological/technical replicate is. I have 3 conditions with 5 replicates each. One of the samples was sequenced twice (from the same library prep) as a sequencing technical replicate. The two links below should be identical. To me, it looks like there are identical/duplicated reads in each, but the duplicated reads differ between the two.
https://ibb.co/iqySdw https://ibb.co/moDtJw
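Rather than comparing the two screenshots by eye, the overlap between duplicated positions in the two sequencing runs could be quantified. The sketch below is one way to do that, assuming reads from each run are available as (chrom, start, strand) tuples. It reports the Jaccard overlap of positions that carry more than one read in each run. The function names and toy reads are hypothetical.

```python
from collections import Counter

def duplicated_positions(reads):
    """Set of (chrom, start, strand) positions covered by more than one read."""
    counts = Counter(reads)
    return {pos for pos, n in counts.items() if n > 1}

def duplicate_overlap(run1, run2):
    """Jaccard overlap of duplicated positions between two runs.

    Returns None when neither run has any duplicated position.
    """
    d1 = duplicated_positions(run1)
    d2 = duplicated_positions(run2)
    union = d1 | d2
    if not union:
        return None
    return len(d1 & d2) / len(union)

# Toy data (made up): one duplicated position is shared, two are run-specific.
run1 = [("chr1", 100, "+")] * 3 + [("chr1", 300, "+")] * 2 + [("chr1", 700, "+")]
run2 = [("chr1", 100, "+")] * 2 + [("chr1", 500, "+")] * 2 + [("chr1", 700, "+")]
print(duplicate_overlap(run1, run2))  # 1 shared of 3 duplicated positions
```

A value near 1 would mean the same positions are duplicated in both runs of the library, while a value near 0 would match the impression that the duplicated reads differ between the two runs.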