Question

How much can paired-end vs. single-end RNA-seq reads influence expression level quantification?

2

9.5 years ago

jack ▴ 980

I have RNA-seq data from an experiment on several biological samples.

I need to quantify gene expression from these read files and then run a differential gene expression analysis.

However, the experimental design is a bit unusual, and I'm wondering how much this affects the differential gene expression results.

The problem is that the reads for condition A are paired-end, while the reads for condition B are single-end. Moreover, for condition D, some biological replicates are single-end and others are paired-end. All the data come from the same lab and the same sequencing platform.

I need to do DEG analysis for A vs. B, A vs. C and C vs. B.

Does anyone have an idea how much this can affect expression level quantification and the DEG analysis?

alignment RNA-Seq next-gen-sequencing • 7.7k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by jack ▴ 980

3

You don't say, but did they also use the same library prep protocol? If not, you're sunk.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Michele Busby ★ 2.2k

2

One simple solution would be to use only single-end reads, i.e. treat the paired-end data as single-end and discard the pairing. For GEX (as opposed to transcript-level/isoform analysis), there is plenty of data to suggest that 1x50bp reads are sufficient.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by scottbrouilette ▴ 50

Ram · Answer 1 · 2015-05-13

It's not ideal, but if you ensure that D is included in the input counts, then you can model and compensate for the effect of paired- vs. single-end. D is the only condition with both library types, so it is what lets the library-type effect be estimated separately from the condition effect. You could also use ComBat (from the sva package).

How large an effect you'll see is always a bit hard to predict. I think the DESeq (or DESeq2) documentation uses an example dataset that has both SE and PE samples to show how to use more complex models. As I recall, you definitely get a difference due to the sequencing type alone. So definitely don't read too much into an A vs. B comparison if you can't compensate for the PE vs. SE difference.
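To illustrate why D matters here, below is a minimal sketch (plain Python, not DESeq2 itself) that builds a `condition + library type` design matrix and checks whether it has full column rank. The sample layout and the helper names `design_matrix`, `matrix_rank` and `is_estimable` are made up for illustration. With only A (all PE) and B (all SE), library type is perfectly confounded with condition, so the two effects cannot be separated.

```python
from fractions import Fraction


def design_matrix(samples):
    """Rows of [intercept, condition dummies..., is_paired] per sample.

    `samples` is a list of (condition, library_type) tuples; the
    alphabetically first condition is used as the reference level.
    """
    conditions = sorted({cond for cond, _ in samples})
    rows = []
    for cond, lib in samples:
        dummies = [1 if cond == c else 0 for c in conditions[1:]]
        rows.append([1] + dummies + [1 if lib == "PE" else 0])
    return rows


def matrix_rank(rows):
    """Rank by Gaussian elimination, using exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def is_estimable(samples):
    """True if condition and library-type effects can both be estimated."""
    x = design_matrix(samples)
    return matrix_rank(x) == len(x[0])


# Hypothetical layouts: three replicates each for A and B, two for D.
a_and_b_only = [("A", "PE")] * 3 + [("B", "SE")] * 3
with_d = a_and_b_only + [("D", "SE"), ("D", "PE")]

print(is_estimable(a_and_b_only))  # False: library type is confounded with condition
print(is_estimable(with_d))        # True: D separates the two effects
```

In a real analysis the same idea corresponds to adding library type as a term in the design formula of whatever DE tool you use; the rank check only tells you whether that model can be fitted at all.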

Ram · Answer 2 · 2015-05-13

1

9.5 years ago

Irsan ★ 7.8k

Read pairing has very little influence on differential gene expression analysis (STAR + HTSeq + edgeR/DESeq/...). I had paired-end RNA-seq data for ~50 samples and analyzed the dataset in both single-end and paired-end mode. The Pearson correlation coefficient between the single-end and paired-end transcriptome-wide count data was >0.95 for all 50 samples. So my suggestion would be to run all your data in single-end mode, since the drop in quality will be negligible and you will be sure you can compare everything.
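The per-sample check described above can be sketched roughly as follows. This is an illustration in plain Python, not the code used for the 50 samples; the count vectors are invented, and the `log2(count + 1)` transform is an optional extra assumed here because correlations on raw counts tend to be dominated by a few highly expressed genes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def log_counts(counts):
    """log2(count + 1), so that zero counts stay defined."""
    return [math.log2(c + 1) for c in counts]


# Invented per-gene counts for one sample, quantified in both modes.
se_counts = [0, 12, 150, 980, 43, 7]
pe_counts = [0, 10, 163, 1011, 39, 9]

r_raw = pearson(se_counts, pe_counts)
r_log = pearson(log_counts(se_counts), log_counts(pe_counts))
print(f"raw counts: r = {r_raw:.3f}, log counts: r = {r_log:.3f}")
```

Repeating this for every sample gives one coefficient per sample, which is the kind of summary quoted above.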

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by Irsan ★ 7.8k

0

What if I use paired-end mode for the samples that are paired-end and single-end mode for the samples that are single-end? Are the results then comparable?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by jack ▴ 980

0

If you have no option to re-process the raw data and analyze all of it in single-end mode, then use the library type as a covariate, as suggested by Devon Ryan.

ADD REPLY • link 9.5 years ago by Irsan ★ 7.8k

Ram · Answer 3 · 2015-05-13

If you are just mapping to a reference genome / transcriptome, probably the biggest difference would be if you are trying to estimate "isoform" expression, as paired reads give greater discrimination when mapping to different isoforms. Anyway, I would use just PE1 for all samples; it is the simplest way to get rid of the "obvious" bias. And, depending on the quality of your PE2 reads, which tends to be lower than PE1 and sometimes much worse, you may get better results with just PE1.

P. S.: there are also "non-obvious" biases, e.g., as single-end and paired-end libraries use different insert sizes, you may have a bias at this step of the process.

edit:

Thinking a bit more about it, your condition D could help you decide: you can compare your single-end and paired-end samples and check whether the variability within PE and within SE is the same as, or different from, the variability between SE and PE.
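One rough way to run that check on the D replicates is to compare the average pairwise distance between samples of the same library type with the average distance between samples of different types. The sketch below assumes log-scale expression vectors per sample and uses Euclidean distance; the sample names, values and the `within_between` helper are all hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def within_between(profiles, library_type):
    """Mean pairwise distance within and between library types.

    `profiles` maps sample name -> expression vector (e.g. log2 counts);
    `library_type` maps sample name -> "SE" or "PE".
    """
    within, between = [], []
    for s1, s2 in combinations(sorted(profiles), 2):
        d = euclidean(profiles[s1], profiles[s2])
        if library_type[s1] == library_type[s2]:
            within.append(d)
        else:
            between.append(d)
    return _mean(within), _mean(between)


# Invented log2 expression values for four replicates of condition D.
profiles = {
    "D1": [5.1, 8.0, 2.3],
    "D2": [5.0, 8.2, 2.1],
    "D3": [5.6, 7.4, 3.0],
    "D4": [5.7, 7.5, 2.9],
}
library_type = {"D1": "SE", "D2": "SE", "D3": "PE", "D4": "PE"}

w, b = within_between(profiles, library_type)
print(f"within: {w:.2f}, between: {b:.2f}")
```

If the between-type distance is clearly larger than the within-type distance, that suggests a library-type effect worth modelling; if the two are similar, the SE/PE difference is probably small relative to replicate noise.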