RNA-Seq: Treat Samples as Biological Replicates?
11.0 years ago
jockbanan ▴ 440

Hi! I've just got some RNA-seq data from a biologist. There are 4 patients, each with tumor tissue and normal tissue, no replicates, so 8 samples altogether.

I don't like not having biological replicates, but my thinking is this: I'm interested in genes differentially expressed in the same way in all 4 tumors. So can't I just treat these data as one experiment with 4 biological replicates? There will probably be much more variance between the samples than there normally is between biological replicates (in this case, I would imagine biological replicates as 2 or more samples from the same (tumor or normal) tissue from the same patient), but still, do you think this is possible? That is, which of these approaches would you suggest?

a) Use my favorite differential expression software and simply use my patients as biological replicates.

b) Compute the fold change in each gene's read count separately for each patient and try to perform the statistical tests myself (possibly some "special" statistics?)

c) Use my favorite differential expression software for each patient separately, as it is not so tragic not to have replicates, and report genes found significant in all 4 patients.

d) Only compute the fold change in each gene's read count separately for each patient, as it is a disaster not to have replicates and my favorite differential expression software would not give me meaningful results, and then pray that subsequent lab tests will find some of the top-scoring genes in all 4 patients interesting.

e) Something else.

Thanks a lot!

rna-seq replicates • 9.6k views

Why can't you interpret your samples as biological replicates? I think they are. Your assumption about increased variability compared to samples from a single patient is possibly correct, and it reduces the power to detect something. However, if you do find some differentially expressed genes, the findings are more transferable. This scenario is definitely better than having only 4 samples from a single patient. So I'd say a) and go ahead (possibly b) too).


That's a completely normal setup for a cancer experiment. There's no need for per-patient replicates because you really aren't interested in finding (just controlling for) per-patient differences. As Michael mentioned, option (a) is the right way to go (in fact, there are likely examples of this sort of analysis in the edgeR or limma vignettes).


Thanks for your answers, that's good news. Now this is much clearer to me.

11.0 years ago

As mentioned above, this is a completely normal scenario and you do in fact have biological replicates. Just wanted to add that you should make sure to use a paired design in the analysis, as the inter-individual variation can be quite large and can obscure the tumor/normal variation unless you control for the individual. You can do this in DESeq, edgeR, limma and probably other packages as well. In the edgeR manual, for example, there is a section "4.4 RNA-Seq of oral carcinomas vs matched normal tissue" which walks you through a paired analysis scenario for cancer vs. matched normal tissue samples.
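To make "paired design" concrete, here is a minimal sketch in plain Python of the design matrix that an additive patient-plus-tissue model (in R formula terms, something like ~ patient + tissue) corresponds to for this experiment. The sample labels P1 to P4 and the function name paired_design are made up for illustration. This is not edgeR, DESeq or limma code, it only shows which columns let the model absorb per-patient differences so that the last column measures the tumor effect.

```python
def paired_design(patients, tissues, reference_tissue="normal"):
    """Build a design matrix for an additive patient + tissue model.

    Columns: intercept, one indicator per patient except the first
    (the baseline patient), and one indicator for tumor tissue.
    """
    patient_levels = sorted(set(patients))
    columns = ["intercept"]
    columns += [f"patient_{p}" for p in patient_levels[1:]]
    columns.append("tumor")

    rows = []
    for patient, tissue in zip(patients, tissues):
        row = [1]  # intercept
        # Patient indicators soak up the inter-individual variation.
        row += [1 if patient == p else 0 for p in patient_levels[1:]]
        # Tissue indicator: the coefficient of interest (tumor vs normal).
        row.append(0 if tissue == reference_tissue else 1)
        rows.append(row)
    return columns, rows


# Hypothetical sample sheet: 4 patients, each with normal and tumor tissue.
patients = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]
tissues = ["normal", "tumor"] * 4

columns, design = paired_design(patients, tissues)
print(columns)
for row in design:
    print(row)
```

With 8 samples and 5 columns, 3 residual degrees of freedom are left for estimating the variance, the same number a paired t-test on 4 pairs would have.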


Thanks a lot for this important note. So Cufflinks won't be the tool of choice this time. I will try edgeR and DESeq (the "multi-factor designs" section of the documentation seems to cover this, am I right?)


Yes, that's right. By the way, SAMSeq (samr package) can also do a paired test but there are likely too few replicates here for that (non-parametric) method, which works better with many biological replicates.
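A rough way to see why rank-based paired tests struggle with 4 pairs: the sketch below computes an exact two-sided Wilcoxon signed-rank p-value for a single gene by enumerating all sign flips. This is not SAMSeq's actual procedure (which resamples and works across all genes), just a generic illustration of the resolution limit, and the example differences are invented. It assumes no zero differences and no ties.

```python
from itertools import product


def signed_rank_pvalue(differences):
    """Exact two-sided Wilcoxon signed-rank p-value via sign-flip enumeration.

    Assumes no zero differences and no ties in absolute value.
    """
    n = len(differences)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(differences[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank

    # W+ is the sum of ranks belonging to positive differences.
    observed = sum(r for r, d in zip(ranks, differences) if d > 0)
    center = sum(ranks) / 2

    # Under the null, every one of the 2^n sign patterns is equally likely.
    extreme = 0
    total = 0
    for signs in product([0, 1], repeat=n):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w_plus - center) >= abs(observed - center):
            extreme += 1
    return extreme / total


# Hypothetical tumor-minus-normal log2 differences, all in the same direction.
four_pairs = [1.2, 0.8, 1.5, 1.3]
eight_pairs = [1.2, 0.8, 1.5, 1.3, 0.9, 1.1, 1.7, 0.6]

print(signed_rank_pvalue(four_pairs))
print(signed_rank_pvalue(eight_pairs))
```

Even when all 4 patients agree perfectly, only 2 of the 16 sign patterns are as extreme as the observed one, so the p-value for a single gene cannot go below 2/16 = 0.125. With 8 pairs the floor drops to 2/256.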

11.0 years ago

Depending on what your replicates are, you will answer different questions.

A) If you have RNA from four different chunks of the same tumour of the same patient you can answer a question like "What genes are differentially expressed in the tumour of this patient?"

B) If you have RNA from four different tumours of the same patient you can answer a question like "What genes are differentially expressed in the tumours of this patient?"

C) If you have RNA from four tumours of different patients (hopefully the same kind of tumour), you can answer a question like "What genes are differentially expressed in this kind of tumour?"

Clearly, C is more informative than B, which is more informative than A. However, due to high heterogeneity in tumours (different tumours of the same patient might be genetically different, and tumours of different patients most definitely are), answering C is much harder than B, which is probably harder than A. If the four patients' tumours are very different, you won't find anything differentially expressed with statistical significance.

11.0 years ago
Michele Busby ★ 2.2k

You can also just do an old-fashioned paired t-test. When we compared the t-test to DESeq, it had roughly the same power with 4 replicates.

http://bioinformatics.oxfordjournals.org/content/29/5/656.short (supplement)

The assumption of a normal distribution doesn't seem to be any worse than the assumption of a negative binomial if you do the normalization correctly (median normalization). A t-test also doesn't introduce the systematic biases that you see when the variance is calculated with information sharing, which makes downstream analyses easier to interpret, especially if you have a high FDR (with information sharing, the false positives are biased toward high-variance genes). Also, you don't seem to need information sharing in the variance calculation when you have 4 reps, because the gene-specific variance is pretty well estimated with 4 points.

Stay away from fold change on its own. Noise in the counts can produce big fold changes in mean expression that don't mean anything.
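For reference, a paired t-test on one gene is short enough to write by hand. The sketch below uses only the Python standard library and assumes the inputs are already normalized, log2-transformed expression values, one per patient and tissue (the numbers are invented). The p-value helper uses the closed form of the Student t distribution for exactly 3 degrees of freedom, which is what 4 pairs give you. For any other number of pairs you would want a proper statistics library.

```python
import math
from statistics import mean, stdev


def paired_t(tumor, normal):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    if len(tumor) != len(normal) or len(tumor) < 2:
        raise ValueError("need at least two matched tumor/normal pairs")
    # Working on within-patient differences removes the patient effect.
    diffs = [t - n for t, n in zip(tumor, normal)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


def two_sided_p_df3(t_stat):
    """Two-sided p-value for a t statistic with exactly 3 degrees of freedom."""
    x = abs(t_stat) / math.sqrt(3)
    # Closed-form CDF of the Student t distribution, valid only for df = 3.
    cdf = 0.5 + (x / (1 + x * x) + math.atan(x)) / math.pi
    return 2 * (1 - cdf)


# Hypothetical normalized log2 expression for one gene in 4 patients.
tumor = [8.1, 7.6, 9.0, 8.4]
normal = [6.9, 6.8, 7.5, 7.1]

t_stat, df = paired_t(tumor, normal)
p = two_sided_p_df3(t_stat)
print(f"t = {t_stat:.2f}, df = {df}, p = {p:.4f}")
```

In a real analysis this would run once per gene, followed by a multiple-testing correction across all genes.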


I agree for the most part, but I think fold-change can be useful if you use a minimum threshold for RPKM values. However, I do agree low-coverage genes can be very problematic for fold-change values. I also agree fold-change shouldn't be used without p-value / FDR calculation as an additional filter.

I would use log2(RPKM + 0.1) for analysis, but you can see the details in this paper (if you want, it is actually the same paper described in the blog post):

http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html
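As a small illustration of the log2(RPKM + 0.1) transform and a minimum-expression threshold, here is a sketch in plain Python. The threshold of 1 RPKM and the example values are placeholders chosen for the demo, not recommendations taken from the paper.

```python
import math


def log2_fold_change(tumor_rpkm, normal_rpkm, pseudocount=0.1):
    """log2 fold change on log2(RPKM + pseudocount) values."""
    return math.log2(tumor_rpkm + pseudocount) - math.log2(normal_rpkm + pseudocount)


def passes_expression_filter(tumor_rpkm, normal_rpkm, min_rpkm=1.0):
    """Keep a gene only if it reaches min_rpkm in at least one condition.

    min_rpkm = 1.0 is a hypothetical cutoff used for illustration.
    """
    return max(tumor_rpkm, normal_rpkm) >= min_rpkm


# Hypothetical genes: (name, tumor RPKM, normal RPKM).
genes = [
    ("low_coverage_gene", 0.30, 0.02),
    ("well_expressed_gene", 40.0, 20.0),
]

for name, tumor_rpkm, normal_rpkm in genes:
    lfc = log2_fold_change(tumor_rpkm, normal_rpkm)
    kept = passes_expression_filter(tumor_rpkm, normal_rpkm)
    print(f"{name}: log2FC = {lfc:.2f}, kept = {kept}")
```

The low-coverage gene shows the larger fold change even though it rests on a handful of reads, which is why the threshold and a p-value / FDR filter are applied alongside the fold change.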

11.0 years ago

I think you should definitely not ignore an opportunity to take biological variability into consideration. 4 samples per group is kind of small for a patient cohort, but it is still better than losing the replicates entirely.

In general, I think this is a good paper describing the value of having replicates:

http://www.ncbi.nlm.nih.gov/pubmed/24319002

I think you've already got some decent advice about the differential expression part, but here is another link with references that you can check out:

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

In fact, the paper that I describe in that post considers tumor versus normal expression in paired samples (but with a larger sample size).
