Question

replicates in edgeR and DESeq2

1

7.9 years ago

hibari.kyouya ▴ 10

Hi everybody,

This is my first post here, so first I would like to thank the active people on this forum, because it has been very helpful to me :)

Read counts for 935 cancer cell lines are available here: https://ocg.cancer.gov/ctd2-data-project/translational-genomics-research-institute-quantified-cancer-cell-line-encyclopedia There is just one set of read counts for each cell line.

I took some of these cell lines and split them into two groups, "resistant" and "sensitive". Now I would like to run edgeR or DESeq2 for differential gene expression.

For example, let's say I have 6 cell lines, A, B, C, D, E and F, with their read counts. In the "resistant" group I have A, B and C. In the "sensitive" group I have D, E and F. I would like to use DESeq2 or edgeR to assess differentially expressed genes.

So in this configuration there are no technical replicates, but the 3 different cell lines in each group are treated as replicates. Is it correct to do this?

Thank you very much.

RNA-Seq edgeR cancer cell line encyclopedia DESeq2 • 3.8k views

updated 7.9 years ago by mforde84 ★ 1.4k • written 7.9 years ago by hibari.kyouya ▴ 10

1

In your design, your replicates are the group members. So 3 cell lines in "resistant" group and 3 cell lines in "sensitive".

Whether these cell lines should be treated as replicates is a question that you, the scientist, need to answer. Does it make sense to treat the chosen cell lines as members of the same group? You can also check whether an MDS plot in edgeR groups these "replicates" together.

PS. You don't need technical replicates for analysis in edgeR or DESeq2.

7.9 years ago by Benn 8.3k

0

Also, the classification into resistant and sensitive may not give you what you want. The cell lines you are grouping must overlap in some specific features, and only the genes corresponding to those overlapping features should be taken seriously after the analysis. Even then, it does not feel very comfortable. It's like grouping red apple, strawberry, cherry vs banana, lemon, corn based on their colors.

7.9 years ago by firatuyulur ▴ 320

0

@firatuyulur exactly! But I thought that dispersion estimation is supposed to take that into account...

7.9 years ago by hibari.kyouya ▴ 10

0

I don't remember where exactly in the documentation it says this for DESeq, but you should only calculate the dispersion estimate for conditions with multiple biological replicates. You can still compare a group that has one biological replicate with another group using that estimate; however, the comparison will likely not be very accurate.
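To illustrate why replicates matter here, below is a minimal Python sketch. It is not DESeq's actual estimator (which shares information across genes); it only assumes the negative binomial mean-variance relation var = mu + alpha * mu^2 and solves for alpha by method of moments. The function name and the counts are made up for illustration.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments estimate of the NB dispersion alpha for one gene.

    Assumes var = mu + alpha * mu**2 and counts that are already
    normalized for library size. Returns None when no estimate is possible.
    """
    if len(counts) < 2:
        return None  # a single replicate carries no variance information
    mu = mean(counts)
    if mu == 0:
        return None
    alpha = (variance(counts) - mu) / mu ** 2
    return max(alpha, 0.0)  # clamp: under-dispersed data gives a negative value

resistant = [100, 140, 60]  # hypothetical counts, one gene, three cell lines
single = [100]              # a group with only one replicate

print(moment_dispersion(resistant))
print(moment_dispersion(single))
```

With a single sample the variance term simply cannot be computed, which is why the estimate has to be borrowed from the replicated condition.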

7.9 years ago by mforde84 ★ 1.4k

score 1 · Answer 1 · 2017-01-05

As said by b.nota, technical replicates aren't required. The alternative is to use biological replicates. Then the question is "Are those cell lines biologically the same group?"

But based on your post, I suspect the outcome of your analysis won't be "sensitive vs resistant" but "cell line A" vs "cell line B". I expect the tissue effect to be bigger than the sensitivity effect.

score 1 · Answer 2 · 2017-01-05

A biological replicate would in this instance refer to independent samples of the same cell line. A technical replicate would be a subsampling of the same sample. In the latter case, say you're able to get the alignment files for the cell lines (I believe for CCLE these should be available): if you subsample some percentage of the alignments N times, you would be able to calculate confidence intervals for each cell line.

The way you've currently outlined your analysis, it's questionable whether it's a reliable approach. No offense or anything, but you'll likely have to implement something a little more complex to determine how comparable the cell lines within your groups actually are.
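A rough Python sketch of the subsampling idea follows. As a simplifying assumption it thins a per-gene count vector instead of subsampling the alignment files themselves, and it reports a percentile interval for one gene's counts-per-million. The function names, the 50% fraction, and the counts are all hypothetical.

```python
import random

def thin_counts(counts, fraction, rng):
    """Keep each read independently with probability `fraction`."""
    return [sum(rng.random() < fraction for _ in range(c)) for c in counts]

def cpm_interval(counts, gene, fraction=0.5, n_iter=200, seed=0):
    """2.5% and 97.5% percentiles of one gene's counts-per-million
    across repeated subsamples of a single sample."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_iter):
        sub = thin_counts(counts, fraction, rng)
        total = sum(sub)
        if total > 0:
            values.append(1e6 * sub[gene] / total)
    values.sort()
    lo = values[int(0.025 * (len(values) - 1))]
    hi = values[int(0.975 * (len(values) - 1))]
    return lo, hi

counts = [500, 120, 30, 350]  # hypothetical reads for four genes, one cell line
low, high = cpm_interval(counts, gene=1)
print(low, high)
```

Keep in mind that an interval like this only reflects sampling noise within one sample; it says nothing about biological variability between cell lines.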

Molecular differences between cell lines can be night and day, even when they have similar phenotypes. These differences tend to be even more exaggerated when comparing actual tumors with cell lines. With that being said, a principal component analysis would be useful to determine how comparable the cell lines really are. Essentially, what you're looking at is unsupervised clustering of your samples across the two components that account for the largest percentages of variance across samples. What you want to see is both segregation of your groups and a similar degree of spread within each group.
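For a feel of what the PCA does, here is a small dependency-free Python sketch. It assumes a samples-by-genes matrix of log-expression values, centers each gene, and finds the top two components by power iteration on the sample-by-sample Gram matrix. The data and function names are invented for illustration; in practice you would use the plotting helpers that ship with edgeR or DESeq2.

```python
import math

def top_eigenpairs(sym, k=2, n_iter=500):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    m = [row[:] for row in sym]
    pairs = []
    for comp in range(k):
        v = [math.sin(i + 1 + comp) for i in range(n)]  # deterministic start
        for _ in range(n_iter):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):  # remove the component just found
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs

def pca_scores(log_expr, k=2):
    """Return per-sample scores on the top-k PCs and the variance fractions."""
    n, p = len(log_expr), len(log_expr[0])
    means = [sum(row[g] for row in log_expr) / n for g in range(p)]
    x = [[row[g] - means[g] for g in range(p)] for row in log_expr]
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    total = sum(gram[i][i] for i in range(n))
    pairs = top_eigenpairs(gram, k)
    scores = [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
              for i in range(n)]
    explained = [lam / total for lam, _ in pairs]
    return scores, explained

# Hypothetical log-expression: rows A, B, C (resistant) then D, E, F (sensitive)
log_expr = [
    [8.0, 7.5, 3.0, 5.0], [7.8, 7.9, 3.2, 5.1], [8.3, 7.2, 2.9, 4.8],
    [4.0, 3.5, 3.1, 5.2], [4.2, 3.9, 3.0, 4.9], [3.8, 3.2, 3.3, 5.0],
]
scores, explained = pca_scores(log_expr)
for name, (pc1, pc2) in zip("ABCDEF", scores):
    print(name, round(pc1, 2), round(pc2, 2))
print("variance explained:", [round(e, 3) for e in explained])
```

In this toy data the two groups land on opposite sides of PC1. With real cell lines, if PC1 instead tracks tissue of origin, that is a sign the grouping is confounded.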

For the variability issue, I would bootstrap the DE analysis: e.g., subsample 10 cell lines from each group, estimate dispersion, run the DE analysis, and repeat this 1000 times. For DESeq2 this will likely take some time, so edgeR might be a more suitable option just in terms of speed. Each analysis will give you a list of DE genes ranked by p-value. You can then test how consistent your results are across iterations by doing rank aggregation.
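The loop described above could look roughly like this in Python. To keep the sketch self-contained, the DE step is replaced by a crude stand-in (absolute difference in group means of log-expression), not an edgeR or DESeq2 test, and the aggregation is a plain mean rank, which is only one of several rank aggregation schemes. All names and numbers are hypothetical.

```python
import random
from statistics import mean

def rank_genes(resistant, sensitive):
    """Rank genes (0 = strongest) by absolute difference in group means.

    Stand-in for a real DE test; each sample is a list of log-expression
    values, one per gene.
    """
    n_genes = len(resistant[0])
    effect = [
        abs(mean(s[g] for s in resistant) - mean(s[g] for s in sensitive))
        for g in range(n_genes)
    ]
    order = sorted(range(n_genes), key=lambda g: -effect[g])
    ranks = [0] * n_genes
    for position, g in enumerate(order):
        ranks[g] = position
    return ranks

def aggregate_ranks(resistant, sensitive, k, n_iter=1000, seed=0):
    """Mean rank per gene over repeated subsamples of k cell lines per group."""
    rng = random.Random(seed)
    n_genes = len(resistant[0])
    totals = [0] * n_genes
    for _ in range(n_iter):
        ranks = rank_genes(rng.sample(resistant, k), rng.sample(sensitive, k))
        for g in range(n_genes):
            totals[g] += ranks[g]
    return [t / n_iter for t in totals]

# Hypothetical data: 4 cell lines per group, 3 genes; only gene 0 differs
resistant = [[8.0, 5.0, 3.0], [7.8, 5.2, 3.1], [8.3, 4.9, 2.9], [8.1, 5.1, 3.2]]
sensitive = [[4.0, 5.1, 3.0], [4.2, 4.8, 3.3], [3.8, 5.0, 2.8], [4.1, 5.3, 3.1]]

mean_ranks = aggregate_ranks(resistant, sensitive, k=3)
print(mean_ranks)
```

A gene whose mean rank stays near the top across iterations is robust to which cell lines happen to be in the group; genes whose rank jumps around are probably driven by one or two lines.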