6.5 years ago
Biologist
Hi,
I have raw count data for two prostate cancer cell lines and want to find the genes that are differentially expressed between them. Is a simple t-test fine, or should I use edgeR/DESeq2?
If a t-test is fine, I want to make a violin plot. Do I need to convert the raw counts to logCPM or RPKM for that?
Is this single-cell data, or regular "bulk" RNAseq data? If regular RNAseq, go with edgeR or DESeq2, if single cell, you have to look for a single-cell analysis pipeline.
Do not use a t-test.
OK, but it is only one cell line versus another cell line, right? Is it possible to use edgeR or DESeq2 for that? If yes, how?
Look in the vignette for edgeR, specifically chapter 4 (page 39): https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
Hi, I followed the tutorial. But there is an error.
@h.mon I want to check expression of specific genes between those two cell lines. Just to know which gene is more expressed in which cell line. Why not use a t-test?
It appears that you are just comparing 1 sample versus another (you have not mentioned anything about replicates), in which case you cannot faithfully conduct any type of statistical comparison. Your best bet is to calculate ratios for each gene in one sample versus the other.
Thanks Kevin. In the Nature paper, a t-test is used between basal and non-basal cell lines; please check Figure 1C and its legend. They have just done a Student's t-test to compare expression between basal and non-basal.
Similarly, in my analysis I have only two prostate cell lines and want to check which genes are more highly expressed in which cell line. Do you think a t-test is fine for finding which genes differ significantly between the two cell lines? Or should I calculate ratios for each gene? How would I do that? I have raw count data like the following:
Thanks for pointing that out. I note that they are doing a t-test using RPKM, which is incorrect. However, that's in the Nature journal series, the publications of which frequently contain improper statistical methods.
If you are going to use a t-test for your data, then at least normalise to CPM, not FPKM or RPKM.
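As a rough illustration of what CPM normalisation does, here is a minimal Python sketch. The gene names and counts are made up for illustration and are not the data from this thread. It only divides by the raw library size; edgeR's cpm() function additionally accounts for normalisation factors and, with log=TRUE, a prior count, so treat this as the basic arithmetic rather than a replacement for it.

```python
# Toy raw counts (made-up numbers, not from the thread): gene -> (line1, line2)
counts = {
    "GENE_A": (500, 1000),
    "GENE_B": (1500, 1000),
    "GENE_C": (8000, 18000),
}


def cpm(count_table):
    """Scale each sample's counts to counts per million of its library size."""
    n_samples = len(next(iter(count_table.values())))
    # Library size = total counts in each sample (column sum)
    lib_sizes = [sum(v[i] for v in count_table.values()) for i in range(n_samples)]
    return {
        gene: tuple(c / lib * 1e6 for c, lib in zip(vals, lib_sizes))
        for gene, vals in count_table.items()
    }


cpm_values = cpm(counts)
for gene, vals in cpm_values.items():
    print(gene, [round(v, 1) for v in vals])
```

Note that GENE_A has twice the raw count in the second sample but the same CPM in both, because the second library is twice as deep. This is why raw counts should not be compared directly.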
A small request for help. First I converted the count data to logCPM. Let's say the above table is in a data frame, tin2. So now I have genes as rows and the two cell lines as columns, with logCPM values. From this data I want to make a violin plot for each gene showing the significance between the two cell lines. Could you please help me with how to do it?
You can generate a violin plot for each gene but it would not look normal. It would just contain a single line to represent the median, or just a 'blob' of a value - you only have 2 samples.
Getting back to Figure 1C in the manuscript that you mentioned, the legend states:
Those guys have 26 basal samples and 20 non-basal. So, their plot is plotting 26 values versus 20, and that's also what they compare with their t-test.
As you only have 2 samples, you cannot generate the same plot - sorry. Your best hope is to just derive the ratio for each gene between both cell-lines that you have.
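A minimal sketch of the per-gene ratio idea, in Python, assuming you already have CPM values for both cell lines. The values, the gene names and the pseudocount of 1 are all hypothetical choices for illustration; the pseudocount is only there so that a gene with zero CPM does not produce log2(0).

```python
import math

# Hypothetical CPM values (illustrative only): gene -> (line1, line2)
cpm_values = {
    "GENE_A": (50.0, 50.0),
    "GENE_B": (150.0, 50.0),
    "GENE_C": (0.0, 12.0),
}

PSEUDO = 1.0  # assumed offset to avoid taking the log of zero


def log2_ratio(line1, line2, pseudo=PSEUDO):
    """Return log2(line2 / line1); positive means higher in line2."""
    return math.log2((line2 + pseudo) / (line1 + pseudo))


ratios = {gene: log2_ratio(a, b) for gene, (a, b) in cpm_values.items()}

# Rank genes by absolute log2 ratio, largest difference first
ranked = sorted(ratios, key=lambda g: abs(ratios[g]), reverse=True)
for gene in ranked:
    print(gene, round(ratios[gene], 2))
```

These ratios carry no p-value. With one sample per cell line they are a ranking to guide follow-up experiments, nothing more.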
Hi Kevin,
Thank you. In the edgeR user guide I saw section 2.11, which can be followed if there are no replicates.
So, from the results, how can I tell which gene is upregulated in which cell line?
Okay, but in which world is it a robust study to compare just 1 sample versus the other? Think of this scenario: you manage to get a replicate for each of your cell lines and then repeat the analysis (now 2 versus 2) and then find that virtually all of your fold changes have flipped direction. What then? You then get a third replicate, repeat it, and find that one of your original samples behaves like an outlier, and many of your fold-changes have again flipped direction. What then?
With just 2 samples, you certainly:
You should, at minimum, aim for 3 versus 3.
Thanks!
I see your point here. But I have only two prostate cell lines. To knock down some particular genes in one of the cell lines, I first have to know in which cell line those genes are highly expressed, so that it will be easier for me to select that cell line and knock down those genes. This was my idea for some experiments. So I got the CCLE data for two prostate cell lines and thought of doing the analysis as above.
I see. In that case, the best that you can do is follow that tutorial by the edgeR authors and just be aware of the limitations. The genes with the largest absolute fold-change differences should reflect the ones in which you will be interested.
You may also consider transforming your logCPM data to the Z-scale and then taking transcripts that have absolute Z-score >3 or >4 in either cell line. Hopefully, the fold-change results and those of the Z-scores will match on many genes.
Thanks!
Sure, I will follow this. So, in my edgeR results above, the genes with positive logFC are upregulated in prostate cell line 1 and the genes with negative logFC are upregulated in prostate cell line 2. Am I right?
And, as you said, I first transformed the counts data given above into logCPM, as below:
From this I transformed to the Z-scale.
Did I go wrong somewhere?
Hey, regarding the Z-scores, that's just a consequence of having too few dimensions in your data - apologies for suggesting that idea.
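To make that concrete, here is a small Python sketch, assuming the Z-transformation was applied per gene across the two cell lines (that is my reading of the problem; the numbers are made up). With only two values per gene, the sample standard deviation is fixed by the difference between them, so every gene ends up with the same pair of Z-scores regardless of how different the two values are.

```python
import statistics


def z_scores(values):
    """Z-scale a list of values using the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Three made-up genes with very different gaps between the two cell lines
pairs = [(1.0, 9.0), (5.2, 5.3), (12.0, 3.0)]
results = [z_scores(list(pair)) for pair in pairs]

for pair, zs in zip(pairs, results):
    print(pair, [round(z, 3) for z in zs])
```

Each gene comes out at roughly -0.707 and +0.707 (the sign follows whichever cell line is higher), so a cut-off of 3 or 4 can never be reached and the Z-scores carry no information beyond direction.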
Regarding the fold changes, it will depend on the direction of comparison. A useful way to check is to just look at your normalised counts and contrast them to the fold-change directions for the purpose of inferring this.
Ok. please see this example.
logCPM of MIR137HG
The expression is high in Prostrate_Cell-line1.
And differential analysis between both cell-lines gave -0.9696249 logFC for MIR137HG [the value is negative because DEA is Prostrate_Cell-line2 vs Prostrate_Cell-line1. If I do Prostrate_Cell-line1 vs Prostrate_Cell-line2 the value will be positive]
Is this the way you told me above?
In that case, it actually looks like it is Cell-Line2 / Cell-Line1, and that Cell-Line 1 is the 'reference' level.
The fold-change indicates that cell-line 2 has fractionally less expression.
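For a feel of the size of that difference, the logFC can be converted back to a plain ratio. A quick Python sketch, using the MIR137HG value quoted above and relying on edgeR reporting fold changes on the log2 scale:

```python
# logFC reported in this thread for MIR137HG (cell line 2 versus cell line 1)
log_fc = -0.9696249

# edgeR logFC values are log2 fold changes, so undo the log with 2 ** x
fold = 2 ** log_fc
print(round(fold, 2))  # about 0.51: cell line 2 is at roughly half of cell line 1

# Reversing the direction of the comparison negates the logFC
flipped = 2 ** (-log_fc)
print(round(flipped, 2))  # the same difference, expressed as line 1 over line 2
```

So a logFC close to -1 means cell line 2 has about half the expression of cell line 1 for this gene, which agrees with the logCPM values showing higher expression in cell line 1.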
You may want to check other genes with large fold-changes just to confirm. This can be changed where you have stored your group variable with:
Sorry, could you please clarify one thing in the above analysis.
Summary shows
1+2
6 genes were upregulated in which cell line?
I assume upregulated in cell-line 2. Just check the expression of each gene though, to double-check.
Sure, I will look into that. Thank you.
Good luck with it and keep me updated!