I have a tabular dataset containing counts, which looks like this:
ID    Gene  Ctrl_1  Ctrl_2  Ctrl_3  TMZ_1  TMZ_2  ...  Hyp_3
1     AAA   200     300     212     231    123    ...  231
2     BBB   100     231     3       123    99     ...  88
...
5000  ZZZ   30      999     54      3      2      ...  12
Each row represents a particular gene that is silenced. The numbers in the remaining columns represent how many cells are still alive after waiting 10 days in each of the three conditions "Control", "Temozolomide" and "Hypoxia", when a given gene was silenced. For example: when gene AAA was silenced, ten days later there were 231 cells that were still alive in the one sample treated with temozolomide (TMZ_1).
The aim is to find the genes that can be silenced such that there is a significant reduction in the number of living cells, from control to treatment.
My question is, what statistical test should one use to compare these results? I think that Table 1 in this article (http://www.ncbi.nlm.nih.gov/pubmed/19644458) might be helpful, but I don't know which to pick. Can anyone help?
(Edited for clarity)
A few more details might be helpful. Firstly, was the original number of cells in the dish identical (or at least quite close) in each of the samples? Secondly, do you have any single experiment with more than 3 samples? Finally, do you just want the pair-wise comparison or do you need something else (I'm too lazy to look up exactly what Temozolomide does to see how it relates to hypoxia)?
The question is what sort of variance distribution this sort of experiment follows. I hadn't replied to this earlier as I had rather hoped that perhaps someone else already knew that. It could be that a negative-binomial distribution represents the variance properly (in which case applying glm.nb would work). Alternatively, perhaps just dividing by the initial cell count produces a nearly normal distribution, in which case a t-test would work. If you have a single shRNA experiment with more samples, then you can figure out what sort of distribution fits it properly.
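As a rough illustration of the second option, here is a minimal sketch that divides each final count by an initial count and computes Welch's t statistic for control versus treatment. The control counts are borrowed from the AAA row of the example table; the initial counts, the treated counts, and the helper names (survival_fractions, welch_t) are assumptions made up for illustration. With only three replicates per group, any such test has very little power.

```python
from math import sqrt
from statistics import mean, variance


def survival_fractions(final_counts, initial_counts):
    """Divide each final count by the matching initial count."""
    return [f / i for f, i in zip(final_counts, initial_counts)]


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Assumed initial counts (invented; the thread does not report them).
initial = [1000, 1000, 1000]
# Control counts from the AAA row; treated counts are invented.
ctrl = survival_fractions([200, 300, 212], initial)
treated = survival_fractions([90, 120, 105], initial)

t, df = welch_t(ctrl, treated)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Turning t and df into a p-value needs the t distribution, which the Python standard library does not provide; a statistics package would be used for that step.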
OK, good to know, then, as that will change how we think about the numbers. You mentioned that these are effectively normalized counts. Did the normalization procedure just happen to result in integer values in the examples you show or are they always integer? If the latter, could you explain how the normalization was done? I ask because there are different methods used to deal with purely raw count-based data and data that isn't count-based (e.g., you shouldn't use a t-test on raw-count data). Which sort of method ends up being most appropriate will likely depend on exactly how the counts were done and the normalization performed.
BTW, it's best not to do a Fisher's test with this data. Taking the average of the numbers or adding them up will likely not give you a reliable answer (this is just to essentially echo what Michael wrote in a comment below).
The numbers remaining after normalization are indeed always integers, although I do not quite know the method used for the normalization itself. I would even have expected that such a normalization step should produce decimals, but apparently it doesn't. I will find out.
This was the answer I received when I asked the sequencing center this question: "for each sample, we provided original counts, and counts normalized to 20M reads. 20M normalization for each barcode is: barcode reads*20M/total lane reads. normalized reads are rounded up to nearest integer." Does this answer get us any further in determining what approach to use with this data?
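The normalization quoted above can be sketched as follows. This is a minimal sketch that assumes "rounded up" means taking the ceiling (it could also mean ordinary rounding); the function name and the 30M-read lane are hypothetical.

```python
TARGET_READS = 20_000_000  # the "20M" from the sequencing center's description


def normalize_to_20m(barcode_reads, total_lane_reads):
    """Scale a raw barcode count to 20M lane reads and round up.

    Assumes "rounded up" means the ceiling; integer arithmetic avoids
    floating-point rounding surprises.
    """
    scaled = barcode_reads * TARGET_READS
    return -(-scaled // total_lane_reads)  # ceiling division


# Hypothetical lane with 30M total reads.
print(normalize_to_20m(150, 30_000_000))  # 100
print(normalize_to_20m(1, 30_000_000))    # 1 (0.67 rounded up)
print(normalize_to_20m(0, 30_000_000))    # 0
```

Under this reading the normalized values are always integers, and every nonzero raw count maps to at least 1.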
Wait, these are reads? Above you said that these were cell counts! Which is it?
Edit: Whoever came up with that library size normalization method should be fired immediately.
This is a good example that one cannot analyze data without knowing where they actually come from and what they represent. Now, suddenly the "cell counts" turn out to be "sequencing reads". Unfortunately, I have the impression that this question, and any attempt to answer it, is causing more confusion than clarity. Therefore, I see no other way than to close it for the reason of "cannot be answered/too little detail given". That doesn't mean it can't be reopened after substantial clarification.
@Michael: That seems reasonable.
@jobinv, talk again to whoever set up this experiment and try to figure out what the experiment is prior to posting a new question. From your last reply, it sounds like standard RNAseq (you would need to use the non-normalized counts), though who knows at this point.
Hold on, hold on. It is still cell counts, and has never been anything different. The experimental setup of an shRNA study is such that there is a sequencing step involved to find the counts for each silenced gene. That is, the shRNAs have a barcode that is used to identify them. Thus: sequence the barcodes, and you know how many of the surviving cells contain that particular shRNA. I.e. it is still a cell count.
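The counting step described here might look roughly like the sketch below. The barcode sequences, the lookup table, and the function name are all hypothetical; strictly, the result is the number of reads carrying each barcode in one sample.

```python
from collections import Counter

# Hypothetical barcode-to-shRNA lookup; a real library has thousands of entries.
BARCODE_TO_SHRNA = {
    "ACGTAC": "AAA",
    "TTGCAA": "BBB",
    "GGATCC": "ZZZ",
}


def count_shrnas(barcode_reads):
    """Count reads per shRNA, skipping barcodes missing from the lookup."""
    counts = Counter()
    for barcode in barcode_reads:
        shrna = BARCODE_TO_SHRNA.get(barcode)
        if shrna is not None:
            counts[shrna] += 1
    return counts


# Hypothetical reads from one sample; "NNNNNN" is an unrecognized barcode.
reads = ["ACGTAC", "ACGTAC", "TTGCAA", "NNNNNN", "ACGTAC"]
sample_counts = count_shrnas(reads)
print(sample_counts)
```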
I can of course find more clarification, but what specific information is it that is needed? My first posting of the question was criticized for being too complicated, and I was told by Dr. Istvan Albert "You are really looking for methods to interpret counts in a tabular file - that operation is pretty independent on how you got those counts." As a result, I cut away all the experimental details and left only the tabular question.
Tell me what you would like me to find out (or answer myself, if it turns out that I happen to know already), and I will find it. But please don't close this, I do need the help...
Oh, there's nothing about bar-coded reads in any of the revisions of the original question. Is this paper an accurate representation of how your experiment is actually set up? (BTW, that's a methods paper for an R package that's probably what you need for your analysis.) If so, those counts don't actually represent cell numbers so much as integrated shRNA counts (there could be multiple copies per cell).
A word of advice as you continue your foray into data analysis, it's always a good idea to know exactly how any experiment was performed before trying to undertake an analysis of its results. If you can't sketch out the workflow (or go to the bench and do it), then you'll need more details before proceeding.
That article does look very relevant yes. I will read through it in more detail.
The advice is appreciated, and I agree. To be honest, I was actually under the impression that I had enough information about the experiment, but you're right, I guess I was a bit naïve about that. Thanks for your help. I do hope you don't have the feeling that I've wasted your time with this.
No worries, it's tough to know what you don't know until you know that you don't know it (sorry if that sounds like something Donald Rumsfeld would say). I ended up reading about functional genome-wide RNAi/shRNA screens, so I learned something in the process too!
Hi, thanks to the article I think we have enough information to open the question again. However, it will be better to edit the question again and incorporate all the information in the question text. Now that you have an article that describes your method it will be much easier to help. Do make sure that this is exactly what was done, and if not, highlight the differences.
I disagree with the sentiment that one can simply look at a matrix of anything (counts, reads, whatever) and then analyze it, and I am not so sure that Istvan wanted to claim it the way it sounds. I think it is important to know how the values were obtained and what the rows and columns actually are. As an example, counts of fish per volume in different parts of a river might behave very differently from counts of mapped reads from RNA-seq, even though the numbers could look similar. Knowledge about the experiment will give hints on the error distribution (e.g. Poisson vs. negative binomial: do I need to model overdispersion or not?) and help in choosing the correct statistical model for the analysis.
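One quick way to look at the Poisson versus negative binomial question is to compare the sample variance with the mean across replicates. The sketch below does this for the three control values of gene AAA from the example table; the function names are made up for illustration, and three replicates are far too few to estimate dispersion reliably, so this only shows the idea.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by mean; about 1 for Poisson-like counts."""
    return variance(counts) / mean(counts)


def nb_size_estimate(counts):
    """Method-of-moments estimate of the negative binomial size parameter.

    Uses var = mean + mean**2 / size; returns None when the sample shows
    no overdispersion (variance <= mean).
    """
    m, v = mean(counts), variance(counts)
    if v <= m:
        return None
    return m * m / (v - m)


ctrl_aaa = [200, 300, 212]  # Ctrl_1..Ctrl_3 for gene AAA in the example table
print(f"variance/mean = {dispersion_index(ctrl_aaa):.1f}")
print(f"NB size estimate = {nb_size_estimate(ctrl_aaa):.1f}")
```

A ratio well above 1 suggests overdispersion relative to Poisson, which is the situation a negative binomial model is meant to handle.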
I agree with dpryan that more details about the experimental setup are required, and about what the actual question behind it is. Do you have measurements of the same cell culture after 10 days and also before treatment? The question I would find logical to ask is: "Does KO of Gene X lead to a significant reduction in cell survival after treatment with Temozolomide (or whatever)?" To answer this, you need measurements of cell counts before and after treatment, since those are paired: the number of cells afterwards depends on the number of cells in the culture at the start. I would even say that if you do not have these before-treatment measurements, no relevant conclusion can be drawn from the data at all.
In addition, the question should be raised about details of the cell counting methods. I have my doubts whether these are in fact absolute counts (e.g. the total number of cells alive is 231), or rather concentrations, e.g. average number of cells per unit volume. This might also have an influence on the statistical model to use. In addition, should we also take cell division into account? Could there be more cells after the treatment than in the beginning?