Hello everyone,
I'm studying R and I like to reproduce published work, such as this: Bioinformatics analysis of gene expression profile data to screen key genes involved in pulmonary sarcoidosis.
However, in these protein-protein interaction (PPI) papers they do not give much information about how the data were analysed; they only say that genes with |log2 fold change (FC)| > 1 and FDR < 0.05 were considered the DEGs.
This paper used the dataset GSE16538 from: Crouser ED, Culver DA, Knox KS, Julian MW, et al. Gene expression profiling identifies MMP-12 and ADAMDEC1 as potential pathogenic mediators of pulmonary sarcoidosis. Am J Respir Crit Care Med 2009 May 15;179(10):929-38.
My questions are:
- These articles (PPI network) usually don't repeat the analyses with the same parameters as the original paper. Why?
- Because they don't use the same parameters, they find different results, and this influences the networks. So, should I repeat the parameters of the original paper or not?
- Now, in relation to my studies with R: even when using the same cutoff parameters, I often don't find the same number of genes as the article. Could I be doing something wrong, or does this happen because of the scant information about the R code used?
Best regards,
Leite
Hey Kevin Blighe,
Thank you very much for your reply,
What worries me about the different results is exactly the lack of reproducibility, because each person will use their own specific R code, with specific packages, which are often not mentioned in the methods.
In the example I quoted, the two articles use the same data and get different results (in numbers of genes). I reproduced both analyses and also got different results (in numbers of genes). One thing that may still agree between their findings is that the functions and pathways involved can be broadly the same.
As I am studying R, my doubt was: is this difference in results "normal"?
Original article method: genes with |log2 fold change (FC)| > 1 and FDR < 0.05 were considered the DEGs.
In this paper they found 208 genes; using my code, I found 288.
My R code:
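In outline, it follows the standard affy/limma workflow (the group sizes below are placeholders for the actual GSE16538 samples; the contrast and steps match what I describe in this thread):

```r
# Install the core Bioconductor packages, if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("affy", "limma"))

library(affy)   # reading and normalising Affymetrix CEL files
library(limma)  # linear models for differential expression

# read the GSE16538 CEL files from the working directory and normalise with RMA
celfiles <- ReadAffy()
celfiles.rma <- rma(celfiles)

# design matrix: which arrays are control, which are disease (sizes are placeholders)
groups <- factor(c(rep("control", 6), rep("disease", 6)))
design <- model.matrix(~ 0 + groups)
colnames(design) <- c("control", "disease")

# fit the linear model and the control-disease contrast
fit <- lmFit(exprs(celfiles.rma), design)
contrast.matrix <- makeContrasts(control - disease, levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

# full results table, BH-adjusted
results <- topTable(fit2, coef = 1, number = Inf, adjust.method = "BH")
```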
PS: I manually filter for FDR < 0.05 in the results table.
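For reference, that manual filter is just something like this (column names as produced by topTable()):

```r
# keep probes passing both cutoffs from the topTable output
degs <- subset(results, adj.P.Val < 0.05 & abs(logFC) > 1)
nrow(degs)
```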
Best regards
Leite
Hi Leite, thanks - I moved your post.
I don't see anything incorrect in your code for limma, apart from the line `contrast.matrix <- makeContrasts(control-disease, levels=design)`. The order in which you compare control and disease here just means that positive fold-changes will represent genes higher in controls, and negative fold-changes will represent genes higher in disease. You can easily switch this around, i.e., disease-control, in order to get the opposite (see the sketch at the end of this reply). This will not affect the overall number of statistically significant genes.

As I delve further into the methods, it becomes clearer why results differ, at least from the initial study. Here are the methods used in the initial study by Crouser and colleagues:
So, they used completely different tests, with different assumptions, than limma. They also perform very specific filtering of genes prior to conducting the differential expression analysis. So, it's no surprise that your results differ slightly from this study.
For the study by Li and colleagues, I can only see that they mention the use of limma in the abstract. The abstract, being just an abstract, doesn't give specific information on the exact methods. If you have access to the journal and can download the PDF, I encourage you to look to see exactly what the authors did when it came to their use of limma.
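Coming back to the contrast direction, a minimal sketch of the flipped comparison (assuming your design matrix has columns named control and disease, and that `fit` is your lmFit object):

```r
# disease-control: positive logFC now means higher in disease
contrast.matrix <- makeContrasts(disease - control, levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
```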
Hey Kevin,
Thank you for your attention,
For the study by Li:
This is the method he describes; compared to my code, what's wrong?
When I use disease-control rather than control-disease, I get the same numbers as before, just with the signs inverted, when compared with those in the paper.
In the paper, Li says CCL5 is up-regulated:
CCL5 = 1.455779 in the Li paper
CCL5 = -1.433074703 with my R code (control-disease)
CCL5 = 1.433074703 with my R code (disease-control)
Now I'm very confused.
Again, Thank you so much!
Leite
Hey Leite,
Looks like a very standard limma analysis that they did, but they mention a few things that are indeed vague. To start, I would have these questions for you:
At the end of their methods, they just state that they conducted a t-test and then adjusted by BH. For me, this directly implies that they literally just used the `t.test()` function in R and then adjusted with `p.adjust()`, and that they calculated fold-changes manually by dividing and then logging. limma just fits a linear model to your data (akin to `lm()`), and then essentially it is indeed a t-test that's conducted to compare your conditions of interest. So, it is not clear exactly what they did, but one could assume that they did indeed just use limma and are merely explaining the limma method in brief in their methods.

One key difference that I note is that you adjust your statistics by the empirical Bayes method, whereas they make no mention of this. That may contribute to the difference.
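If they really did take that manual route, it would look roughly like this (a sketch only; `exprs_mat` and `group` are assumed names for the log2 expression matrix and the control/disease factor):

```r
# hypothetical manual route: per-probe t-test, BH adjustment, manual log2 fold-change
pvals <- apply(exprs_mat, 1, function(x) t.test(x ~ group)$p.value)
fdr <- p.adjust(pvals, method = "BH")
log2fc <- rowMeans(exprs_mat[, group == "disease"]) -
    rowMeans(exprs_mat[, group == "control"])
```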
In general, you're probably now understanding better why research is so lacking in reproducibility, but this is a global problem that affects everyone at all levels.
Hey Kevin,
No, I don't have 54674 probes, only 54612.
It looks like this is missing from my code. How can I add this?
How can I add t.test() to my code? Should I remove eBayes()?
If I understood correctly, there is no single, fixed, correct way of doing gene expression analysis; each researcher does it the way they think is correct.
That's pretty much how it goes, yes. Well, 'she' or 'he', in relation to researchers. There are many very skilled female and male bioinformaticians / computational biologists.
In relation to this thread, it looks like the array is a relatively modern Affymetrix array design; thus, you should be able to summarise over probe sets with the `rma()` function, with something like the sketch below. Give that a try.
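Assuming your CEL files are in the current working directory, a minimal version would be:

```r
library(affy)

# read the CEL files and run RMA: background correction,
# quantile normalisation, and probe-set summarisation
celfiles <- ReadAffy()
celfiles.rma <- rma(celfiles)

# the normalised expression matrix, for limma
exprs_mat <- exprs(celfiles.rma)
```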
Also, yes, try with and without eBayes() and see how it affects the results; a sketch of both routes is below. The last time I checked, it won't affect the most statistically significant differentially expressed transcripts; only those at the frontier of statistical significance will be affected.
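Here, `fit2` is your contrasts.fit() object; the unmoderated recipe follows the ordinary-t formula from the limma user's guide:

```r
# moderated (eBayes) route, as you currently do
fit2 <- eBayes(fit2)
moderated <- topTable(fit2, coef = 1, number = Inf, adjust.method = "BH")

# ordinary (unmoderated) t-statistics, for comparison
t.ord <- fit2$coefficients[, 1] / fit2$stdev.unscaled[, 1] / fit2$sigma
p.ord <- 2 * pt(-abs(t.ord), df = fit2$df.residual)
fdr.ord <- p.adjust(p.ord, method = "BH")
```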
I would not use t.test() - I believe that they used the standard limma tests but that they just wanted to elaborate a bit further on how limma works.
The discrepancy in starting probe numbers may be due to control probes. Control probe IDs on the Affymetrix platforms generally start with 'AFFX' or 'affx', so search for those in the celfiles.rma object that you create.
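For example, something like:

```r
# count control probes (IDs beginning with 'AFFX'/'affx') in the normalised object
affx <- grep("^AFFX", featureNames(celfiles.rma), ignore.case = TRUE)
length(affx)
```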
Dear Kevin,
Thank you very much, your help was of great importance for my studies.
No problem my friend! Good luck
Hey Kevin,
I sent you an email; I don't know if you received it.
Hello, yes, I will get back to you shortly!