Hi all
I need to do a weighted correlation network analysis on a dataset made of gene expression data of 169 samples (patients) that I downloaded from GEO, and I am trying to use the WGCNA package on R.
A colleague told me I have to collapse the probesets to genes. I have tried to do it through the collapseRows() function, but I'm experiencing some errors.
Here's my function call:
result <- collapseRows(datET=combat_edata,
rowGroup=unique(combat_edata_genes),
rowID=rownames(combat_edata),
method="function",
methodFunction=colMeans)
The combat_edata
dataframe is my output from the ComBat()
function that I used for batch correction, while the combat_edata_genes
list is the result of the retrieval of the genes associated to "combat_edata
in the hsapiens_gene_ensembl
mart, through the getBM() function.
As you can see, I used combat_edata
as my main input data, and I set rowGroup
as the list of unique genes. I set rowID
as the row names of the dataframe (features of the GEO dataset).
My call to collapseRows()
returns an error:
Error: rowGroup and rowID not the same length"
Which is true. Here are the lengths of the variables:
length(rownames(combat_edata)): 33,297
length(unique(combat_edata_genes)): 27,217
length(combat_edata_genes): 47,127
The number of rows of my dataframe (the features of the GEO dataset) and the number of genes are different. I don't know how to handle the situation... Do you have any idea on how to handle this issue?
How can I solve perform the collapsing of the probesets to genes correctly?
EDIT: I don't necessarily have to use collapseRows()
; if you know another method and you can explain me how to use it easily, you're welcome to propose it. Thanks
Thanks, but how do I perform that operation in R?
Which operation exactly? If you need to annotate your probesets in order to get list of corresponding genes. You can do it like this:
It will give you 1 to multiple relations list. Then you just check, which probesets map to different genes at the same time using basic R operations with data.frame.
Or if that's not what you ask, please, elaborate.
P.S. What array you use? Because if Affy, then instead of WGCNA you must use different probesets annotations from BrainArray project (custom cheap definition files).
Hi Aln
Thanks for your help.
I tried to use your code but it gives me an error:
How can I solve this error?
Also, I don't know about the BrainArray project. If you have suggestions on that, please share them. Thanks!
My example was for the particular Affymetrix array, your probesets ids do not look like typical affymetrix ids. What array do you use exactly?
Thanks for having replied. It's Affymetrix, as I read it on the GEO repository for the dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59867
You could try using the annotation file from Gemma for this platform: https://gemma.msl.ubc.ca/arrays/showArrayDesign.html?id=380 , click the link that says "Basic"
First column is ProbeName, second one is a gene symbol. If you use it in your work, don't forget to cite Gemma (Zoubarev, A., et al., Gemma: A resource for the re-use, sharing and meta-analysis of expression profiling data. Bioinformatics, 2012.)
You could also download the annotation for the platform and parse it out yourself. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6244 (The "Download full table..." button.)
Davide, I am having some déjà vu... I originally showed you how to convert Affy probe IDs to gene symbols, here: C: How to map Affymetrix probe ID's from a GEO dataset to corresponding gene symbol (follow the other link).
To then summarise expression over genes, use the
aggregate()
function in R.Note that, with microarray data and starting with the raw data files, one can summarise expression over genes or exons via the
target
parameter that is passed torma()
. See here: C: Human Exon array probeset to gene-level expressionAffymetrix has different types of arrays, where each array has its own annotation. hgu133plus2.db is an annotation for a particular array, that's why it doesn't work with yours. Your array type is Affymetrix Human Gene 1.0 ST. And judging from your answers you didn't have experience with analyzing arrays in R. I would simply recommend reading this tutorial:
https://bioconductor.org/packages/release/bioc/vignettes/oligo/inst/doc/oug.pdf
Oligo package should work with your array type. And I would certainly recommend starting with raw data.
You can read about Brainarray here:
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp
Just want to add one more comment. There are many ways how to analyze array, with R packages or web-services that were mentioned in this post already. And each analysis step (probes summarization, normalization, annotation etc) can be also done with different tools. You just need to try few of them and choose what you like the most. And the best way to help is if you post actual source code, so we can point you to the right direction, as now your questions are very broad.
The important thing one needs to understand about arrays is that their probe/probeset design is not perfect, and one needs to do additional steps to get one-to-one probe-gene correspondence. One way is to do some sort of "collapsing" or use re-annotated chip definition files (BrainArray etc). Or one can re-annotate probe sequences themself.
Hi Aln
Thanks for your reply and your help. As you might have understood, I'm a newbie with Bioconductor... It's difficult to me to get the right commands for my needs. My source code is all here on GitHub: https://github.com/davidechicco/heart-failure-gene-expression-analysis
Running it should be quite easy: you just need to change
FIRST_RUN = FALSE
toFIRST_RUN = TRUE
on the first run in theretrieve_genes.r
script, and then execute it.Rscript retrieve_genes.r
Feel free to play with it! My goal now is to collapse probesets to genes on the
combat_edata
variable.Thanks
EDIT: This is the head of the annotationLookup of the
combat_edata
variable, in case it helps understanding: