Question

What kind of analysis is practically done on GSE data files?

0

5.6 years ago

shahdhruv7 • 0

I have a GSE data file in CSV format with the fields ID, adj.P.Val, P.Value, t, B, logFC, Gene.symbol and Gene.title, of which adj.P.Val, P.Value, t, B and logFC are numeric. What factors do I need to consider if I want to cluster the data on logFC alone using the K-Means clustering algorithm? And, first of all, is it feasible to perform clustering on GSE data files? If yes, what should the approach be? If not, what other kinds of analysis can be performed on such datasets?

gene-expression GSE • 1.2k views

updated 5.6 years ago by Michael 56k • written 5.6 years ago by shahdhruv7 • 0

0

What question are you trying to address with this work? One doesn't just analyse data for the sake of analysing data.

5.6 years ago by Jean-Karim Heriche 27k

0

Your question is unspecific. If you want to do k-means, please read a tutorial first and then ask specific questions. People are typically happy to help debug your code or advise on specific problems, but are reluctant to spoon-feed. So please invest some effort into getting the background first, and then come back with specific questions.

5.6 years ago by ATpoint 88k

score 1 · Answer 1 · 2019-11-22

That looks like the output of a statistical test on RNA-seq, microarray, SWATH or similar data. You cannot run a meaningful cluster analysis on it because it contains only a single condensed differential expression value per gene. At most, the dataset splits into two groups, the "significant" and the "non-significant" genes, and these depend on your cutoffs for adj.P.Val (e.g. 0.05) and logFC (e.g. ±1). You can do a few things that are pretty much standard:

Get the raw data, pre-process it and cluster it. This might make sense if there are more than two conditions or samples, perhaps using only the significant genes.
Get more meaningful contrasts like this one from similar experiments; that is, change your experimental design to accommodate a time series, different stressors, multiple cell lines, you name it.
Do an enrichment analysis, e.g. GO enrichment of the significantly differentially expressed genes.
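
To make the cutoff and enrichment ideas a bit more concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption for illustration: the five rows of `TOY_CSV`, the gene names, the `toy_gene_set` standing in for one GO term, and the helper names `classify` and `enrichment_p`. Only the column names are taken from the question. It labels each gene with the example cutoffs (adj.P.Val < 0.05, |logFC| >= 1) and then runs a one-sided hypergeometric over-representation test, which is one common way to do enrichment but not the only one.

```python
import csv
import io
from math import comb

# Toy stand-in for the CSV described in the question (all values are made up).
TOY_CSV = """ID,adj.P.Val,P.Value,t,B,logFC,Gene.symbol,Gene.title
p1,0.001,0.0001,8.1,5.2,2.3,GENE_A,hypothetical gene A
p2,0.030,0.0040,-4.9,1.1,-1.6,GENE_B,hypothetical gene B
p3,0.200,0.0900,1.8,-3.0,1.4,GENE_C,hypothetical gene C
p4,0.004,0.0007,6.0,3.9,0.4,GENE_D,hypothetical gene D
p5,0.800,0.6000,0.3,-6.1,0.1,GENE_E,hypothetical gene E
"""


def classify(row, alpha=0.05, min_abs_lfc=1.0):
    """Label one result row as 'up', 'down' or 'ns' (not significant)."""
    adj_p = float(row["adj.P.Val"])
    lfc = float(row["logFC"])
    if adj_p >= alpha or abs(lfc) < min_abs_lfc:
        return "ns"
    return "up" if lfc > 0 else "down"


def enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric p-value: P(overlap >= n_overlap)."""
    upper = min(n_in_set, n_selected)
    ways = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return ways / comb(n_universe, n_selected)


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
labels = {row["Gene.symbol"]: classify(row) for row in rows}
selected = {gene for gene, lab in labels.items() if lab != "ns"}

# Hypothetical gene set, standing in for the genes annotated to one GO term.
toy_gene_set = {"GENE_A", "GENE_B", "GENE_E"}
p = enrichment_p(
    len(rows), len(toy_gene_set), len(selected), len(selected & toy_gene_set)
)
print(labels)
print(f"over-representation p-value: {p:.3f}")
```

With a real file you would read it with `open(...)` instead of `io.StringIO`, use all tested genes as the universe, and correct for testing many gene sets. Dedicated enrichment tools handle those details for you.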

Maybe simply make a heatmap instead of running k-means, because k-means output is not easy to visualize. As others have noted, it is better to think about the experimental design and the experimental question while planning the experiment. If you were simply given that file to toy around with k-means, it is not a good starting point, and you should be able to find a much more suitable multivariate dataset.
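
If it helps to see why clustering on logFC alone adds little, here is a rough sketch of plain Lloyd's k-means on a single variable, written with the standard library only. The `toy_logfc` values are made up and `kmeans_1d` is a hypothetical helper, not a function from any package. With one variable, each point goes to its nearest center, so every cluster is a contiguous stretch of the logFC axis. In other words, k-means here just picks cut points, much like choosing logFC thresholds by hand.

```python
import random


def kmeans_1d(values, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on scalars; returns (centers, labels)."""
    rng = random.Random(seed)
    # Start from k distinct observed values.
    centers = rng.sample(sorted(set(values)), k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every value.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Made-up logFC values, for illustration only.
toy_logfc = [-2.4, -1.9, -1.7, -0.3, -0.1, 0.0, 0.2, 0.4, 1.6, 2.1, 2.8]
centers, labels = kmeans_1d(toy_logfc, k=3)

# Walk along the sorted logFC axis and count how often the label changes.
ordered = [lab for _, lab in sorted(zip(toy_logfc, labels))]
n_switches = sum(a != b for a, b in zip(ordered, ordered[1:]))
print(sorted(round(c, 2) for c in centers))
print("label changes along the axis:", n_switches)
```

The label never changes more than k - 1 times along the sorted axis, whatever the starting centers. Clustering becomes informative once each gene has several measurements (samples, time points, conditions), which is why going back to the underlying expression matrix is the more useful route.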