Differential gene expression analysis of real-valued expression data of genes and individuals
5.5 years ago
Davide Chicco ▴ 120

Hi

I have a gene expression dataset of 102 patients and 9 healthy controls. I downloaded it from GEO, applied several preprocessing steps (normalization, batch correction based on date, etc.), and was finally able to generate a table containing:

  • the individuals on the rows

  • the genes on the columns

  • each entry ij containing a real value that indicates the expression of gene_j in individual_i

This first preprocessing phase took a lot of effort. Now I would like to perform a differential gene expression analysis, to see how gene expression differs between the patients and the healthy controls.

I checked some packages online (such as DESeq2) and noticed that they all have specific input requirements: the input files need to contain raw counts. Unfortunately, I don't have raw counts.

I would like to perform the differential gene expression analysis myself, by applying biostatistics R functions to my preprocessed tables.

How can I do it? Any suggestions?

Thanks!

gen-expression gene-expression RNA-Seq RNA • 1.2k views
5.5 years ago
h.mon 35k

Look at Linear Models for Microarray Data, or limma. The User Guide is particularly helpful:

https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf


Adding to this: what you have is array data, which gives you intensity values rather than counts, so it is a relative measure of gene expression rather than absolute counts as in RNA-seq. limma is pretty much the standard for this kind of data, and following its workflow should get you the intended results. Be sure to read the manual thoroughly, and also look at this end-to-end workflow for Affymetrix microarrays.
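
A minimal sketch of what that workflow could look like for a two-group comparison, assuming the preprocessed values are on a log2 scale. The simulated matrix, the gene and sample names, and the `group` factor are placeholders for illustration, not the original data; the one real constraint is that limma expects genes in rows and samples in columns, so a table with individuals in rows has to be transposed first.

```r
library(limma)

set.seed(1)

# Hypothetical stand-in for the preprocessed table: individuals in rows,
# genes in columns, log2-scale expression values.
n_patients <- 102
n_controls <- 9
n_samples  <- n_patients + n_controls
n_genes    <- 200
tab <- matrix(rnorm(n_samples * n_genes, mean = 8, sd = 1),
              nrow = n_samples,
              dimnames = list(paste0("sample", seq_len(n_samples)),
                              paste0("gene", seq_len(n_genes))))

# Group labels, one per individual, with controls as the reference level.
group <- factor(rep(c("patient", "control"), c(n_patients, n_controls)),
                levels = c("control", "patient"))

# Shift the first gene upwards in patients so there is something to detect.
tab[group == "patient", 1] <- tab[group == "patient", 1] + 3

# limma expects genes in rows and samples in columns, so transpose.
expr <- t(tab)

# Intercept = mean of controls; "grouppatient" = patient minus control.
design <- model.matrix(~ group)

fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Rank genes by evidence for a patient vs control difference.
res <- topTable(fit, coef = "grouppatient", number = Inf, sort.by = "P")
head(res)
```

Here the `logFC` column is the estimated patient-minus-control difference in log2 expression, and `adj.P.Val` is the multiple-testing-adjusted p-value for that difference. With only 9 controls against 102 patients the design is quite unbalanced, so it is probably worth treating borderline results with some caution.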


Thank you for your replies. There's a lot of material online and I feel like I am drowning in it. I found an interesting question and answer here on BioStars.org and tried to apply it to my case. I used lmFit(table) and eBayes(fit), as explained there, without a design matrix.

I was able to generate a table with the values of the fitted model for the patients, and another for the healthy controls. This is the head of the topTable of the patients' fit:

head(topTable(fit, n=Inf, sort="p", p.value=0.05))

         logFC AveExpr   t  P.Value adj.P.Val    B
EEF1A1P5  13.2    13.2 362 5.03e-26  2.05e-22 45.2
MIR6891   13.2    13.2 358 5.77e-26  2.05e-22 45.2
HLA.G     12.8    12.8 356 6.34e-26  2.05e-22 45.1
RN7SK     13.6    13.6 355 6.41e-26  2.05e-22 45.1
MALAT1    12.8    12.8 354 6.65e-26  2.05e-22 45.1
HLA.J     12.9    12.9 352 7.31e-26  2.05e-22 45.1

Some questions:

1) What is the meaning of the p-values associated with each gene that I obtained this way?

2) Was it a good/useful idea to split the patients and healthy controls into two different tables and perform the analysis separately? Or should I keep them together and pass the group information through the design parameter? If the latter, how?

Thanks!
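
On question 1, a small check may help. It assumes, as the limma documentation describes, that `lmFit` treats all samples as replicates of a single group when no design is given. The matrix below is simulated and its gene and sample names are made up for illustration; it only mimics the situation of fitting one group on its own.

```r
library(limma)

set.seed(2)

# Hypothetical log2-scale matrix: 50 genes in rows, 9 samples in columns.
expr <- matrix(rnorm(50 * 9, mean = 13, sd = 0.2), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("sample", 1:9)))

# No design supplied: every sample is treated as a replicate of one group.
fit0 <- eBayes(lmFit(expr))
res0 <- topTable(fit0, number = Inf, sort.by = "P")

# The only coefficient is the gene's mean, so logFC and AveExpr coincide.
all.equal(res0$logFC, res0$AveExpr)

# The p-values therefore test whether the mean log-expression differs from zero.
head(res0)
```

Under that assumption, the identical `logFC` and `AveExpr` columns in the table above would be expected, and the tiny p-values would mainly say that the genes are expressed at a level far from zero, not that they differ between patients and controls. A patient versus control comparison needs both groups in one matrix together with a design that encodes group membership.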
