Question

Extract common differentially expressed genes (DEGs) of different data sets. (Microarray Data Analysis)

4

10.3 years ago

jeevan92ultimate ▴ 40

Hi, I have 10 microarray data sets (each related to a disease), and I have already obtained the differentially expressed genes (DEGs) for each individual data set. I want to extract the DEGs common to all the data sets. Is there any tool/R package/R function to do this?

I have the ProbesetIDs, Genenames, GeneSymbols, Entrez IDs, LogFC values, B values, and P values in all the data sets.

Any help would be appreciated. Thanks!

microarray • 8.0k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by jeevan92ultimate ▴ 40

0

I need to know how you extracted those DEGs for each particular disease. Please help me out.

ADD REPLY • link 5.1 years ago by tej_pim18 • 0

0

10.3 years ago

jeevan92ultimate ▴ 40

That's so simple, why didn't I get this :facepalm:

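For example, the 4th column of every dataset has the Entrez gene ID (I'll do the same with gene symbol and gene name anyway). So I'll do it as follows:

# R object names cannot start with a digit, so D1 ... D10 rather than 1D ... 10D
D1 <- as.vector(dataset1[, 4])
D2 <- as.vector(dataset2[, 4])
# ... likewise for dataset3 through dataset9 ...
D10 <- as.vector(dataset10[, 4])
# mget() collects D1 ... D10 into a list; Reduce() intersects them pairwise
D11 <- Reduce(intersect, mget(paste0("D", 1:10)))

Thank you so much :)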

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by jeevan92ultimate ▴ 40

1

You're welcome ;) The functional nature of R is powerful.

Upvoting and/or accepting useful answers makes the site more efficient, so it is encouraged.

ADD REPLY • link 10.3 years ago by David Fredman ★ 1.1k

0

Sorry for this noob question, I have been stuck at one point for the past few days.

I ran the following commands.

> dataset1
Probe-ID    Genename    Genesymbol    LogFC
A           ATPoly      ATP            0.2
B           BTPoly      BTP           -0.5
C           CTPoly      CTP            0.8
D           DTPoly      DTP            0.7
E           ETPoly      ETP           -0.3

> dataset2
Probe-ID    Genename    Genesymbol    LogFC
C           CTPoly      CTP            0.1
D           DTPoly      DTP           -0.6
E           ETPoly      ETP            0.7
F           FTPoly      FTP            0.9
G           GTPoly      GTP           -0.2

D1 <- as.vector(dataset1[ ,3])
D2 <- as.vector(dataset2[ ,3])
AD <- Reduce(intersect, list(D1,D2))

> AD
Genesymbol
CTP
DTP
ETP

With the above commands, I only get back the gene symbols that are common to dataset1 and dataset2.

I couldn't figure out how to retrieve the LogFC values and Genenames that go with those gene symbols in both datasets. I need something like this.

Genesymbol    Genename    LogFC-dataset1    LogFC-dataset2
CTP           CTPoly       0.8               0.1
DTP           DTPoly       0.7              -0.6
ETP           ETPoly      -0.3               0.7

I think the LogFC values and Genenames of dataset1 and dataset2 should each be retrieved individually on the basis of 'AD'.

How can I actually do it? I tried the merge function, but couldn't get it to work. Being a hardcore biologist and a beginner in bioinformatics, I find it really confusing.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by jeevan92ultimate ▴ 40

0

I got it :) The match function did the job:

# AD is a plain character vector here, so it is indexed without [, 1]
final <- match(AD, dataset1[, 3], nomatch = NA_integer_, incomparables = NULL)

> final
3 4 5

# above numbers are the rows

> dataset1[c(3,4,5),]

Probe-ID    Genename    Genesymbol    LogFC
C           CTPoly      CTP            0.8
D           DTPoly      DTP            0.7
E           ETPoly      ETP           -0.3

I can do it on every individual dataset and combine everything.

Thanks :)

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by jeevan92ultimate ▴ 40
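A minimal sketch of how the table asked for above could be built in one step with `merge`, instead of running `match` on each dataset and combining by hand. The two data frames are typed in from the example tables in this thread; the column name `ProbeID` (without the hyphen) and the `.dataset1`/`.dataset2` suffixes are assumptions made here for illustration.

```r
# Toy data copied from the example tables in this thread.
# "ProbeID" replaces "Probe-ID" because a hyphen is not valid in a plain R name.
dataset1 <- data.frame(
  ProbeID    = c("A", "B", "C", "D", "E"),
  Genename   = c("ATPoly", "BTPoly", "CTPoly", "DTPoly", "ETPoly"),
  Genesymbol = c("ATP", "BTP", "CTP", "DTP", "ETP"),
  LogFC      = c(0.2, -0.5, 0.8, 0.7, -0.3)
)
dataset2 <- data.frame(
  ProbeID    = c("C", "D", "E", "F", "G"),
  Genename   = c("CTPoly", "DTPoly", "ETPoly", "FTPoly", "GTPoly"),
  Genesymbol = c("CTP", "DTP", "ETP", "FTP", "GTP"),
  LogFC      = c(0.1, -0.6, 0.7, 0.9, -0.2)
)

# Keep only the columns wanted in the final table
cols <- c("Genesymbol", "Genename", "LogFC")

# merge() defaults to an inner join, so only genes present in both
# datasets survive; the suffixes label which dataset each LogFC came from.
combined <- merge(dataset1[, cols], dataset2[, cols],
                  by = c("Genesymbol", "Genename"),
                  suffixes = c(".dataset1", ".dataset2"))
print(combined)
```

The result has one row per shared gene (CTP, DTP, ETP) with columns `LogFC.dataset1` and `LogFC.dataset2`. Other value columns such as P values could be carried along by adding them to `cols`.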

Ram · Accepted Answer · 2014-08-02

If I understand it correctly, you simply want to find the unique gene identifiers (or probe ids) that are differentially expressed in all experiments? One simple way to do that in R would be (here for three sets):

a = c('gene1','gene3','gene5','gene7','gene9')
b = c('gene3','gene6','gene8','gene9','gene10')
c = c('gene2','gene3','gene4','gene5','gene7','gene9')

Reduce(intersect, list(a,b,c))

[1] "gene3" "gene9"

On a side note, if you are requiring a gene to be significantly differentially expressed in all experiments, that is a fairly tough threshold. Since experiments typically do not have the power to detect all genes that are truly differentially expressed, you are likely to miss some in each set due to randomness. You could, alternatively, require that a gene is significantly differentially expressed in some experiments, and has a fold change in the same direction (or over some meaningful threshold) in the others.
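One possible way to express that relaxed criterion in R is sketched below. Everything in it is an assumption for illustration: the toy result tables, the column names `gene`, `logFC` and `adj.P.Val`, the cutoff `alpha`, and the rule "significant in at least `min_sig` experiments with the same fold-change sign in all of them". It also assumes each table holds all tested genes, not only the significant ones.

```r
# Hypothetical full result tables (all tested genes, not only the DEGs)
results <- list(
  exp1 = data.frame(gene = c("g1", "g2", "g3"),
                    logFC = c(1.2, -0.9, 0.4),
                    adj.P.Val = c(0.01, 0.03, 0.60)),
  exp2 = data.frame(gene = c("g1", "g2", "g3"),
                    logFC = c(0.8, 0.7, 0.5),
                    adj.P.Val = c(0.20, 0.04, 0.02)),
  exp3 = data.frame(gene = c("g1", "g2", "g3"),
                    logFC = c(1.0, -1.1, 0.3),
                    adj.P.Val = c(0.04, 0.01, 0.70))
)
alpha   <- 0.05  # assumed significance cutoff
min_sig <- 2     # assumed: significant in at least this many experiments

# Genes measured in every experiment
genes <- Reduce(intersect, lapply(results, function(d) as.character(d$gene)))

# Build genes x experiments matrices, with rows aligned by gene
lfc <- do.call(cbind, lapply(results, function(d) d$logFC[match(genes, d$gene)]))
sig <- do.call(cbind, lapply(results, function(d) d$adj.P.Val[match(genes, d$gene)] < alpha))
rownames(lfc) <- rownames(sig) <- genes

n_sig     <- rowSums(sig)                            # experiments where significant
same_sign <- abs(rowSums(sign(lfc))) == ncol(lfc)    # all up or all down

keep <- genes[n_sig >= min_sig & same_sign]
print(keep)
```

In this toy data only `g1` passes: it is significant in two of three experiments and up-regulated in all of them, while `g2` changes direction and `g3` is significant only once.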

3

10.3 years ago

Cytosine ▴ 460

Trying to do something like this?

# three toy data frames that share a "gene" column
gene <- c("a", "b", "c"); expr <- c(2, 2, 3)
x <- data.frame(gene, expr)
gene <- c("c", "b", "e")
y <- data.frame(gene, expr)
# inner join on "gene": keeps only the genes present in both x and y
temp <- merge(x, y, by = "gene")
gene <- c("c", "e", "d")
z <- data.frame(gene, expr)
# join the running result with the next data frame
temp <- merge(temp, z, by = "gene")
#...
#repeat for all your dataframes

Essentially you're matching the dataframes 1 by 1 on a specific column until you've merged all of them.

In your case you could match on e.g. the "Genename" column.
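With ten data frames, the one-by-one merging can be folded into a single `Reduce` call. The sketch below assumes the data frames sit in a named list and share a `gene` key column; the toy data, the list names `d1` to `d3`, and the `logFC.<name>` column labels are hypothetical.

```r
# Hypothetical toy data frames standing in for the real DEG tables
datasets <- list(
  d1 = data.frame(gene = c("a", "b", "c"), logFC = c(2.0, 2.0, 3.0)),
  d2 = data.frame(gene = c("c", "b", "e"), logFC = c(1.5, -1.0, 0.5)),
  d3 = data.frame(gene = c("c", "e", "d"), logFC = c(0.8, 1.2, -2.0))
)

# Rename each value column after its dataset so the merged columns
# stay distinguishable (logFC.d1, logFC.d2, ...)
tagged <- lapply(names(datasets), function(nm) {
  d <- datasets[[nm]]
  names(d)[names(d) == "logFC"] <- paste0("logFC.", nm)
  d
})

# Fold merge() over the list: each step is an inner join on "gene",
# so only genes present in every data frame remain at the end
common <- Reduce(function(x, y) merge(x, y, by = "gene"), tagged)
print(common)
```

Here only gene `c` appears in all three data frames, so `common` has a single row with one logFC column per dataset.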

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Cytosine ▴ 460

0

Will try it. Thank you.

ADD REPLY • link 10.3 years ago by jeevan92ultimate ▴ 40

0

This is really useful. I can actually extract the values like LogFCs, P values with the genenames using the above method. Didn't try it yet, but it should work for sure. Thank you so much :)

ADD REPLY • link 10.3 years ago by jeevan92ultimate ▴ 40