Extract common differentially expressed genes (DEGs) of different data sets. (Microarray Data Analysis)
10.3 years ago

Hi, I have 10 microarray data sets (each related to a disease), and I have already identified the differentially expressed genes (DEGs) in each individual data set. I want to extract the DEGs common to all the data sets. Is there a tool, R package, or R function to do this?

I have the probe set IDs, gene names, gene symbols, Entrez IDs, logFC values, B values, and P values for all the data sets.

Any help would be appreciated. Thanks!

microarray • 8.0k views

I need to know how you extracted the DEGs for each particular disease. Please help me out.

10.3 years ago
David Fredman ★ 1.1k

If I understand it correctly, you simply want to find the unique gene identifiers (or probe ids) that are differentially expressed in all experiments? One simple way to do that in R would be (here for three sets):

a = c('gene1','gene3','gene5','gene7','gene9')
b = c('gene3','gene6','gene8','gene9','gene10')
c = c('gene2','gene3','gene4','gene5','gene7','gene9')

Reduce(intersect, list(a,b,c))

[1] "gene3" "gene9"

On a side note, requiring a gene to be significantly differentially expressed in all experiments is a fairly tough threshold. Since experiments typically do not have the power to detect every gene that is truly differentially expressed, you are likely to miss some in each set due to randomness. You could, alternatively, require that a gene is significantly differentially expressed in some experiments and has a fold change in the same direction (or over some meaningful threshold) in the others.
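One way to relax the all-experiments requirement is to count how many gene lists each gene appears in and keep the genes that reach a cutoff. This is only a sketch: `deg_sets` reuses the toy vectors above, and `min_sets` is a hypothetical cutoff you would have to choose yourself. It does not check fold-change direction.

```r
# Toy gene lists, one vector of significant genes per experiment
deg_sets <- list(
  a = c("gene1", "gene3", "gene5", "gene7", "gene9"),
  b = c("gene3", "gene6", "gene8", "gene9", "gene10"),
  c = c("gene2", "gene3", "gene4", "gene5", "gene7", "gene9")
)

# Count in how many experiments each gene is called significant.
# unique() guards against a gene being listed twice within one set.
hits <- table(unlist(lapply(deg_sets, unique)))

# Assumed cutoff: keep genes found in at least 2 of the 3 sets
min_sets <- 2
relaxed <- names(hits)[hits >= min_sets]
relaxed
```

Setting `min_sets` to the number of sets gives the same genes as the `Reduce(intersect, ...)` call.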


True indeed. I hope I will get at least 25-30 genes in common.

10.3 years ago
Cytosine ▴ 460

Trying to do something like this?

# Three toy data frames that share a "gene" column
expr <- c(2, 2, 3)
x <- data.frame(gene = c("a", "b", "c"), expr)
y <- data.frame(gene = c("c", "b", "e"), expr)
z <- data.frame(gene = c("c", "e", "d"), expr)
# merge() does an inner join by default, so only genes present in both inputs survive
temp <- merge(x, y, by = "gene")
temp <- merge(temp, z, by = "gene")
# ...
# repeat for all your data frames

Essentially you're merging the data frames one by one on a specific column until all of them are merged.

In your case you could match on, e.g., "Genenames".
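The repeated merge can also be folded over a list of data frames. The sketch below is an illustration under assumptions, not a tested pipeline: the three toy data frames and their `Genesymbol` and `LogFC` column names are made up, and the `LogFC` column is renamed per data set first so that the merged columns stay distinguishable.

```r
# Hypothetical toy data frames, one per data set
dfs <- list(
  dataset1 = data.frame(Genesymbol = c("ATP", "CTP", "DTP", "ETP"),
                        LogFC = c(0.2, 0.8, 0.7, -0.3),
                        stringsAsFactors = FALSE),
  dataset2 = data.frame(Genesymbol = c("CTP", "DTP", "FTP"),
                        LogFC = c(0.1, -0.6, 0.9),
                        stringsAsFactors = FALSE),
  dataset3 = data.frame(Genesymbol = c("DTP", "CTP", "GTP"),
                        LogFC = c(1.1, -0.4, 0.5),
                        stringsAsFactors = FALSE)
)

# Rename LogFC to LogFC_<data set name> before merging
renamed <- Map(function(d, nm) {
  names(d)[names(d) == "LogFC"] <- paste0("LogFC_", nm)
  d
}, dfs, names(dfs))

# Inner-join all data frames on the shared key, two at a time
merged <- Reduce(function(left, right) merge(left, right, by = "Genesymbol"),
                 renamed)
merged
```

Only genes present in every data frame remain, each with one logFC column per data set.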


Will try it. Thank you.


This is really useful. With the above method I can extract values such as the logFCs and P values along with the gene names. I haven't tried it yet, but it should work. Thank you so much :)

10.3 years ago

That's so simple, why didn't I get this :facepalm:

For example, the 4th column of every data set holds the Entrez gene ID (I'll also do it with the gene symbol and gene name). So I'll do it as follows:

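# Collect dataset1 ... dataset10 (assumed to exist in the workspace) into a list
datasets <- mget(paste0("dataset", 1:10))
# Column 4 of every data set holds the Entrez ID
ids <- lapply(datasets, function(d) as.vector(d[, 4]))
# IDs present in all ten data sets
common <- Reduce(intersect, ids)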

Thank you so much :)


You're welcome ;) The functional nature of R is powerful.

Upvoting and/or accepting useful answers makes the site more efficient, so it is encouraged.


Sorry for this noob question; I have been stuck at this point for the past few days.

I run the following commands.

> dataset1
Probe-ID    Genename    Genesymbol    LogFC
A           ATPoly      ATP            0.2
B           BTPoly      BTP           -0.5
C           CTPoly      CTP            0.8
D           DTPoly      DTP            0.7
E           ETPoly      ETP           -0.3

> dataset2
Probe-ID    Genename    Genesymbol    LogFC
C           CTPoly      CTP            0.1
D           DTPoly      DTP           -0.6
E           ETPoly      ETP            0.7
F           FTPoly      FTP            0.9
G           GTPoly      GTP           -0.2

D1 <- as.vector(dataset1[ ,3])
D2 <- as.vector(dataset2[ ,3])
AD <- Reduce(intersect, list(D1,D2))

> AD
Genesymbol
CTP
DTP
ETP

With the above commands, I only get back the gene symbols that are common to dataset1 and dataset2.

I couldn't figure out how to retrieve the LogFC values and gene names that go with those gene symbols in both data sets. I need something like this:

Genesymbol    Genename    LogFC-dataset1    LogFC-dataset2
CTP           CTPoly       0.8               0.1
DTP           DTPoly       0.7              -0.6
ETP           ETPoly      -0.3               0.7

I think the LogFC values and gene names should be retrieved from dataset1 and dataset2 individually, on the basis of 'AD'.

How can I actually do it? I tried the merge function but couldn't get it to work. As a hardcore biologist and a beginner in bioinformatics, I find this really confusing.


I got it :) The match function did the job:

# AD is a plain character vector, so it is passed directly (AD[,1] would fail)
final <- match(AD, dataset1[, 3])

> final
3 4 5

# above numbers are the rows

> dataset1[c(3,4,5),]

Probe-ID    Genename    Genesymbol    LogFC
C           CTPoly      CTP            0.8
D           DTPoly      DTP            0.7
E           ETPoly      ETP           -0.3

I can do it on every individual dataset and combine everything.

Thanks :)
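For the two example data sets, the table requested above (gene symbol, gene name, and one logFC column per data set) can probably also be built in a single `merge` call. The sketch below retypes the example tables by hand; the column name `ProbeID` (instead of `Probe-ID`) and the `_dataset1`/`_dataset2` suffixes are assumptions made for illustration.

```r
# The two example tables from this thread, retyped as data frames
dataset1 <- data.frame(
  ProbeID    = c("A", "B", "C", "D", "E"),
  Genename   = c("ATPoly", "BTPoly", "CTPoly", "DTPoly", "ETPoly"),
  Genesymbol = c("ATP", "BTP", "CTP", "DTP", "ETP"),
  LogFC      = c(0.2, -0.5, 0.8, 0.7, -0.3),
  stringsAsFactors = FALSE
)
dataset2 <- data.frame(
  ProbeID    = c("C", "D", "E", "F", "G"),
  Genename   = c("CTPoly", "DTPoly", "ETPoly", "FTPoly", "GTPoly"),
  Genesymbol = c("CTP", "DTP", "ETP", "FTP", "GTP"),
  LogFC      = c(0.1, -0.6, 0.7, 0.9, -0.2),
  stringsAsFactors = FALSE
)

# Inner join on Genesymbol; the suffixes tell the two LogFC columns apart
both <- merge(
  dataset1[, c("Genesymbol", "Genename", "LogFC")],
  dataset2[, c("Genesymbol", "LogFC")],
  by = "Genesymbol",
  suffixes = c("_dataset1", "_dataset2")
)
both
```

The result has one row per shared gene symbol, so no separate `match` step per data set is needed.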
