Question

Can a single probe represent multiple genes?

0

8.3 years ago

mathavanbioinfo ▴ 80

Hello friends

I am doing microarray data analysis (HGU133Plus2) and obtained the expression matrix using gcrma. However, some probes represent multiple genes, as shown below. How should these be treated? Also, some probes are not matched and show NA; can I delete them? Next, I will use this file for WGCNA analysis. Please share your knowledge.

221251_x_at 1   221251_x_at INO80B /// INO80B-WBP1  NA
65133_i_at  1   65133_i_at  INO80B /// INO80B-WBP1  NA
223072_s_at 1   223072_s_at INO80B /// INO80B-WBP1 /// WBP1 NA
1559716_at  1   1559716_at  INO80C  INO80C
229582_at   1   229582_at   INO80C  INO80C
220165_at   1   220165_at   INO80D  INO80D
226555_at   1   226555_at   INO80D  INO80D

R • 3.4k views

ADD COMMENT • link updated 8.3 years ago by ddiez ★ 2.0k • written 8.3 years ago by mathavanbioinfo ▴ 80

score 1 · Answer 1 · 2017-02-10

1

8.3 years ago

palfalvi.gergo ▴ 10

You cannot separate them. Such a probe can show a higher or lower expression level, which can mean that one or both genes have altered RNA levels. However, if one gene increases and the other decreases, you may see no difference in the microarray data. That is one limitation of microarrays.

In WGCNA this should be no problem as long as these probes do not appear as hubs. If they do, confirm all genes under that probe with qPCR.

Alternatively, just filter them out before WGCNA and check how much the results differ with and without those probes. If there are no significant differences, there is no problem.
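The filtering step can be sketched in base R as follows. This is only a minimal sketch under stated assumptions: the object names `annot` and `expr`, the column names, the placeholder probe `hypothetical_at`, and all expression values are made up for illustration. Only the other probe IDs and gene symbols come from the table in the question. It also assumes that multi-gene probes are marked with `///` in the symbol column, as in that table.

```r
# Hypothetical annotation table; real probe IDs and symbols are copied from
# the question, "hypothetical_at" is a placeholder for an unannotated probe.
annot <- data.frame(
  probe  = c("221251_x_at", "223072_s_at", "1559716_at", "220165_at",
             "hypothetical_at"),
  symbol = c("INO80B /// INO80B-WBP1",
             "INO80B /// INO80B-WBP1 /// WBP1",
             "INO80C", "INO80D", NA),
  stringsAsFactors = FALSE
)

# Made-up expression matrix (probes in rows, samples in columns).
expr <- matrix(
  seq_len(10), nrow = 5,
  dimnames = list(annot$probe, c("S1", "S2"))
)

# A probe is treated as ambiguous if it has no symbol or lists several
# genes separated by "///".
is_ambiguous <- is.na(annot$symbol) | grepl("///", annot$symbol, fixed = TRUE)

keep <- annot$probe[!is_ambiguous]
expr_filtered <- expr[keep, , drop = FALSE]

# WGCNA expects samples in rows and probes/genes in columns, so transpose
# both versions before running the analysis on each.
dat_all      <- t(expr)
dat_filtered <- t(expr_filtered)
```

Running WGCNA once on `dat_all` and once on `dat_filtered` gives the two results to compare.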

0

This is true, and it is typically caused by probe sets containing probes that do not really map to the gene as originally intended. However, in some cases it may point to differential expression of splice variants, etc.

ADD REPLY • link 8.3 years ago by ddiez ★ 2.0k

score 1 · Answer 2 · 2017-02-10

An alternative is to use the custom CDF (Chip Description File) files generated by the BrainArray people. Basically, they remap every probe on the array by re-aligning the sequences to the latest genome assembly, and then group the probes into probe sets that represent different things. For example, one mapping is from probes to NCBI Entrez Gene identifiers (called ENTREZG). Other mappings are also available, such as to Ensembl gene, transcript or exon level. You have to choose the correct version of the custom CDF, but if you are using the latest R and Bioconductor, the latest version of the CDF should be OK. Then download and install the packages specific to your Affymetrix chip (search for HGU133Plus2 in the table). Note the Custom CDF Name column: that is the name of the CDF that you have to provide to the function, in this case HGU133Plus2_Hs_ENTREZG.
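Before reading the CEL files, it can help to check that the custom CDF package is actually installed. The sketch below is not part of the original answer. It assumes that the package name is the Custom CDF Name in lower case, with non-alphanumeric characters removed and `cdf` appended; the variable names are illustrative.

```r
# Custom CDF name as given in the BrainArray table.
cdf_name <- "HGU133Plus2_Hs_ENTREZG"

# Assumed naming convention for the matching R package:
# lower case, non-alphanumeric characters dropped, "cdf" appended.
cdf_pkg <- paste0(tolower(gsub("[^[:alnum:]]", "", cdf_name)), "cdf")

# Check whether that package can be loaded, without attaching it.
cdf_installed <- requireNamespace(cdf_pkg, quietly = TRUE)

if (!cdf_installed) {
  message("Package '", cdf_pkg, "' not found; ",
          "download and install it from the BrainArray site first.")
}
```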

Advantages

Probe sets contain only probes mapping to the defined object (gene, transcript, exon, etc).
This improves the estimation of expression level differences.
It also simplifies enormously further analyses.

Disadvantages

Different probe sets may now have different numbers of probes mapping to them, so expression levels are measured with different precision. The minimum number of probes per probe set is three.
You cannot use gcrma(), but it seems this is supported with justGCRMA() (look at the cdfname argument in the documentation).

Minimum example

NOTE: It seems after checking that you can indeed use gcrma() directly.

As you can see in the output below, the probe set names are now the Entrez Gene ID plus the suffix "_at". You can get further annotations with the hgu133plus2hsentrezg.db package, also available from the BrainArray web site.

library(affy)
library(gcrma)

# Your CEL files (anchored pattern so that only *.CEL files are matched).
f <- list.files(pattern = "\\.CEL$", ignore.case = TRUE)

# Read in the data using the custom CDF.
abatch <- ReadAffy(
  filenames = f,
  cdfname = "HGU133Plus2_Hs_ENTREZG"
)

# RMA
e1 <- rma(abatch)
head(exprs(e1))
#                    S1       S2
# 1_at         4.926672 4.787678
# 10_at        5.405416 5.292615
# 100_at       5.617638 5.535115
# 1000_at      7.143342 7.069008
# 10000_at     5.884667 5.825513
# 100009676_at 6.202571 6.313520

# GCRMA (gcrma)
e2 <- gcrma(abatch)
head(exprs(e2))
#                    S1       S2
# 1_at         2.234352 2.234352
# 10_at        2.231861 2.231861
# 100_at       2.231861 2.231861
# 1000_at      3.794568 3.794568
# 10000_at     2.509149 2.509149
# 100009676_at 2.231861 2.231861

# GCRMA (justGCRMA; all in one step).
e3 <- justGCRMA(filenames = f, cdfname = "HGU133Plus2_Hs_ENTREZG")
head(exprs(e3))
#                    S1       S2
# 1_at         2.234352 2.234352
# 10_at        2.231861 2.231861
# 100_at       2.231861 2.231861
# 1000_at      3.794568 3.794568
# 10000_at     2.509149 2.509149
# 100009676_at 2.231861 2.231861
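
Since the probe set names are the Entrez Gene ID plus "_at", the IDs can be recovered by stripping the suffix. The following is a small base R sketch, not part of the original answer. The vector `ids` stands in for `rownames(exprs(e1))`, and the control probe set name `AFFX-BioB-5_at` is included on the assumption that the custom CDF also contains Affymetrix control probe sets, which do not follow the pattern.

```r
# Probe set names as shown in the output above, plus one control probe set
# name (assumed example) that is not an Entrez Gene ID.
ids <- c("1_at", "10_at", "100_at", "1000_at", "10000_at",
         "100009676_at", "AFFX-BioB-5_at")

# Keep only names of the form <digits>_at.
is_gene <- grepl("^[0-9]+_at$", ids)

# Strip the "_at" suffix to obtain Entrez Gene IDs as character strings.
entrez <- sub("_at$", "", ids[is_gene])
entrez
```

If the org.Hs.eg.db package is installed, `AnnotationDbi::mapIds(org.Hs.eg.db, keys = entrez, column = "SYMBOL", keytype = "ENTREZID")` is one way to attach gene symbols to these IDs.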