Question

Human Exon array probeset to gene-level expression

5

Entering edit mode

7.6 years ago

mforde84 ★ 1.4k

Hi,

I'm working with some Affy HuEx arrays.

I'm able to do all of the necessary backgrounding, normalization, and summarization steps for the arrays, however I'm a little stuck on what's a reasonable way to collapse transcript level probesets to gene-level estimates. Should I do something as simple as aggregate by mean / median? Or is there a more robust way to estimate general abundance of a gene from these arrays?

Marty

affy microarray • 8.0k views

ADD COMMENT • link updated 3.3 years ago by michael.j.apostolides ▴ 50 • written 7.6 years ago by mforde84 ★ 1.4k

0

Entering edit mode

So anyone have any ideas?

ADD REPLY • link 7.6 years ago by mforde84 ★ 1.4k

score 6 · Answer 1 · 2017-09-08

6

Entering edit mode

7.6 years ago

Kevin Blighe 89k

Hey Marty,

I've analysed data from this chip recently (2016). Are you using the limma and oligo packages in R in order to process these?

When you get to the rma() function, you can specify there how you want to summarise the expression values:

Summarised per gene

rma(project, background=TRUE, normalize=TRUE, target="core")

If your array is an 'Exon' array, this may still only summarise to exon-level, requiring you to perform a further [manual] summarisation to gene-level. There are also other options available specifically for these 'Exon' arrays:

target = ’full'
target = ’extended’

Summarised per probe set:

rma(project, background=TRUE, normalize=TRUE, target="probeset")

Probe sets target different things, depending on the array type.

------------------------------------

NB - the behaviour of target will greatly depend on the array type!

Kevin

ADD COMMENT • link 5.8 years ago by Kevin Blighe 89k

2

Entering edit mode

Hi Kevin Blighe ,

How do I convert probe-level to gene-level expression values? I want a single expression value per gene. Currently the data I have looks like this:

1001_at       172.3406     136.3320
1002_f_at     132.2994     135.2457
1003_s_at    1344.5813    1930.1408
1004_at       388.5376     569.8755

There are these probe IDs, such as "1001_at" which is not useful to me, and as I understand it, there can be more than one probe per gene. What I really want is ENSG IDs and corresponding expression numbers. How do I get this?

EDIT: I realize that I am using the "affy" package, and that many of the answers use the "oligo" package, which is probably why target='core' isn't working for me

EDIT: I realize now based on reading oligo::rma documentation that I am dealing with an ExpressionFeatureSet object and that target="core" only works for Exon/Gene arrays. Is there any way to get gene-level info from an ExpressionFeatureSet object ?

EDIT: I think my data might be "old data", although it is not obvious to me how to find out exactly what "old" versus "new" Array data is. So far I have done the following:

# Read in CEL files
celpath <- "/path/to/CEL-files/"
data <- ReadAffy(celfile.path=celpath)
# Normalize data
eset <- affy::rma(data)

data@cdfName
[1] "HT_HG-U133A"

I obtained the CEL files from here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26312

EDIT: I have found a way to get most of the names for the probe IDs [https://www.researchgate.net/post/How-can-i-convert-Affymetrix-Probes-To-Gene-symbol ]

library(Biobase)
library(GEOquery)
library(annotate)
library(hgu133a.db)

probes <- rownames(eset)
annot <- AnnotationDbi::select(hgu133a.db, probes, c("SYMBOL","ENTREZID", "GENENAME"))

I realize now that my platform was [HT_HG-U133A] Affymetrix HT Human Genome U133A Array based on my GEO link and the internal data@cdfName variable in my loaded data.

This allowed me to find the appropriate annotation database for my probes (in my case hgu133a.db): http://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData

ADD REPLY • link 3.3 years ago by michael.j.apostolides ▴ 50

1

Entering edit mode

This is of great help. If target isn't provided RMA by default normalizes, summarizes based on gene-level. ?rma doesn't provide any information either on this. Thanks.

ADD REPLY • link 7.0 years ago by Bioinformatics_NewComer ▴ 330

0

Entering edit mode

Yes this is one of fundamental and critical things that is surprisingly not mentioned in many of the tutorials! I cannot remember how I figured it out, but it was a long time ago.

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

1

Entering edit mode

Yes, I ran analyses with target as "core" and without anything assuming without target RMA will do probeset. I looked at the row counts for both results and everything was same. wasted many hours to make sure I did everything OK before stumbling onto this thread. :)

ADD REPLY • link 7.0 years ago by Bioinformatics_NewComer ▴ 330

0

Entering edit mode

thanks kevin. what would you suggest for collapsing multimapped probesets? aggregate by mean / median, or something different?

ADD REPLY • link 7.6 years ago by mforde84 ★ 1.4k

1

Entering edit mode

I may have missed exactly what you mean by 'multimapped probesets'? During RMA normalisation, a Tukey's 'median polish' is applied using information from all probesets. Are you just referring to summarising transcript isoforms into a single expression value for each gene? - in the past, I have always used the mean in this case and results have been as expected.

ADD REPLY • link 7.6 years ago by Kevin Blighe 89k

0

Entering edit mode

my mistake, but yes just summarizing transcripts to gene level. i checked both the results for core and probeset summarization options. while im using SCAN.UPC instead, they are both appear to be generating estimates for transcripts. So to convert to entrez id, we simply select using key types matching in the transcriptcluster.db, then aggregate by entrez id to get gene level estimates.

thanks for the help.

ADD REPLY • link 7.6 years ago by mforde84 ★ 1.4k

0

Entering edit mode

I am trying to summarize on gene level with target=core, but get this error: Error in .local(object, ...) : unused argument (target = "core"), even when specifying oligo::rma( Do you know why is this happens?

ADD REPLY • link 5.5 years ago by salamandra ▴ 550

0

Entering edit mode

Which array are you using? are you specifying:

rma(..., target = core)

...or

rma(..., target = 'core')

???

ADD REPLY • link 5.5 years ago by Kevin Blighe 89k

0

Entering edit mode

I tried both: array is Affymetrix Human Genome U133 Plus 2.0 Array from this study: GSE28799 Code is:

library(oligo) 
library(GEOquery)
platform = 'HG-U133_Plus_2' # specific affymetrix microarray platform
dataDir <- 'D:/Tania_2/PhD/CSCs/data/' 
studyId <- 'study10' 
GEOId <- 'GSE28799'# GEO Id of the study/series
GPLId <- 'GPL570'


treatment1 <- 'CNSCs'
treatment2 <- 'CSCs'

treatment1Samples <- c('GSM713222', 'GSM713223', 'GSM713224') 
treatment2Samples <- c('GSM713227', 'GSM713228', 'GSM713229') 
sampleNames <- c(treatment1Samples, treatment2Samples) 
studyDir <- paste0(dataDir,studyId) # path of directory of specific study
dir.create(file.path(studyDir), suppressWarnings('already exists')) # creates a new directory for study
getGEOSuppFiles(GEOId, baseDir = studyDir) # get .tar file with study raw data (it's a supplementary file at GEO page)
tarfile <- paste0(studyDir, '/', GEOId,'/',GEOId,'_RAW.tar') # path to tarfile
FilesInside <- untar(tarfile, list = T) # list file names inside tarfile,have to set list = T to only list them
CELfiles <- grep('CEL', FilesInside, value = T) # select names of CEL files only. 
SampleCELfiles <- CELfiles[grepl(paste(sampleNames,collapse='|'), CELfiles)] # select names of CEL files of the samples we want to analyze only
untar(tarfile, files = SampleCELfiles, exdir = paste0(studyDir,'/', GEOId, '/CEL')) # untar CEL files of samples we want to work with into subdirectory /CEL
pathSampleCELfiles <- paste0(studyDir, '/', GEOId,'/CEL/', SampleCELfiles) # path of CEL files of samples we want to work with
rawData <- read.celfiles(pathSampleCELfiles) # read CEL files of samples we want to work with.
normData <- oligo::rma(rawData, background=TRUE, normalize=TRUE, target=core)

ADD REPLY • link 5.5 years ago by salamandra ▴ 550

1

Entering edit mode

Ah, the issue is the array type, i.e., the U133. The target functionality only works for certain Affymetrix array designs, i.e., those that have a 'Gene' or 'Exon' in the name, and also usually have 'ST', reflecting the different probe design / layout.

So, you have to run rma() without target.

I can provide more information on the differences between these 2 broad classes of Affy array designs, if you wish.

If you wish to summarise over transcripts after you normalise with rma(), you can do something rudimentary like using aggregate() (base R) or avereps() (limma)

ADD REPLY • link 5.5 years ago by Kevin Blighe 89k

0

Entering edit mode

could you then provide that info please? so in this case, by doing rma() automatically summarizes by transcript instead of gene is that it?

ADD REPLY • link 5.5 years ago by salamandra ▴ 550

2

Entering edit mode

The ST 'Exon' and 'Gene' arrays differ from the U133 and earlier arrays as follows:

mismatch (MM) probes are mostly absent
to accommodate more targets, feature size reduced from 121µm^2 to 25µm^2
target entire length of gene, whereas U133 mainly targets 3` only
manufacturer-supplied annotations differ

The absence of MM probes affects downstream processing and means, for example, that the affy Bioconductor package cannot process ST arrays. Also, some normalisation (e.g. mas5) and QC methods that use perfect match (PM) and MM probes cannot be used with these arrays.

The fact that the U133 arrays only target a certain part of the mRNA (third point above) additionally means that probe summarisation via target is not required. You may still see duplicate gene entries after you translate the probe IDs to gene symbols, though.

ADD REPLY • link 5.5 years ago by Kevin Blighe 89k

0

Entering edit mode

even after translating probe IDs to ensembl/entrez gene ids..thank you

ADD REPLY • link 5.5 years ago by salamandra ▴ 550

0

Entering edit mode

I saw in another post that one probe-set targets one exon. So, does it mean that two probe-sets may be targeting two different exons of same gene, and that's why we get even after summarizing a U133 array different probe set ids corresponding to same ensembl gene id? Or is there only one gene per probe-set and we get same ensembl gene id for two probe-sets cause of annotation problems?

ADD REPLY • link 5.4 years ago by salamandra ▴ 550

1

Entering edit mode

I think that they may be targeting different transcript isoforms of the same gene, but cannot confirm. The UCSC genome browser actually stores Affymetrix probe IDs, so, you can paste in a probe ID to see what it targets. This may help to visualise better these probes that target the same gene. I have never done a comprehensive analysis / review of this, though... I'm not sure anybody has due to the fact that each array design differs from the next.

ADD REPLY • link 5.4 years ago by Kevin Blighe 89k

0

Entering edit mode

If they are different isoforms, doing an average of two might not be appropriate cause one might no be expressed and that brings gene expression down..

ADD REPLY • link 5.4 years ago by salamandra ▴ 550

0

Entering edit mode

Yes, of course, in which case you will have to write some code to check for these situations in which the expression is so low for one probe such that it is negligible.

ADD REPLY • link 5.4 years ago by Kevin Blighe 89k

0

Entering edit mode

From GeneAnnot site we can see that different probe sets that target same gene have variable numbers of same probes. For example for gene MAPK1 there is: 1552264_a_at probe set has probes 1 to 11, while 224621_at probe set has probes 1 to 3 and 8 to 11. For same probe set, different combinations of probes identify different transcripts, so the same probe set targets different transcript isoforms.

ADD REPLY • link 5.4 years ago by salamandra ▴ 550

0

Entering edit mode

That looks like a bit of a mess... but this, frequently, is how array designs appear when one looks deep into them.

The UCSC Genome Browser indicates that 224621_at targets the terminal exon, while the other (1552264_a_at) targets across the gene.

ADD REPLY • link 5.4 years ago by Kevin Blighe 89k