Question

How to filter out probesets and genes from an Affymetrix microarray expression data matrix?

0

5.8 years ago

Jurat Shahidin ▴ 100

Hi

I am new to microarray data and am currently using Affymetrix microarray expression data for my experiments. I have an expression matrix with Affymetrix probesets in rows (32830 probesets) and RNA samples in columns (735 samples). I also have pheno data containing the metadata for this expression matrix (735 sample identifiers in rows and 6 descriptive fields in columns).

initial attempt

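# Load the RMA-preprocessed expression matrix (eset_HTA20)
load("data/HTA20_RMA.RData")
# Median expression of each probeset across all samples
row_medArray <- Biobase::rowMedians(eset_HTA20)
# Relative log expression (RLE): subtract each probeset's median from its row
RLE_data <- base::sweep(eset_HTA20, 1, row_medArray)
RLE_data <- base::as.data.frame(RLE_data)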

my question:

I find the limma case study helpful but incomplete. I would like to work through the following steps:

how to summarize an Affymetrix microarray expression matrix at the gene level?
how to list out probeset intensities per gene?
how to filter out probesets and genes per sample?
how to add gene-level annotation to an Affymetrix expression set?

How can I carry out the steps above? Can anyone point out how to lay out this workflow in R? Thanks.

affymetrix microarray gene annotation • 3.9k views

ADD COMMENT • link updated 5.8 years ago by Kevin Blighe 89k • written 5.8 years ago by Jurat Shahidin ▴ 100

2

Very similar questions cross posted to Bioconductor:

ADD REPLY • link 5.8 years ago by Gordon Smyth ★ 7.9k

0

I should have avoided cross-posting; thanks for the reminder. Do I need to remove my post, or just keep this in mind next time?

ADD REPLY • link 5.8 years ago by Jurat Shahidin ▴ 100

0

IMO it is best not to remove the cross-posts at this stage because they've already been answered. Removing the duplicate questions would also remove people's answers.

ADD REPLY • link 5.8 years ago by Gordon Smyth ★ 7.9k

0

Thanks for your community effort; I will keep this in mind for future posts.

ADD REPLY • link 5.8 years ago by Jurat Shahidin ▴ 100

score 4 · Accepted Answer · 2019-06-19

4

5.8 years ago

Kevin Blighe 89k

how to summarize an Affymetrix microarray expression matrix at the gene level?

For summarisation, take a look at the target parameter that is passed to rma(). Two commonly used values are:

"core", will summarise to gene-level expression
"probeset", will summarise to probe-sets, usually meaning exons

The exact functioning of these parameters will depend on the microarray design - there are many layouts of probes and probe-sets. Your question seems generic for Affymetrix arrays (?)
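As a rough sketch of how target might be used, assuming the raw CEL files are available in a local directory: the function name summarise_cel and the argument cel_dir below are hypothetical, while read.celfiles() and rma() come from the oligo package. Note that rma() is applied to the object returned by read.celfiles(), not to a plain matrix of already-normalised values.

```r
# Sketch only: summarise raw CEL files with oligo's RMA implementation.
# 'summarise_cel' and 'cel_dir' are illustrative names, not part of any package.
summarise_cel <- function(cel_dir, target = c("core", "probeset")) {
  target <- match.arg(target)

  # oligo and Biobase are Bioconductor packages and must be installed.
  if (!requireNamespace("oligo", quietly = TRUE) ||
      !requireNamespace("Biobase", quietly = TRUE)) {
    stop("Bioconductor packages 'oligo' and 'Biobase' are required")
  }

  # Collect the CEL files (optionally gzipped) found in the directory.
  cel_files <- list.files(cel_dir, pattern = "\\.CEL(\\.gz)?$",
                          ignore.case = TRUE, full.names = TRUE)
  if (length(cel_files) == 0) {
    stop("no CEL files found in ", cel_dir)
  }

  # Read the raw intensities into a FeatureSet object.
  raw <- oligo::read.celfiles(cel_files)

  # Background-correct, normalise and summarise at the requested level.
  eset <- oligo::rma(raw, target = target)

  # Return the log2 expression matrix (features in rows, samples in columns).
  Biobase::exprs(eset)
}
```

Calling summarise_cel("path/to/cel", target = "core") would then return a gene-level matrix, provided the array design supports that target.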

See also my related answer here: C: Human Exon array probeset to gene-level expression

########

how to list out probeset intensities per gene?

See my comment above regarding the use of target.

########

how to filter out probesets and genes per sample?

You can filter before or after normalisation. Some people try to detect 'dud' probes (probes that failed) prior to normalisation and filter these out; most people (from my experience), however, just include all probes / probe-sets. Obviously, for the process of normalisation, you need to include control probes that are used for background correction, et cetera. On Affymetrix platforms, control probes usually begin with 'AFFX' - see here:

They may have other prefixes depending on the array type that you are using. For a complete picture, download the documentation from the Affymetrix / ThermoFisher website for the exact array type that you are using.

Another package is genefilter, which some use (I do not): https://bioconductor.org/packages/release/bioc/vignettes/genefilter/inst/doc/howtogenefilter.pdf
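A minimal base-R sketch of filtering after normalisation, on a toy matrix. It assumes the control probesets can be recognised by an 'AFFX' prefix in the row names, and it uses the coefficient of variation (computed on the un-logged scale) with a median cut-off purely as an illustrative variability filter; the object names and the threshold are assumptions, not a recommended standard.

```r
# Toy log2 expression matrix: probesets in rows, samples in columns.
set.seed(1)
expr <- matrix(
  rnorm(6 * 4, mean = 8, sd = 1), nrow = 6,
  dimnames = list(
    c("AFFX-ctrl1", "AFFX-ctrl2", "ps1", "ps2", "ps3", "ps4"),
    paste0("sample", 1:4)
  )
)

# 1. Drop control probesets by name prefix (assumed here to be 'AFFX').
is_control  <- grepl("^AFFX", rownames(expr), ignore.case = TRUE)
expr_noctrl <- expr[!is_control, , drop = FALSE]

# 2. Coefficient of variation per probeset. The data are log2 values,
#    so they are converted back to the linear scale first.
linear <- 2^expr_noctrl
cv <- apply(linear, 1, sd) / rowMeans(linear)

# 3. Keep the more variable half of the probesets (arbitrary cut-off).
keep <- cv >= median(cv)
expr_filtered <- expr_noctrl[keep, , drop = FALSE]
```

On a real matrix the same three steps apply unchanged; only the prefix and the cut-off would need to be chosen for the array type and the analysis at hand.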

########

how to add gene-level annotation to an Affymetrix expression set?

There are many ways, one being biomaRt: A: Affymetrix Human Genome U133 Plus 2.0 Array

There are also usually array-specific annotation packages in Bioconductor, at least for the commonly-used arrays, which you can use to annotate your ExpressionSet object. See here: https://support.bioconductor.org/p/63834/
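Whichever source the annotation comes from, the final step is usually a lookup of probeset IDs against an annotation table. A small base-R sketch follows, where the table annot and its column names PROBEID and SYMBOL are hypothetical stand-ins for whatever biomaRt or an annotation package returns:

```r
# Toy expression matrix: probesets in rows, samples in columns.
expr <- matrix(
  c(7.1, 6.8,
    9.3, 9.0,
    5.2, 5.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ps1", "ps2", "ps3"), c("sampleA", "sampleB"))
)

# Hypothetical annotation table; row order need not match the matrix,
# and some probesets may have no gene symbol (NA).
annot <- data.frame(
  PROBEID = c("ps3", "ps1", "ps2"),
  SYMBOL  = c("GENE_C", "GENE_A", NA),
  stringsAsFactors = FALSE
)

# Look up each probeset of the matrix in the annotation table.
idx <- match(rownames(expr), annot$PROBEID)

# Combine IDs, symbols and expression values in matrix row order.
annotated <- data.frame(
  PROBEID = rownames(expr),
  SYMBOL  = annot$SYMBOL[idx],
  expr,
  row.names = NULL, check.names = FALSE, stringsAsFactors = FALSE
)
```

Using match() rather than merge() keeps the rows in the same order as the expression matrix, which matters if the matrix is later passed to other functions alongside the annotation.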

Kevin

ADD COMMENT • link 4.5 years ago by Kevin Blighe 89k

0

Dear Kevin:

I just saved myself from a misunderstanding about what I am doing. I am still trying to generate density plots for each gene to see how well the preprocessed expression data were normalized. I think I can try the limma case study for now.

However, I am also interested in quantifying the Affymetrix expression data per gene with a statistical measure, the coefficient of variation, in order to discard genes that show little change in expression. Could you point out how to lay this out in R?

ADD REPLY • link 5.8 years ago by Jurat Shahidin ▴ 100

0

Would you mind taking a look at the data source that I am experimenting with?

Hey Jurat, sorry, we are 'just' a group of volunteers here.

...but, anyway, why are you using correlation to find the top 10 most expressed genes? Just normalise your raw data (CEL files) in the standard way using RMA, and then transform to Z-scale. Then, the gene(s) with the highest Z-scores can be assumed to be the highest expressed.

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Hi Kevin:

In my case, the raw CEL files are pretty big (around 40 GB), so I am just using the preprocessed Affymetrix gene expression data, on which I can't do the transformation. What can I do with a preprocessed Affymetrix gene expression data matrix? I followed the limma user guide, but some steps are not applicable. Could you point out how to correct my approach? Thank you.

When I tried to summarize the expression data this way, I got the following error:

> oligo::rma(eset_HTA20, target = "core", normalize = FALSE)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘rma’ for signature ‘"matrix"’

why did this happen?

ADD REPLY • link 5.8 years ago by Jurat Shahidin ▴ 100

0

Which file did you obtain? - the Rdata file? That will contain the normalised, log2 expression data. You can use that for Limma and anything else downstream. You can also transform that to Z-scores.
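A minimal sketch of the Z-score transformation on an already-normalised log2 matrix, assuming that "Z-scores" here means scaling each gene (row) across samples to mean 0 and standard deviation 1; the toy values and object names are illustrative only.

```r
# Toy log2 expression matrix: genes in rows, samples in columns.
expr <- matrix(
  c( 2,  4,  6,
    10, 10, 13,
     5,  7,  9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene3"),
                  c("sampleA", "sampleB", "sampleC"))
)

# scale() standardises columns, so transpose to put genes in columns,
# scale, then transpose back to the original orientation.
z <- t(scale(t(expr)))
```

After this step every row of z has mean 0 and standard deviation 1, so values are comparable across genes in terms of deviation from each gene's own average.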

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Hi Kevin:

Thanks for your response. Yes, I used the Rdata file and tried to summarize the expression data at the gene level and filter out probesets and genes, but did not get the correct result. I want to filter out genes because I want to run PCA on this expression data. I tried prcomp() for PCA, but it is not giving me satisfying results. How can I make this happen? Thanks again for your help, much appreciated.

ADD REPLY • link 5.8 years ago by Jurat Shahidin ▴ 100

0

So, in this Rdata file, are the expression values per gene or per exon? If they are per exon, you can summarise (by mean or median) to gene-level via the aggregate() function.
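A small sketch of that aggregate() step on toy data, assuming you already have a vector giving the gene for each probeset row (the vector gene and the matrix values below are made up for illustration):

```r
# Toy probeset-level log2 matrix: probesets in rows, samples in columns.
expr <- matrix(
  c( 6,  8,
    10, 12,
     5,  5,
     7,  9),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ps1", "ps2", "ps3", "ps4"), c("sampleA", "sampleB"))
)

# Hypothetical probeset-to-gene mapping, in the same order as the rows.
gene <- c("GENE_A", "GENE_A", "GENE_B", "GENE_B")

# Collapse probesets belonging to the same gene by their median, per sample.
gene_df <- aggregate(expr, by = list(gene = gene), FUN = median)

# Convert back to a numeric matrix with genes as row names.
gene_expr <- as.matrix(gene_df[, -1, drop = FALSE])
rownames(gene_expr) <- gene_df$gene
```

Swapping FUN = median for FUN = mean gives the mean-based summary instead; the median is less sensitive to a single outlying probeset.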

Can you define what you regard as 'satisfying' results by PCA?

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k