Gene Expression - Check Multivariate Normal In R?
14.1 years ago

Hi,

I have a gene expression microarray dataset with dimensionality 427 x ~40,000.

I wish to test whether this data follows a multivariate normal distribution. In R, the mshapiro.test() function in the mvnormtest library (Shapiro-Wilk test) only accepts vectors of at most 5000 entries.

I also tried the squared Mahalanobis distance (plotted on a QQ-plot, it should follow a chi-squared distribution if the data are multivariate normal). However, this requires computing a covariance matrix, which is not feasible for a data set this large (or wide).
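For reference, the squared-Mahalanobis check described above can be sketched on a random subset of probesets, where the sample covariance matrix is still invertible. This is only an illustration on simulated data; the names `expr`, `idx` and the subset size `p` are assumptions, and a subset check says nothing definitive about the full ~40,000-dimensional distribution.

```r
# Sketch: squared Mahalanobis distances vs. chi-squared quantiles
# on a random subset of probesets (simulated data, hypothetical names).
set.seed(1)
n     <- 427    # samples (arrays), as in the question
p_all <- 2000   # stand-in for the ~40,000 probesets
expr  <- matrix(rnorm(n * p_all), nrow = n, ncol = p_all)

p   <- 20                       # subset size; must be well below n
idx <- sample(ncol(expr), p)    # random probesets
x   <- expr[, idx]

# Squared Mahalanobis distance of each sample from the subset centroid
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under multivariate normality, d2 is approximately chi-squared with p df
qq <- data.frame(
  theoretical = qchisq(ppoints(n), df = p),
  observed    = sort(d2)
)

# Points close to the diagonal are consistent with normality
if (interactive()) {
  plot(qq, main = "Chi-squared QQ-plot of squared Mahalanobis distances")
  abline(0, 1)
}
```

Repeating this over several random subsets gives a rough picture without ever forming the full covariance matrix.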

Do you guys have any suggestions for alternative tests of multivariate normality for a large dataset, preferably but not necessarily in R?

Regards, S ;-)

r statistics microarray • 7.8k views

I doubt that the calculation of SW makes sense for the whole data-set. I will try to explain this in an answer later.

14.1 years ago

With really big datasets, very small deviations from a Gaussian can be statistically significant even though the t-test is tolerant of them. That said, the increased sensitivity of parametric tests may not matter with such large datasets. Hence the KS test is usually my first choice with this sort of data.
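A minimal sketch of the sample-size effect, using base R's `ks.test()` on simulated data. The helper `ks_p` and the choice of a t distribution with 10 degrees of freedom as a "nearly Gaussian" example are assumptions for illustration. Note that standardizing with the sample mean and standard deviation makes the nominal KS p-value conservative, so treat the numbers as indicative only.

```r
# Sketch: the same mild deviation from normality, tested at two sample sizes.
set.seed(2)

# Hypothetical helper: standardize, then KS test against the standard normal
ks_p <- function(v) {
  z <- (v - mean(v)) / sd(v)
  ks.test(z, "pnorm")$p.value
}

small <- rt(200,    df = 10)   # slightly heavy-tailed, few observations
large <- rt(200000, df = 10)   # same distribution, many observations

p_small <- ks_p(small)
p_large <- ks_p(large)

c(n_200 = p_small, n_200000 = p_large)
```

Comparing the two p-values shows how the verdict of a normality test can depend more on the number of observations than on how large the deviation is.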

But to answer your question, the D'Agostino-Pearson test could perhaps be used, see here

14.1 years ago
Michael 55k

Possibly you are thinking about doing a study like this? http://bioinformatics.oxfordjournals.org/content/19/17/2254.abstract

I assume you have >400 microarrays (MA) with 40,000 probesets (how many replicates per condition?).

Note that you should calculate the SW test per probeset/gene under the same experimental condition. Then you can estimate what proportion of your genes have an error distribution significantly different from normal.
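A rough sketch of the per-probeset approach with base R's `shapiro.test()`. The data are simulated, and the matrix layout (probesets in rows, arrays in columns), the `condition` factor and the BH adjustment are all assumptions rather than part of the original suggestion.

```r
# Sketch: Shapiro-Wilk test per probeset within one experimental condition.
set.seed(3)
n_genes  <- 1000   # stand-in for ~40,000 probesets
n_arrays <- 60     # stand-in for >400 arrays

expr <- matrix(rnorm(n_genes * n_arrays), nrow = n_genes,
               dimnames = list(paste0("probe", seq_len(n_genes)), NULL))
condition <- factor(rep(c("control", "treated"), each = n_arrays / 2))

# Restrict to arrays from a single condition (needs 3 to 5000 arrays)
in_cond <- condition == "control"

# One p-value per probeset
sw_p <- apply(expr[, in_cond], 1, function(v) shapiro.test(v)$p.value)

# Adjust for testing many probesets, then summarize
sw_adj         <- p.adjust(sw_p, method = "BH")
prop_nonnormal <- mean(sw_adj < 0.05)
prop_nonnormal
```

The same loop can be run for each level of `condition` to see whether the proportion differs between groups.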

The data on a single microarray is highly unlikely to be normal anyway, because it contains genes at different expression levels: consider, for example, a proportion of up-regulated, down-regulated and "0"-regulated genes. Even if each of these populations were normal in itself, the resulting mixture of Gaussians would not be.
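The mixture argument can be illustrated with a small simulation. The group proportions, means and standard deviation below are made-up values chosen to make the effect obvious, not estimates from any real array.

```r
# Sketch: three normal populations whose mixture is clearly not normal.
set.seed(4)
n <- 3000

# Hypothetical split into down-, "0"- and up-regulated genes
group <- sample(c("down", "zero", "up"), n, replace = TRUE,
                prob = c(0.15, 0.70, 0.15))
mu <- c(down = -4, zero = 0, up = 4)

# Each gene's value is drawn from the normal distribution of its group
values <- rnorm(n, mean = mu[group], sd = 0.5)

# Shapiro-Wilk within each group, and on the pooled mixture
p_by_group <- tapply(values, group, function(v) shapiro.test(v)$p.value)
p_mixture  <- shapiro.test(values)$p.value

p_by_group
p_mixture
```

Each group is drawn from a normal distribution by construction, while the pooled values are trimodal, which is why testing all probes on one array together is not informative.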

What you then do with the results is another story. They can be used to determine whether a parametric test is applicable or not.


If you have so many replicates, Wilcoxon's rank-sum test should have sufficient power for a two-sample comparison. And then you won't rely on any normality assumption.
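A minimal sketch of such a two-sample comparison for a single gene with base R's `wilcox.test()`. The data are simulated from a skewed distribution, and the group sizes and shift are arbitrary assumptions.

```r
# Sketch: Wilcoxon rank-sum test for one gene, two groups, non-normal data.
set.seed(5)
n_per_group <- 200

control <- rexp(n_per_group)        # skewed, clearly non-normal
treated <- rexp(n_per_group) + 1    # same shape, shifted by 1

res <- wilcox.test(treated, control)
res$p.value
```

For a whole expression matrix the same call can be wrapped in `apply()` over probesets, followed by a multiple-testing adjustment such as `p.adjust()`.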


Yes, I definitely accept your reasoning. I wish to test this merely as a diagnostic before implementing further analyses using ranks alone. Personally, I find non-parametric methods very dissatisfying, but alas, such is data.

14.1 years ago
Neilfws 49k

Good statistical advice in the answers above. For those looking for tests of multivariate normality without the restrictions of mvnormtest, here are some options from the CRAN Task View: Multivariate Statistics.

I've tested them on a small matrix (22283 x 6); note that the methods in the energy package can take a very long time to run.
