Hi, I have analyzed metagenomic (WGS) data with the MetaPhlAn pipeline, which gives the relative abundance (out of 100) of each taxon. I have two groups of samples: control and test. I want to compute the mean, standard error (SE), and sample number (N) for the control and test groups. My data are not normally distributed, so I want to log-transform them. To do that, I used the following function to transform my dataset:
mk_logit <- function(x) log(x)
But, as my dataset is zero-inflated, all of the zeros (0) were log-transformed into -Inf. When these values were used for the mean and SD calculations, most of the results came out as NaN or Inf.
As a result, I am not getting usable results.
Can anyone please suggest a solution to this problem?
Thanks
You have what's called compositional data. Compositional data need specific treatment, as detailed in the book "Statistical analysis of compositional data" by John Aitchison. In short, to be able to use standard methods, one needs to preprocess the data with the additive log-ratio transformation. To deal with the 0s, instead of the standard logarithm you can use a generalized logarithm function such as the inverse hyperbolic sine (asinh in R), which is defined at 0. You may also want to read the paper "Microbiome Datasets Are Compositional: And This Is Not Optional".
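As a rough illustration of the asinh suggestion, here is a minimal sketch in base R. The abundance values, the group labels, and the helper name summarise_group are all made up for the example and are not from the original data. Note that asinh only takes care of the zeros; it is not a log-ratio transformation, so it does not by itself address the compositional nature of the data.

```r
# Hypothetical relative abundances (out of 100) for a single taxon.
# These numbers are invented purely for illustration.
abund <- c(0, 0, 1.5, 12.0, 0, 0.3, 0, 4.2)
group <- factor(c("control", "control", "control", "control",
                  "test", "test", "test", "test"))

# The plain logarithm sends every zero to -Inf, which then
# propagates into mean() and sd().
log_vals <- log(abund)

# asinh(x) = log(x + sqrt(x^2 + 1)) is defined at 0 (asinh(0) is 0)
# and approaches log(2 * x) as x grows.
abund_t <- asinh(abund)

# Mean, standard error and N on the transformed scale.
# summarise_group is a hypothetical helper, not part of any package.
summarise_group <- function(x) {
  n <- length(x)
  c(mean = mean(x), se = sd(x) / sqrt(n), n = n)
}

# One row per group, with columns mean, se and n.
res <- t(sapply(split(abund_t, group), summarise_group))
print(res)
```

The summaries are on the asinh scale, so they should be compared between groups on that scale rather than read as raw relative abundances.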
Thanks a lot, Jean-Karim Heriche, for your response. I will take a look at the article.