Hi all,
I have RNAseq data from 3 biological replicates but only have log2 values for gene expression quantification.
I was wondering if there are any "cons" to using log2 values for calculating the standard deviation? Converting all expression values to absolute values using the Ln2 function is easy enough, but when I calculate the SD using log2 vs base values, I always end up with different values. Surely this can't be right?
Does anyone have a preference for using base mean values over log2 values for this type of analysis?
Many thanks in advance,
Ricky
PS: When comparing log2 and base mean values, I do convert them into Ln2 and/or log2 to compare them both on the same scale, i.e. I am not comparing the SD of Ln2 values with that of log2 values.
I don't clearly understand what you are doing. But the SD will certainly be different if you "scale" the data differently. By taking the log, you are putting the data on a logarithmic scale.
Standard deviation is a measure of the spread of the data around its mean. Taking the log compresses large values, and with them the spread and consequently the standard deviation.
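To make the point above concrete, here is a minimal sketch using made-up expression values for one gene in 3 replicates (the numbers are purely illustrative, not from the original data). It shows that the SD on the raw scale and the SD on the log2 scale are different quantities, and that back-transforming the log2-scale SD does not give the raw-scale SD.

```python
import math
import statistics

# Hypothetical normalised expression values for one gene (3 replicates).
raw = [120.0, 150.0, 200.0]
log2_values = [math.log2(x) for x in raw]

sd_raw = statistics.stdev(raw)            # spread in expression units
sd_log2 = statistics.stdev(log2_values)   # spread in log2 units (doublings)

# Raising 2 to the log2-scale SD gives a fold-change-like number,
# not the SD of the raw values.
back_transformed = 2 ** sd_log2

print(f"SD raw: {sd_raw:.2f}")
print(f"SD log2: {sd_log2:.3f}")
print(f"2 ** SD log2: {back_transformed:.3f}")
```

So getting different numbers from the two scales is expected, not a sign of a calculation error.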
Thank you Santosh for your reply.
What I wanted to know is whether there is any advantage in using log2 over Ln2 values, considering negative values etc. Although SD is based on the spread of the data, it would be the same with log2 or Ln2 values.
I was just wondering if using one had an advantage over the other for RNAseq data.
Again, the SD will change depending on whether you take log2 or Ln. This is again because a scaling is involved: log2(x) = Ln(x) * log2(e).
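Because log2(x) is just Ln(x) multiplied by the constant log2(e) (about 1.4427), the SD scales by that same constant. A quick check, using made-up values that are only for illustration:

```python
import math
import statistics

# Hypothetical positive expression values, for illustration only.
values = [120.0, 150.0, 200.0]

sd_ln = statistics.stdev([math.log(x) for x in values])     # natural log
sd_log2 = statistics.stdev([math.log2(x) for x in values])  # log base 2

ratio = sd_log2 / sd_ln
print(f"SD(log2) / SD(ln) = {ratio:.4f}")
print(f"log2(e)           = {math.log2(math.e):.4f}")
```

The two SDs differ only by a constant factor, so neither base carries more information than the other; log2 is often more convenient because one unit corresponds to a doubling.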
...considering negative values etc.
I'm not sure what you intended to say. The log can be taken only of positive numbers (not even zero). If counts are involved, one way to avoid taking the log of zero is to add a small number (say 1) to all the quantities before taking the log. This avoids taking the log of zeroes, while changing the log of the larger numbers very little because the added number is small.
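A small sketch of the pseudocount idea described above, with made-up counts. The pseudocount of 1 follows the suggestion in this reply; other values are possible. Note that the shift is negligible for large counts but noticeable for small ones.

```python
import math

# Hypothetical counts, including a zero that cannot be logged directly.
counts = [0, 3, 10, 250, 4000]
pseudocount = 1  # value suggested in the reply; an illustrative choice

log2_counts = [math.log2(c + pseudocount) for c in counts]

# How much the pseudocount moves the log2 value of each non-zero count.
shifts = {c: math.log2(c + pseudocount) - math.log2(c) for c in counts if c > 0}

for c, v in zip(counts, log2_counts):
    print(f"count={c:5d}  log2(count + 1)={v:.3f}")
for c, s in shifts.items():
    print(f"count={c:5d}  shift caused by pseudocount={s:.4f}")
```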
OK, thanks for the reply. However, I'm still confused (sorry, blame my lack of experience with big data sets and stats, as bioinformatics is quite new to me, and also my inability to explain myself correctly).
When using the normal counts (i.e. not log2-converted data) for my expression data, I do not get a normal distribution, as most data points are around 0, and therefore I cannot use SD, t-test etc. BTW, I have removed all normalised expression values of 0.
However, when using the log2 data, I get a "normal-like" distribution where the centre of the curve is skewed to the right. Also, any expression value below 1 is going to be a negative value, hence my hesitation about using SD and therefore t-tests etc.
To sum up, normal (not log2) values do not have a normal distribution, and so I can't use SD, t-tests etc for my data. I have been using a Wilcoxon rank test so far for p values.
Log2 converted values have a normal-like distribution, but have negative values, and so I don't know if I can use SD, t-tests etc with these values.
I need to know which statistical test I should use with each dataset (non-log2 and/or log2). Can I use SD, t-test etc with the log2 expression values even though there are negative values?
Thanks
You are mixing up a lot of things. I would strongly suggest you brush up on basic statistical concepts, which will help you in the long run in this field. https://www.openintro.org/stat/textbook.php
To sum up, normal (not log2) values, do not have a normal distribution
and so I cant use SD, t-tests etc for my data
SD is just a measure of the spread of the data, and it can be computed for any kind of data (not only for normally distributed data).
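As an illustration of that point, the sketch below computes the SD of made-up log2 expression values that include negative numbers. It also shows that adding a constant to every value (so that nothing is negative) leaves the SD unchanged, so the sign of the values is not a problem in itself.

```python
import math
import statistics

# Hypothetical log2 expression values; values below 0 correspond to
# expression below 1 on the raw scale.
log2_expr = [-1.2, -0.4, 0.3]

sd = statistics.stdev(log2_expr)

# Shift everything up by 10 so all values are positive.
shifted = [v + 10 for v in log2_expr]
sd_shifted = statistics.stdev(shifted)

print(f"SD with negative values: {sd:.4f}")
print(f"SD after shifting by 10: {sd_shifted:.4f}")
```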
I have been using a wilcoxon rank test so far for p values.
The Wilcoxon rank-sum test is a non-parametric test: it does not assume a particular form for the underlying distribution (unlike the t-test, which assumes approximately normally distributed data). So it can be used with any kind of distribution, even normal ones.
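One practical caveat when the comparison is between two groups of 3 replicates each (an assumption about the design here): the rank-sum test only looks at ranks, so with so few samples there are very few possible outcomes. A minimal sketch of an exact two-sided rank-sum test, written from scratch with a hypothetical helper `rank_sum_exact_p` and made-up data with no ties:

```python
from itertools import combinations


def rank_sum_exact_p(group_a, group_b):
    """Exact two-sided rank-sum p-value. Assumes no tied values."""
    pooled = sorted(group_a + group_b)
    rank_of = {value: i + 1 for i, value in enumerate(pooled)}
    n_a, n_total = len(group_a), len(pooled)

    observed = sum(rank_of[v] for v in group_a)
    expected = n_a * (n_total + 1) / 2  # mean rank sum under the null

    # Every way of assigning n_a of the ranks to group A is equally likely
    # under the null hypothesis.
    all_sums = [sum(c) for c in combinations(range(1, n_total + 1), n_a)]
    as_extreme = sum(
        1 for s in all_sums if abs(s - expected) >= abs(observed - expected)
    )
    return as_extreme / len(all_sums)


# Hypothetical log2 expression values, completely separated groups.
control = [5.1, 5.4, 5.0]
treated = [7.9, 8.3, 8.1]

p_value = rank_sum_exact_p(control, treated)
print(p_value)
```

Even with perfectly separated groups, only 2 of the 20 possible rank assignments are this extreme, so the two-sided p-value cannot go below 0.1 for a 3 vs 3 comparison.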
Log2 converted values have a normal-like distribution, but have negative values, and so I don't know if I can use SD, t-tests etc with these values.