Dear community,
I have a few questions about proteomic data analysis. I’m analyzing results from StPeter (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5891225/), a protein-level label-free quantification tool for MS/MS data, with an FDR of 1%.
I have two groups: control and mut (double knockouts). Each group consists of three samples. For the downstream analysis, I keep only the following detected/quantified proteins:
- Detected in all six samples;
- Detected in all three control samples and in none of the mut samples;
- Detected in all three mut samples and in none of the control samples.
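To make the three filter criteria concrete, here is a minimal Python sketch. The function name `classify_protein`, the category labels, and the example values are hypothetical and only illustrate the rule; it assumes controls come first in each row and that `None` marks a sample in which the protein was not detected.

```python
from typing import Dict, List, Optional, Sequence


def classify_protein(values: Sequence[Optional[float]],
                     n_control: int = 3) -> Optional[str]:
    """Return the filter category of one protein, or None if it is discarded.

    `values` holds one quantity per sample, controls first.
    None marks a sample in which the protein was not detected.
    """
    control = [v is not None for v in values[:n_control]]
    mut = [v is not None for v in values[n_control:]]
    if all(control) and all(mut):
        return "both"            # detected in all six samples
    if all(control) and not any(mut):
        return "control_only"    # all controls, no mut sample
    if all(mut) and not any(control):
        return "mut_only"        # all mut samples, no control
    return None                  # partial detection: filtered out


# Hypothetical example values (not real data).
proteins: Dict[str, List[Optional[float]]] = {
    "P1": [1.2, 1.1, 1.3, 0.9, 1.0, 1.1],
    "P2": [0.5, 0.4, 0.6, None, None, None],
    "P3": [None, None, None, 2.0, 2.2, 1.9],
    "P4": [1.0, None, 1.1, 0.8, 0.9, None],
}

kept: Dict[str, str] = {}
for name, vals in proteins.items():
    category = classify_protein(vals)
    if category is not None:
        kept[name] = category

print(kept)
```

In this toy input, P4 is dropped because it is missing in one control and one mut sample.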
Some proteins are quantified at 30 to 60 ng, while others are only at 0.01 to 1 ng. After filtering, approximately 1,700 proteins remain. About 150 of them are unique to the three control samples, and another 150 are unique to the three mut samples.
Before statistical analysis, I applied log2 transformation to the data.
Question 1: Do I need to perform any additional data processing (QC, normalization, etc.) besides StPeter’s built-in normalization? StPeter’s primary unit of quantification is the Spectral Index, which extends spectral counting by integrating fragment ion peak intensities. Protein quantities are normalized by protein length and by the total spectral index abundance across the entire sample.
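As a rough illustration of the normalization described above followed by the log2 step, here is a short Python sketch. It is not StPeter's actual code: the function name `normalized_log2`, the input layout, and the numbers are assumptions made for the example.

```python
import math
from typing import Dict


def normalized_log2(spectral_index: Dict[str, float],
                    length: Dict[str, int]) -> Dict[str, float]:
    """Divide each protein's spectral index by the sample total and by the
    protein length, then take log2. Proteins with a zero or missing index
    are skipped, because log2(0) is undefined."""
    total = sum(spectral_index.values())
    return {
        protein: math.log2(si / total / length[protein])
        for protein, si in spectral_index.items()
        if si > 0
    }


# Hypothetical single-sample input: raw spectral indices and protein lengths.
si_sample = {"A": 8.0, "B": 8.0}
lengths = {"A": 128, "B": 512}

result = normalized_log2(si_sample, lengths)
print(result)
```

Note that proteins absent from a sample have no defined log2 value, so the group-unique proteins need separate handling (for example as missing values) rather than being entered as zeros.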
Question 2: What statistical analysis should I use? The proteins detected in both groups are approximately normally distributed, but including the proteins unique to one group makes the whole dataset non-normal. When I apply non-parametric tests to the entire dataset, they don’t identify any statistically significant differences between controls and muts, not even for the knockout proteins. This remains true even if I use a permutation test with 100,000 permutations.
What is a statistically sound way to detect differences in such a dataset?
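One possible contributor, assuming the tests are run per protein: with three samples per group there are only C(6,3) = 20 distinct ways to assign the group labels. An exact two-sided permutation test (or a rank-sum test) therefore cannot return a p-value below 2/20 = 0.1, no matter how many random permutations are drawn. The Python sketch below enumerates all label assignments for one protein; the function name `exact_permutation_p` and the values are made up for illustration.

```python
from itertools import combinations
from math import comb
from typing import Sequence


def exact_permutation_p(control: Sequence[float], mut: Sequence[float]) -> float:
    """Exact two-sided permutation p-value for the difference in group means,
    obtained by enumerating every way to split the pooled samples."""
    pooled = list(control) + list(mut)
    n, k = len(pooled), len(control)
    total = sum(pooled)

    def abs_mean_diff(idx) -> float:
        s = sum(pooled[i] for i in idx)
        return abs(s / k - (total - s) / (n - k))

    observed = abs_mean_diff(range(k))
    splits = list(combinations(range(n), k))
    # Small tolerance so the mirrored split counts as equally extreme.
    extreme = sum(1 for idx in splits if abs_mean_diff(idx) >= observed - 1e-12)
    return extreme / len(splits)


# Hypothetical, perfectly separated log2 values for one protein.
control_values = [10.0, 10.2, 9.9]
mut_values = [5.0, 5.1, 4.8]

n_splits = comb(6, 3)  # number of distinct label assignments
p_value = exact_permutation_p(control_values, mut_values)
print(n_splits, p_value)
```

Even for perfectly separated groups, only the observed split and its mirror image are as extreme as the data, which gives p = 2/20. A p-value floor of 0.1 per protein cannot survive multiple-testing correction across roughly 1,700 proteins.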
Thank you!
Thanks a lot, ATpoint! Limma with its M-estimation did the trick both mathematically and biologically. I appreciate your help!