Question

Statistical analysis of quantified proteomic data: ctrl vs. mut

0

6 months ago

C. Ryder • 0

Dear community,

I have a few questions about proteomic data analysis. I’m analyzing results from StPeter (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5891225/), a protein-level label-free quantification tool for MS/MS data, with an FDR of 1%.

I have two groups: control and mut (double knockouts). Each group consists of three samples. For the downstream analysis, I keep only the following detected/quantified proteins:

Detected in all six samples;
Detected in all three control samples and in none of the mut samples;
Detected in all three mut samples and in none of the control samples.
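The three filter rules above can be sketched as follows. This is only an illustration: the function names, the protein IDs, the quantities, and the convention that `None` marks "not detected" are all hypothetical, not StPeter output.

```python
import math


def detected(value):
    """Treat a protein as detected when it has a positive, non-NaN quantity."""
    return value is not None and not math.isnan(value) and value > 0


def classify(control, mut):
    """Return the filter category for one protein, or None if it is dropped."""
    n_ctrl = sum(detected(v) for v in control)
    n_mut = sum(detected(v) for v in mut)
    if n_ctrl == len(control) and n_mut == len(mut):
        return "shared"          # detected in all six samples
    if n_ctrl == len(control) and n_mut == 0:
        return "control_only"    # all controls, no mut samples
    if n_ctrl == 0 and n_mut == len(mut):
        return "mut_only"        # all mut samples, no controls
    return None                  # partial detection: filtered out


# Hypothetical quantities in ng; None marks "not detected".
proteins = {
    "P1": ([32.0, 41.5, 38.2], [30.1, 35.7, 44.0]),
    "P2": ([0.4, 0.6, 0.5], [None, None, None]),
    "P3": ([None, None, None], [0.05, 0.02, 0.08]),
    "P4": ([1.2, None, 0.9], [1.1, 1.0, 1.3]),
}

kept = {}
for name, (ctrl, mut) in proteins.items():
    category = classify(ctrl, mut)
    if category is not None:
        kept[name] = category

print(kept)
```

Here the hypothetical P4 is dropped because it is detected in only two of the three control samples.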

Some proteins are expressed at levels between 30-60 ng, while others are only at 0.01-1 ng. The total number of quantified proteins after filtering is approximately 1,700. About 150 proteins are unique to the three control samples, and another 150 are unique to the three mut samples.

Before statistical analysis, I applied log2 transformation to the data.

Question 1: Do I need to perform any additional data processing (QC, normalization, etc.) besides StPeter’s built-in normalization? StPeter quantifies proteins from MS/MS spectra. Its primary unit of quantification is the Spectral Index, which extends spectral counting by integrating fragment ion peak intensity. Protein quantities are normalized by protein length and by the total spectral index abundance across the entire sample.
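As a rough sketch of the normalization described above, assuming it amounts to dividing each protein's spectral index by the sample total and then by the protein length: this is a reading of the description, not StPeter's actual code, and the function name and input values are hypothetical.

```python
def normalize_spectral_index(spectral_index, lengths):
    """Divide each spectral index by the sample total, then by protein length.

    Assumed form of the normalization; StPeter's real implementation may differ.
    """
    total = sum(spectral_index.values())
    return {
        protein: (si / total) / lengths[protein]
        for protein, si in spectral_index.items()
    }


# Hypothetical single-sample input: raw spectral index and length in residues.
spectral_index = {"A": 300.0, "B": 100.0}
lengths = {"A": 300, "B": 100}

normalized = normalize_spectral_index(spectral_index, lengths)
print(normalized)
```

In this toy sample, protein A has three times the raw signal of B but is also three times as long, so both end up with the same normalized value.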

Question 2: What statistical analysis should I use? The proteins detected in both control and mut groups have a normal distribution. However, the proteins unique to each group make the entire dataset non-normally distributed. When I apply non-parametric tests to the entire dataset, they don’t identify any statistically significant differences between controls and muts, not even for the knockout proteins. This remains true even if I use a permutation test with 100,000 permutations.
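One detail about the permutation test is worth spelling out: with three samples per group there are only C(6,3) = 20 distinct ways to assign the group labels, so drawing 100,000 random permutations just resamples those 20. A minimal sketch of an exact two-sided permutation test on the difference in means, using made-up values for a perfectly separated protein:

```python
from itertools import combinations
from math import comb


def exact_permutation_pvalue(control, mut):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = control + mut
    n = len(control)
    observed = abs(sum(control) / n - sum(mut) / len(mut))
    hits = 0
    total = 0
    # Enumerate every way of choosing which samples are labelled "control".
    for idx in combinations(range(len(pooled)), n):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        diff = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        total += 1
    return hits / total


# Hypothetical log2 quantities: the two groups do not overlap at all.
control = [5.1, 5.3, 5.2]
mut = [0.1, 0.2, 0.15]

print(comb(6, 3))  # → 20 distinct label assignments
p = exact_permutation_pvalue(control, mut)
print(p)  # → 0.1
```

Even for the most extreme separation possible, the two-sided p-value cannot go below 2/20 = 0.1, and the exact Wilcoxon rank-sum test has the same floor for 3 vs. 3. That is why no protein reaches significance, before any multiple-testing correction is even applied.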

How can I detect differences in such a dataset in a statistically sound way?

Thank you!

statistics proteomics • 416 views

updated 6 months ago by ATpoint 86k • written 6 months ago by C. Ryder • 0

score 3 · Accepted Answer · 2024-06-25

6 months ago

ATpoint 86k

With only six samples in total, "traditional" tests such as the Wilcoxon test have very little power, as you have found. I would load the matrix of normalized, log2-scale data into limma. That is (to me) the best option you have. I cannot comment on the need for further normalization, as I don't know the tool you use. I would check whether proteins that should be unchanged between the groups are in fact roughly the same; if so, it's probably fine.
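To illustrate why limma helps with so few samples, here is a minimal Python sketch of the moderated t-statistic it is built around: the per-protein variance is shrunk toward a prior variance, which adds degrees of freedom. This is not limma's code. limma is an R package and estimates the prior degrees of freedom and prior variance from all proteins via empirical Bayes, whereas here `prior_df` and `prior_var` are hypothetical fixed inputs, as are the data values.

```python
from math import sqrt
from statistics import mean, variance


def moderated_t(control, mut, prior_df, prior_var):
    """Moderated t-statistic for one protein, given fixed prior hyperparameters."""
    n1, n2 = len(control), len(mut)
    resid_df = n1 + n2 - 2
    # Pooled within-group sample variance, as in an ordinary two-sample t-test.
    pooled_var = ((n1 - 1) * variance(control) + (n2 - 1) * variance(mut)) / resid_df
    # Shrink the per-protein variance toward the prior variance.
    post_var = (prior_df * prior_var + resid_df * pooled_var) / (prior_df + resid_df)
    t = (mean(mut) - mean(control)) / sqrt(post_var * (1 / n1 + 1 / n2))
    # The statistic is compared against a t-distribution with this many df.
    return t, prior_df + resid_df


# Hypothetical log2 quantities for one protein.
control = [10.0, 10.1, 9.9]
mut = [12.0, 12.1, 11.9]

# Assumed prior values; in limma these come from the whole dataset.
t_mod, df_mod = moderated_t(control, mut, prior_df=4, prior_var=0.05)
# With prior_df=0 there is no shrinkage and this is the ordinary t-statistic.
t_ord, df_ord = moderated_t(control, mut, prior_df=0, prior_var=0.0)

print(t_mod, df_mod)
print(t_ord, df_ord)
```

In this example the protein's own variance is smaller than the assumed prior, so shrinkage makes the statistic less extreme while the degrees of freedom rise from 4 to 8. A protein with an unusually large variance would be pulled the other way.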


0

Thanks a lot, ATpoint! Limma with its M-estimation did the trick both mathematically and biologically. I appreciate your help!

5 months ago by C. Ryder • 0