Unbalanced experiment, samples with and without replicated data
4.1 years ago
judhenaosa ▴ 50

Dear all.

I have a question about how to work with the data I have.

The next table is an example of it:


Sample ID   Protein 1   Protein 2   Protein 3   Protein 4   Protein 5
sample 1    0.75        0.87        0.23        0.16        0.47
sample 2    0.46        NA          NA          0.69        0.18
sample 3    NA          0.75        NA          0.62        0.35
sample 3    0.45        0.89        NA          NA          NA
sample 4    NA          0.13        NA          0.46        0.67

As the table shows, only sample 3 has a replicate measurement. In addition, several proteins have missing values, so I cannot compute a mean across the replicates for every protein. How should I handle this data (sample 3)?

I am not even sure whether it is a good idea to use the mean (or any other kind of transformation) for sample 3 while the rest of the samples keep their raw values. I also don't know how to reduce this sample to a single value per protein.

If you can suggest an approach or a paper to guide me on how to proceed, I would appreciate it.

Thanks in advance,

Juan

statistics biostatistics • 1.0k views

It would help to know what these data even are and whether this is high- or low-throughput. Is this the full dataset and what is the analysis goal?


Hi.

The data are metabolomics (MS/MS, 51 metabolites) and proteomics (SomaLogic, approx. 1400 proteins). I have 40 samples, 10 of which have replicate measurements.

I am going to use the data for sample clustering via dimensionality-reduction approaches such as iCluster or MOFA.

If you have any other questions, please do not hesitate to ask.

Kind regards,

Juan


Have a look at the Mass spectrometry and proteomics data analysis Bioconductor workflow. It has a section on missing data and imputation; although short, it points to methods and packages for tackling the problem.

A few years ago I collaborated on an analysis of proteomics data. I had a minor role, but at the time the most common approach was simply to filter the data, e.g., by requiring that a protein be found in at least 20% of the samples. I think imputation is a common approach nowadays, but I am not up to speed on the subject.
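The detection-rate filter described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names `is_missing` and `filter_by_detection` are hypothetical, the 20% threshold is the one mentioned above, and the values are copied from the small example table in the question, where the two sample 3 rows are counted as separate rows.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def filter_by_detection(columns, min_fraction=0.2):
    """Keep features observed in at least `min_fraction` of the rows.

    `columns` maps a feature name to its list of values (one per row).
    """
    kept = {}
    for name, values in columns.items():
        observed = sum(not is_missing(v) for v in values)
        if values and observed / len(values) >= min_fraction:
            kept[name] = values
    return kept

NA = float("nan")
# Values copied from the example table in the question (5 rows).
example = {
    "Protein 1": [0.75, 0.46, NA, 0.45, NA],
    "Protein 2": [0.87, NA, 0.75, 0.89, 0.13],
    "Protein 3": [0.23, NA, NA, NA, NA],
    "Protein 4": [0.16, 0.69, 0.62, NA, 0.46],
    "Protein 5": [0.47, 0.18, 0.35, NA, 0.67],
}

kept_20 = filter_by_detection(example, min_fraction=0.2)  # keeps all five
kept_50 = filter_by_detection(example, min_fraction=0.5)  # drops Protein 3
```

On this toy table the 20% cutoff keeps every protein, because Protein 3 is observed in exactly one of five rows; a stricter cutoff such as 50% removes it. Which threshold is sensible for the real 40-sample dataset is a judgment call that this sketch does not settle.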

Are the two sample 3 measurements technical replicates? If so, why do you have technical replicates for sample 3?
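If they do turn out to be technical replicates, one possible way to get a single value per protein is to average the replicate rows while skipping missing values, so that a protein measured in only one replicate keeps that one value. The sketch below is an assumption about how this could be done, not a recommendation from the workflow; `collapse_replicates` is a hypothetical helper, and the rows are copied from the example table in the question.

```python
import math
from collections import defaultdict

def collapse_replicates(rows):
    """Average replicate rows per sample, ignoring missing values.

    `rows` is a list of (sample_id, values) pairs. A feature stays
    missing (NaN) only if it is missing in every replicate.
    """
    grouped = defaultdict(list)
    for sample_id, values in rows:
        grouped[sample_id].append(values)

    result = {}
    for sample_id, replicates in grouped.items():
        merged = []
        # zip(*replicates) walks the replicates feature by feature.
        for feature_values in zip(*replicates):
            observed = [v for v in feature_values if not math.isnan(v)]
            merged.append(sum(observed) / len(observed) if observed else math.nan)
        result[sample_id] = merged
    return result

NA = math.nan
# Rows copied from the example table; columns are Protein 1..5.
rows = [
    ("sample 1", [0.75, 0.87, 0.23, 0.16, 0.47]),
    ("sample 2", [0.46, NA, NA, 0.69, 0.18]),
    ("sample 3", [NA, 0.75, NA, 0.62, 0.35]),
    ("sample 3", [0.45, 0.89, NA, NA, NA]),
    ("sample 4", [NA, 0.13, NA, 0.46, 0.67]),
]

one_row_per_sample = collapse_replicates(rows)
# sample 3 becomes roughly [0.45, 0.82, nan, 0.62, 0.35]
```

Note that averaged values are less noisy than single measurements, so samples with and without replicates are not strictly comparable after this step; whether that matters for the downstream clustering is something to check rather than assume.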

4.1 years ago

The term that covers the subject is

"missing data imputation"

Google this term to see the various alternatives; the appropriate course of action depends on the characteristics of your data.

To be honest, if all your data are as sparse as what you show above, you might have a hard time salvaging them.
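To put a number on "sparse", a quick first step is to tabulate the fraction of missing values per row, per protein, and overall. The following is a minimal sketch in plain Python; `missing_fraction` is a hypothetical helper and the values are copied from the example table in the question.

```python
import math

def missing_fraction(values):
    """Fraction of NaN entries in an iterable of floats."""
    values = list(values)
    if not values:
        return math.nan
    return sum(math.isnan(v) for v in values) / len(values)

NA = math.nan
# Rows of the example table; columns are Protein 1..5.
rows = [
    [0.75, 0.87, 0.23, 0.16, 0.47],  # sample 1
    [0.46, NA, NA, 0.69, 0.18],      # sample 2
    [NA, 0.75, NA, 0.62, 0.35],      # sample 3, first row
    [0.45, 0.89, NA, NA, NA],        # sample 3, second row
    [NA, 0.13, NA, 0.46, 0.67],      # sample 4
]

per_row = [missing_fraction(row) for row in rows]
per_protein = [missing_fraction(col) for col in zip(*rows)]
overall = missing_fraction(v for row in rows for v in row)

print("per row:    ", per_row)
print("per protein:", per_protein)
print("overall:    ", overall)
```

In the toy table, 9 of 25 values are missing (36% overall) and Protein 3 is missing in 4 of 5 rows. Running the same summary on the full dataset would show whether the example is representative or an unusually sparse excerpt.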


As far as I know (this is really not my field), there are some standard approaches for imputing MS intensity values, but they require enough data points to first check whether the data do follow a certain distribution, so that the imputation is reliable and does not simply "make up" data.
