Question

Unbalanced experiment, samples with and without replicated data

0

4.2 years ago

judhenaosa ▴ 50

Dear all.

I have a question about how to work with the data I have.

The following table is an example:

Sample ID	Protein 1	Protein 2	Protein 3	Protein 4	Protein 5
sample 1	0.75	0.87	0.23	0.16	0.47
sample 2	0.46	NA	NA	0.69	0.18
sample 3	NA	0.75	NA	0.62	0.35
sample 3	0.45	0.89	NA	NA	NA
sample 4	NA	0.13	NA	0.46	0.67

As the table shows, only sample 3 has a replicate measurement. In addition, some proteins have missing values, so it is not possible to average the two replicates for every protein. How should I handle the data for sample 3?

I don't even know whether it is a good idea to use the mean (or any other summary) for sample 3 while the other samples keep their raw values. I also don't know how to obtain a single value per protein for this sample.

If you can suggest an approach or a paper to guide me, I would appreciate it.

Thanks in advance,

Juan

statistics biostatistics • 1.1k views

ADD COMMENT • link updated 4.2 years ago by Istvan Albert 102k • written 4.2 years ago by judhenaosa ▴ 50

0

It would help to know what these data even are and whether this is high- or low-throughput. Is this the full dataset and what is the analysis goal?

ADD REPLY • link 4.2 years ago by ATpoint 86k

0

Hi.

The data are metabolomics (MS/MS, 51 metabolites) and proteomics (SomaLogic, approximately 1400 proteins). I have 40 samples, 10 of which have replicate measurements.

I am going to use the data for sample clustering via dimensionality-reduction approaches such as iCluster or MOFA.

If you have any other questions, please do not hesitate to ask.

Kind regards,

Juan

ADD REPLY • link 4.2 years ago by judhenaosa ▴ 50

0

Have a look at the Mass spectrometry and proteomics data analysis Bioconductor workflow. It has a section on missing data and imputation; although short, it points to methods and packages that tackle the problem.

A few years ago I had a collaboration to analyze proteomics data. I had a minor role, but, at the time, the most common approach was to simply filter the data, e.g., by requiring that a protein be found in at least 20% of the samples. I think imputation is a common approach nowadays, but I am not up to speed on the subject.
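To make the filtering idea concrete, here is a minimal sketch in plain Python. The table layout (protein name mapped to a list of values, with `None` for missing) and the function names are assumptions made for illustration, not part of any proteomics package. Note that it counts measurement rows, so the two rows of sample 3 are counted separately.

```python
from typing import Dict, List, Optional

# Hypothetical layout: protein name -> one value per measurement row, None = missing.
Table = Dict[str, List[Optional[float]]]


def detection_fraction(values: List[Optional[float]]) -> float:
    """Fraction of rows in which the protein has a non-missing value."""
    return sum(v is not None for v in values) / len(values)


def filter_by_detection(table: Table, min_fraction: float = 0.2) -> Table:
    """Keep only proteins detected in at least `min_fraction` of the rows."""
    return {
        protein: values
        for protein, values in table.items()
        if detection_fraction(values) >= min_fraction
    }


# Toy table from the question (rows: sample 1, sample 2, sample 3, sample 3, sample 4).
toy: Table = {
    "Protein 1": [0.75, 0.46, None, 0.45, None],
    "Protein 2": [0.87, None, 0.75, 0.89, 0.13],
    "Protein 3": [0.23, None, None, None, None],
    "Protein 4": [0.16, 0.69, 0.62, None, 0.46],
    "Protein 5": [0.47, 0.18, 0.35, None, 0.67],
}

# A stricter 50% cutoff, chosen only so that the toy example drops something.
kept = filter_by_detection(toy, min_fraction=0.5)
print(list(kept))
```

The cutoff is a judgment call; the 20% figure above is just the example I remember, and the 50% used in the sketch is arbitrary.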

Are the two sample 3 measurements technical replicates? If so, why do you have technical replicates for sample 3?
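If they do turn out to be technical replicates, one possible way to get a single value per protein is a mean that ignores missing entries. The sketch below assumes exactly that; the row layout and the `collapse_replicates` name are hypothetical, and whether averaging is appropriate for the downstream clustering is a separate question that this does not settle.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: (sample ID, {protein name: value or None if missing}).
Row = Tuple[str, Dict[str, Optional[float]]]


def collapse_replicates(rows: List[Row]) -> Dict[str, Dict[str, Optional[float]]]:
    """Average replicate rows per sample, skipping missing values.

    A protein stays missing (None) only if it is missing in every replicate.
    """
    grouped: Dict[str, Dict[str, List[Optional[float]]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for sample_id, measurements in rows:
        for protein, value in measurements.items():
            grouped[sample_id][protein].append(value)

    collapsed: Dict[str, Dict[str, Optional[float]]] = {}
    for sample_id, proteins in grouped.items():
        collapsed[sample_id] = {}
        for protein, values in proteins.items():
            observed = [v for v in values if v is not None]
            collapsed[sample_id][protein] = fmean(observed) if observed else None
    return collapsed


proteins = ["Protein 1", "Protein 2", "Protein 3", "Protein 4", "Protein 5"]
raw = [
    ("sample 1", [0.75, 0.87, 0.23, 0.16, 0.47]),
    ("sample 2", [0.46, None, None, 0.69, 0.18]),
    ("sample 3", [None, 0.75, None, 0.62, 0.35]),
    ("sample 3", [0.45, 0.89, None, None, None]),
    ("sample 4", [None, 0.13, None, 0.46, 0.67]),
]
rows: List[Row] = [(sid, dict(zip(proteins, vals))) for sid, vals in raw]

collapsed = collapse_replicates(rows)
print(collapsed["sample 3"])
```

With this rule, a protein seen in only one replicate simply keeps that single value, so sample 3 ends up with a mix of averaged and single-measurement entries.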

ADD REPLY • link 4.2 years ago by h.mon 35k

score 1 · Answer 1 · 2020-10-12

1

4.2 years ago

Istvan Albert 102k

The term that covers the subject is

"missing data imputation"

Google this term to see the various alternatives; the appropriate course of action depends on the characteristics of your data.

To be honest, if all your data are as sparse as what you show above, you might have a hard time salvaging them.

ADD COMMENT • link 4.2 years ago by Istvan Albert 102k

0

There are some standard approaches for imputing MS intensity values afaik (really not my field), but they require (again afaik) enough data points to first check whether the data indeed follow a certain distribution, so that the imputation is reliable and does not simply "make up" data.

ADD REPLY • link 4.2 years ago by ATpoint 86k