Dear all.
I have a question about how to work with the data I have.
The following table is an example:
Sample ID | Protein 1 | Protein 2 | Protein 3 | Protein 4 | Protein 5 |
---|---|---|---|---|---|
sample 1 | 0.75 | 0.87 | 0.23 | 0.16 | 0.47 |
sample 2 | 0.46 | NA | NA | 0.69 | 0.18 |
sample 3 | NA | 0.75 | NA | 0.62 | 0.35 |
sample 3 | 0.45 | 0.89 | NA | NA | NA |
sample 4 | NA | 0.13 | NA | 0.46 | 0.67 |
As you can see in the table above, only sample 3 has a replicate measurement. In addition, some proteins have missing values, so the mean of the two replicates cannot be computed for every protein. How should I handle this sample?
I am not even sure whether it is a good idea to use the mean (or any other transformation) for sample 3 while the other samples keep their raw values. I also don't know how to obtain a single value per protein for this sample.
If you can suggest an approach or a paper to guide me, I would appreciate it.
Thanks in advance,
Juan
It would help to know what these data even are and whether this is high- or low-throughput. Is this the full dataset and what is the analysis goal?
Hi.
The data are metabolomics (MS/MS, 51 metabolites) and proteomics (SomaLogic, approx. 1,400 proteins). I have 40 samples, 10 of which have replicate measurements.
I am going to use the data for sample clustering via dimensionality-reduction approaches such as iCluster or MOFA.
If you have any other questions, please do not hesitate to ask.
Kind regards,
Juan
Have a look at the Mass spectrometry and proteomics data analysis Bioconductor workflow. It has a section on missing data and imputation; although short, it points to methods and packages that tackle the problem.
A few years ago I collaborated on a proteomics data analysis. I had a minor role, but at the time the most common approach was simply to filter the data, e.g., by requiring that a protein be found in at least 20% of the samples. I think imputation is a common approach nowadays, but I am not up to speed with the subject.
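To make the filtering idea concrete, here is a minimal sketch in plain Python using the toy table from the question. Treat it as an illustration only: the helper names `detection_fraction` and `filter_proteins` are made up, `None` stands for NA, and the fraction is computed over rows, so a replicated sample counts twice unless replicates are collapsed first.

```python
# Toy data copied from the example table in the question; None stands for NA.
proteins = ["Protein 1", "Protein 2", "Protein 3", "Protein 4", "Protein 5"]
rows = [
    ("sample 1", [0.75, 0.87, 0.23, 0.16, 0.47]),
    ("sample 2", [0.46, None, None, 0.69, 0.18]),
    ("sample 3", [None, 0.75, None, 0.62, 0.35]),
    ("sample 3", [0.45, 0.89, None, None, None]),
    ("sample 4", [None, 0.13, None, 0.46, 0.67]),
]


def detection_fraction(rows, n_proteins):
    """Fraction of rows with a non-missing value, per protein column."""
    n_rows = len(rows)
    return [
        sum(values[j] is not None for _, values in rows) / n_rows
        for j in range(n_proteins)
    ]


def filter_proteins(rows, proteins, min_fraction=0.2):
    """Keep only protein columns detected in at least min_fraction of rows."""
    fractions = detection_fraction(rows, len(proteins))
    keep = [j for j, f in enumerate(fractions) if f >= min_fraction]
    kept_names = [proteins[j] for j in keep]
    kept_rows = [(sid, [values[j] for j in keep]) for sid, values in rows]
    return kept_names, kept_rows


fractions = detection_fraction(rows, len(proteins))
# A 50% cutoff is used here only so the toy table shows a column being dropped;
# the threshold mentioned above was 20%.
kept_names, kept_rows = filter_proteins(rows, proteins, min_fraction=0.5)
print(kept_names)
```

With the 50% cutoff Protein 3 is dropped, since it is observed in only one of the five rows. The right threshold depends on the dataset and the downstream method.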
Are the two sample 3 measurements technical replicates? If so, why do you have technical replicates for sample 3?
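If they do turn out to be technical replicates, one simple option (my assumption, not something taken from the workflow mentioned above) would be to average them per protein while skipping NAs, so that a protein measured in only one replicate keeps that single value. A rough sketch in plain Python, with `None` standing for NA and `collapse_replicates` a made-up helper name:

```python
from collections import defaultdict


def collapse_replicates(rows):
    """Average rows that share a sample ID, ignoring missing values (None).

    A protein that is missing in every replicate of a sample stays None.
    """
    grouped = defaultdict(list)
    for sample_id, values in rows:
        grouped[sample_id].append(values)

    collapsed = {}
    for sample_id, replicates in grouped.items():
        means = []
        # zip(*replicates) walks the replicates protein by protein.
        for column in zip(*replicates):
            observed = [v for v in column if v is not None]
            means.append(sum(observed) / len(observed) if observed else None)
        collapsed[sample_id] = means
    return collapsed


# The two "sample 3" rows from the example table, plus one unreplicated sample.
rows = [
    ("sample 1", [0.75, 0.87, 0.23, 0.16, 0.47]),
    ("sample 3", [None, 0.75, None, 0.62, 0.35]),
    ("sample 3", [0.45, 0.89, None, None, None]),
]
collapsed = collapse_replicates(rows)
print(collapsed["sample 3"])
```

Unreplicated samples pass through unchanged. Note that an averaged value is less noisy than a single measurement, so whether mixing the two is acceptable depends on the downstream method.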