Recently I ran into a problem that puzzled me: how do you know the data you are analyzing is indeed what you wanted, and how do you prove it using bioinformatic methods? (In other words, how do you know your data is true?)
I have searched Google for a long time but can't find a proper answer, so I would appreciate any advice!
written 7.3 years ago by Hughie • updated 18 months ago by Ram
I changed this to forum because it looks more like a discussion thread. I have the impression that you are referring to "correctness of annotation" like:
I was given data that was annotated as "mouse kidney" but the tubes
got mixed up, maybe it is now "brain or liver". How do I find out?
I once received microarray gene expression data to analyse: 3 conditions, 9 samples, 3 replicates each. As soon as I received the data I normalized it and did simple hierarchical clustering with the 1k, 2k and 3k most variable probes. We had some prior idea of how the clusters should look because of the conditions (healthy, tumor, tumor treated). We also received the order in which the samples were loaded on the chip. But in the clusters, array position 6 and array position 9 were always swapped. We suspected that something was wrong, so we went back to the person who had loaded the samples on the chip.
Guess what, the person who did the experiment had confused the numbers 6 and 9. Hence the suspicious result.
The point is, for microarray data it is a little easier to find this kind of flaw. But for sequencing data one should be careful from the beginning to control for these mistakes. Again, how you find out whether the data you are analysing is the data you intended depends on what kind of sequencing data you are handling. If you have some prior knowledge of the samples (like sample A has a P53 mutation, sample B has a BRCA mutation), it would be easy to identify these mutations from WGS/WES and so validate that the entire process went well. For ChIP-sequencing experiments, I would trust that everyone involved in the process is working carefully and go on with the analysis (which is what I am doing now).
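For anyone who wants to try the kind of check described above, here is a minimal sketch in plain Python. It is not the exact procedure from the post (that used hierarchical clustering); it swaps in a simpler nearest-group test: keep the most variable probes, then flag any sample that correlates better with another condition than with its own label. The function names, the `n_top` parameter and the input layout are assumptions made up for illustration.

```python
from math import sqrt

def variance(values):
    """Sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def flag_suspect_samples(expr, labels, n_top=1000):
    """expr: {sample: [normalized value per probe]}, labels: {sample: condition}.

    Returns (sample, given_label, best_matching_label) for every sample that
    looks more like another condition than like its own replicates.
    """
    samples = list(expr)
    n_probes = len(expr[samples[0]])
    # Keep the n_top probes with the highest variance across all samples.
    top = sorted(range(n_probes),
                 key=lambda i: variance([expr[s][i] for s in samples]),
                 reverse=True)[:n_top]
    reduced = {s: [expr[s][i] for i in top] for s in samples}

    suspects = []
    for s in samples:
        scores = {}
        for cond in set(labels.values()):
            # Mean correlation with the *other* samples carrying this label.
            others = [t for t in samples if t != s and labels[t] == cond]
            if others:
                scores[cond] = sum(pearson(reduced[s], reduced[t])
                                   for t in others) / len(others)
        best = max(scores, key=scores.get)
        if best != labels[s]:
            suspects.append((s, labels[s], best))
    return suspects
```

With 3 replicates per condition, a pairwise swap like positions 6 and 9 should show up as two flagged samples pointing at each other's condition. A flag is only a reason to go back to the bench records, not proof of a swap.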
Exchange of sample labels has happened to me, not once, not twice but thrice, both with sequencing and microarray! It is easy to detect it through clustering and correlations; however, if the mixed-up sample is completely unrelated, that might be a problem, depending on how close it is to the "true" samples.
If the samples are sequenced in pairs, i.e. tumor and matched normal, one can use methods like NGSCheckMate to confirm that both samples are from the same patient.
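To illustrate the general idea behind checking that paired samples come from the same individual, here is a rough sketch that compares genotype calls at shared SNP sites. This is not NGSCheckMate's implementation or interface, just a toy version of the concept; the function name, the input layout and the `min_sites` cutoff are hypothetical, and genotypes are assumed to be already normalized (e.g. always "0/1", never "1/0").

```python
def genotype_concordance(calls_a, calls_b, min_sites=20):
    """Fraction of shared SNP sites where two samples have the same genotype.

    calls_a, calls_b: {(chrom, pos): genotype string such as "0/1"}.
    Sites called in only one sample are ignored.
    """
    shared = [site for site in calls_a if site in calls_b]
    if len(shared) < min_sites:
        # Too few sites to say anything meaningful either way.
        raise ValueError(f"only {len(shared)} shared sites, need {min_sites}")
    matches = sum(calls_a[site] == calls_b[site] for site in shared)
    return matches / len(shared)
```

A tumor and its matched normal should agree at the large majority of germline sites, while two unrelated people should agree far less often. Where exactly to draw the line depends on coverage and the SNP panel, so any cutoff has to be calibrated on your own data.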
if the mixed-up sample is completely unrelated, that might be a problem
o_O Hope this never happens to anyone given the cost of the experiment :)
Hi, what do you mean by mislabeling? What if I am not sure whether I have mislabeled my sample, input and IgG? Could we figure it out easily? Thanks a lot!
What is your definition of truth? If you are just worried about the samples being mislabeled or exchanged, there are various means to diagnose it. Also, given that samples are run in replicates, if something odd happens to one or some of them, you might see it in various computational analyses because the odd one will stand out. Then there is truth at the biological level: what if my sampling is not the right one to represent my biology? This is harder to detect, but not impossible. Experiments are usually run either to validate existing hypotheses or to generate further hypotheses for deeper insight into the biology. In either case, you also have some idea of positive and negative controls. If you are working with neurons, you would normally expect genes related to neuronal function to show up in any kind of analysis. In contrast, you might not expect a lot of myocardial genes in neuronal experiments, but if you get them, you try to figure out their correlation. If you find new correlations, you've got new science (and a Nobel perhaps!); however, if it doesn't make any sense with the existing knowledge of your experiment, then, well, maybe your experiment was doomed.
Just do the things, and don't worry too much about theoretical questions like this one. If there is "true" biology in your samples, it will probably come out in your analysis. In contrast, if it is false, it will be detected, provided you know your hypotheses, biology and experiments well.
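The positive/negative control idea above can be turned into a quick sanity check. The sketch below is only an illustration: the function name and the `threshold` are made up, and the marker lists are something you would pick yourself for your tissue (the gene symbols in the usage lines are example picks, and the expression values are invented).

```python
def check_marker_genes(expr, expected_on, expected_off, threshold=1.0):
    """expr: {gene symbol: expression value, e.g. TPM}.

    Reports expected markers that look absent and unexpected markers
    that look present, using one crude threshold for both.
    """
    missing = [g for g in expected_on if expr.get(g, 0.0) < threshold]
    unexpected = [g for g in expected_off if expr.get(g, 0.0) >= threshold]
    return {"missing_expected": sorted(missing),
            "unexpected_present": sorted(unexpected)}

# Invented values for a sample that is supposed to be neuronal.
sample = {"SYP": 85.0, "MAP2": 40.2, "RBFOX3": 0.3, "MYH6": 12.5, "TNNT2": 0.0}
report = check_marker_genes(sample,
                            expected_on=["SYP", "MAP2", "RBFOX3"],
                            expected_off=["MYH6", "TNNT2"])
print(report)
```

A non-empty report does not say the data is wrong, only that it deserves a second look before going further.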
I mean that if I got experimental samples and sequenced them, how can
I assert that these data are what I need, i.e. that they reflect some
experimental conditions?
That is very subjective. You can probably tackle the "assert" part if you know an independent truth (e.g. it is common for large projects to genotype the samples, so that the sequence results can be verified by calling SNPs if a question arises about mix-ups; I don't think this is done as a common practice otherwise).
The experimental part is going to be more tenuous. Sequence is invariant, as opposed to experimental conditions. Unless it was a specific type of sequencing (e.g. RNA-seq in response to some condition), it would be difficult to make a direct connection (one case I can think of is a mutation, or mutations, that could be confirmed by plain sequencing).
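I changed this to forum because it looks more like a discussion thread. I have the impression that you are referring to "correctness of annotation" like:
I was given data that was annotated as "mouse kidney" but the tubes got mixed up, maybe it is now "brain or liver". How do I find out?
or do you refer to "fabricated/fake data"?
or do you refer to "measurement error"?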
Thank you, Michael, for your kindness! I meant something closer to "measurement error". Do you know any bioinformatic methods to assess that? Thank you again.
As you didn't mention a specific type of experiment:
If your data exists in a computer and can be analysed by bioinformatics methods, it must be true, no?
Maybe if you really explain what you mean you can get a useful answer. What kind of data are you talking about? Sequencing? Simulations? Real data? Hypothetical data?
What do you mean by "the data you are analyzing is indeed what you wanted"? No contamination? The data reflects some experimental conditions?
What do you mean by "your data is true"? You think someone created some fake data?
Thank you, h.mon!
I'm sorry for my ambiguous explanation.
I mean that if I got experimental samples and sequenced them, how can I assert that these data are what I need, i.e. that they reflect some experimental conditions?
You can ignore "your data is true".