Any moderately complex analysis uses data from multiple sources, some more trustworthy than others. A data source may be known to have problems yet still be useful, or its reliability may simply be unknown. Any result that touches bad data may be tainted, so how do you manage provenance in your own analyses? And how do you communicate it to consumers of your data or code?
The accepted answer will need to be specific. Here are some hypothetical answers:
- All data sources and versions are recorded in a README file stored alongside the analysis scripts and results.
- Literate programming is used, with data sources and versions recorded in the document itself.
- New versions of data sources are regression-tested against a set of standard analyses before use, to confirm that they still give the expected results.
- Within the analysis pipelines, all data are annotated with a reliability score. When a function is called on values with differing scores, the return value is assigned the mean/median/minimum of the input scores (see the sketch after this list).
- Stock trading algorithms are used to annotate result reliability in real time, based on customer feedback (!)
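To make the reliability-score idea concrete, here is a minimal sketch in Clojure (the language mentioned below). The names `combine-scores` and `apply-scored`, and the choice of minimum as the combining policy, are assumptions for illustration, not an existing library:

```clojure
;; Sketch: propagate reliability scores through a function call.
;; Policy (assumed): a result is only as reliable as its least reliable input.
(defn combine-scores [scores]
  (apply min scores))

(defn apply-scored
  "Apply f to scored inputs, each a map {:value v :score s},
   and wrap the result with the combined score."
  [f & inputs]
  {:value (apply f (map :value inputs))
   :score (combine-scores (map :score inputs))})

;; Adding a trusted value (0.9) to a dubious one (0.4) yields a result scored 0.4.
(apply-scored + {:value 10 :score 0.9} {:value 32 :score 0.4})
;; => {:value 42, :score 0.4}
```

Switching to mean or median only changes `combine-scores`; the propagation machinery stays the same.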
I'm interested because I'm writing analysis pipelines and would like to introduce automatic, fine-grained provenance tracking. Using Clojure metadata is one possible route to achieve this.
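For the metadata route, here is a minimal sketch of what that might look like; the map keys, the filename `survey-2023.csv`, and the `lift` helper are all hypothetical. One caveat: Clojure metadata attaches only to objects implementing `IObj` (collections, symbols, and the like), not to bare numbers or strings, and functions such as `map`/`mapv` return fresh collections without it, so propagation has to be done explicitly:

```clojure
;; Sketch: carry provenance in Clojure metadata (keys and filename are made up).
(def raw-counts
  (with-meta [12 7 31]
    {:source "survey-2023.csv" :version "v2" :reliability 0.8}))

(defn lift
  "Apply f element-wise and copy the input's provenance metadata onto the
   result -- an assumed propagation policy, since mapv itself drops metadata."
  [f data]
  (with-meta (mapv f data) (meta data)))

(meta (lift inc raw-counts))
;; => {:source "survey-2023.csv", :version "v2", :reliability 0.8}
```

The appeal is that metadata rides invisibly alongside the data, but every pipeline stage needs a `lift`-style wrapper to keep the annotations alive.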