Imagine I have a basket containing fruits, and I want to create a data frame in R with various information about these fruits. I would make several rounds of analysis to collect information, for instance that I have 2 apples, 1 banana and 3 oranges, and that apples are red, bananas are yellow and oranges, well... orange. Each round of information would be produced by a piece of software that outputs its findings in a simple text format. For instance, counts.txt would contain:
apple   count   2
banana  count   1
orange  count   3
A second program would output colours.txt with:
apple   colour  red
banana  colour  yellow
orange  colour  orange
I would then load the files into an R data frame and reshape it to obtain:
name    colour  count
apple   red     2
banana  yellow  1
orange  orange  3
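For what it is worth, the loading and reshaping step is short in base R. A minimal sketch, assuming whitespace-separated files named as above:

# Read each triple file: subject, property, value.
counts  <- read.table("counts.txt",  col.names = c("name", "property", "value"),
                      stringsAsFactors = FALSE)
colours <- read.table("colours.txt", col.names = c("name", "property", "value"),
                      stringsAsFactors = FALSE)

# Stack all the triples, then reshape from long to wide:
# one row per subject, one column per property.
triples <- rbind(counts, colours)
wide <- reshape(triples, idvar = "name", timevar = "property", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
wide <- wide[, c("name", "colour", "count")]
# Note: the count column comes back as character after the rbind;
# convert it with as.numeric() if needed.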
Of course, in real life I am doing this with samples whose sequence reads (FASTQ files) have been processed, producing metadata such as the mapping rate, the proportion of reads in exons or introns, etc.
I wonder if there is a set of tools or a standard procedure somewhere that does the same thing but is less ad hoc than my approach. For the data input and output, I have seen "triples" in serialisation formats like Turtle, RDF, etc., or even plain JSON, but they are much more complicated (that is, much harder to produce with the usual Unix command-line tools) than tab-separated triplets of subject, verb, object. For the loading into R, maybe it is trivial enough that it never seemed to deserve a package on CRAN.
Am I missing something? How do you organise similar work? One of my concerns is that, while this workflow is good enough for me to load data in R, I would like an approach that is equally friendly for people programming in other languages.
Edit: I would also just be happy with pointers to other work following the same approach.
I am not sure I get the gist of your problem (let me know if I misunderstood), but why wouldn't a single file formatted like this work?
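Something along these lines (a tab-separated sketch based on your fruit example):

name    colour  count
apple   red     2
banana  yellow  1
orange  orange  3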
Note that the first line is the header containing the meaning of each column.
I am running a QC pipeline, which is just a bunch of ad hoc scripts that extract a few statistics about each sample by parsing BAM, BED and other output files. According to the needs of the day, I add new scripts or modify old ones. Thus, instead of directly producing one table, I produce tab-separated text files in the form of subject-verb-object predicates like the first two examples, and I turn all these outputs into a single table in R later. The number and order of the columns can therefore vary easily. As long as I work alone, this fits my needs well. But for collaborative work, I wonder if that method has a name, and whether it has been compared with other approaches or formats (tab-separated vs. JSON, ...). Basically, I would like to be able to say something smarter to my colleagues than "it works for me, so it should be good for you as well". However, my problem is that I cannot find similar examples to refer to or study on the Internet, probably because I use the wrong keywords in my searches (triples, predicates, etc.)...
I read a bit more about Turtle. It looks like if I output files such as the one sketched below, I would be able to a) make SPARQL queries on the data (which I am not yet able to do), and b) work with these files in a more table-oriented way by discarding the lines starting with @, removing the 4th column and ignoring the prefixes (fruit: and properties: in columns 1 and 2). I am mostly talking to myself, but I am still wondering if it would be worth the effort.
Following up on the post "An ontology for mapping statistics?".