Question

A simple "predicate" format for loading (meta)data in R (or elsewhere)?

8.1 years ago

Charles Plessy ★ 2.9k

Imagine I have a basket containing fruits, and I want to create a data frame in R with various pieces of information about these fruits. I would make several rounds of analysis to collect information, for instance, that I have 2 apples, 1 banana and 3 oranges, and that apples are red, bananas are yellow and oranges, well... orange. Each round would be run by a program that outputs its findings in a simple text format. For instance, counts.txt would contain:

apple   count   2
banana  count   1
orange  count   3

A second program would output colours.txt with:

apple   colour  red
banana  colour  yellow
orange  colour  orange

I would then load the files into an R data frame and reshape it to obtain:

name    colour  count
apple   red     2
banana  yellow  1
orange  orange  3
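For what it is worth, the load-and-reshape step described above can be sketched in base R. This is only a minimal sketch under stated assumptions, not an established tool: the column names subject, predicate and object and the helper read_triples are made up for illustration, the files are assumed to be tab-separated, and their contents are inlined as strings so that the example is self-contained.

```r
# Contents of counts.txt and colours.txt, inlined for the example
# (assumption: the real files are tab-separated, one triple per line).
counts_txt  <- "apple\tcount\t2\nbanana\tcount\t1\norange\tcount\t3"
colours_txt <- "apple\tcolour\tred\nbanana\tcolour\tyellow\norange\tcolour\torange"

# Hypothetical helper: read one file of subject / predicate / object triples.
# Everything is read as character so that files with different value types
# can be stacked safely.
read_triples <- function(text) {
  read.table(text = text, sep = "\t", header = FALSE,
             col.names = c("subject", "predicate", "object"),
             colClasses = "character", quote = "", comment.char = "")
}

# Stack all triples into one long table.
triples <- rbind(read_triples(counts_txt), read_triples(colours_txt))

# Long to wide: one row per subject, one column per predicate.
wide <- reshape(triples, idvar = "subject", timevar = "predicate",
                direction = "wide")
names(wide) <- sub("^object\\.", "", names(wide))

# Restore sensible column types (count becomes integer), tidy up.
wide[] <- lapply(wide, type.convert, as.is = TRUE)
wide <- wide[, c("subject", "colour", "count")]
rownames(wide) <- NULL

print(wide)
```

With files on disk, the same read.table() call would take a file name such as counts.txt instead of the text argument. Because each predicate simply becomes a column, adding a new script to the pipeline adds a column without changing the loading code.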

Of course, in real life I am doing this with samples whose sequence reads in FASTQ files have been processed to produce metadata such as mapping rate, proportion of reads in exons or introns, etc.

I wonder if there is a set of tools or a standard procedure somewhere that does the same but is less ad hoc than my approach. For the data input and output, I have seen "triples" in serialisation formats like Turtle, RDF, etc., or even plain JSON, but they are much more complicated (that is, much harder to produce with the usual Unix command-line tools) than tab-separated triplets of subject, verb, object. As for the loading into R, maybe it is trivial enough that it never seemed to deserve a package on CRAN.

Am I missing something? How do you organise similar work? One of my concerns is that, while this workflow is good enough for me to load data in R, I would like an approach that is equally friendly to people programming in other languages.

Edit: I would also be happy with just pointers to other work following the same approach.

data format R • 2.1k views

8.1 years ago by Charles Plessy ★ 2.9k

I am not sure I get the gist of your problem (let me know if I misunderstood), but why would a single file formatted like this not work?

fruit   color   count
apple   red     2
banana  yellow  1
orange  orange  3

Note that the first line is the header, which gives the meaning of each column.

8.1 years ago by ddiez ★ 2.0k

I am running a QC pipeline, which is just a bunch of ad-hoc scripts that extract a few statistics about each sample by parsing BAM, BED and other output files. Depending on the needs of the day, I add new scripts or modify old ones. Thus, instead of directly producing one table, I produce tab-separated text files in the form of subject-verb-object predicates like the first two examples, and I turn all these outputs into a single table in R later. The number and order of its columns may therefore vary easily. As long as I work alone, this fits my needs well. But for collaborative work, I wonder if that method has a name, and if it has been compared with other approaches or formats (like tab-separated vs. JSON, ...). Basically, I would like to be able to say something smarter to my colleagues than "because it works for me, it should be good for them as well". However, my problem is that I do not find similar examples to refer to or study on the Internet, probably because I use the wrong keywords in my searches (triples, predicates, etc.)...

8.1 years ago by Charles Plessy ★ 2.9k

I read a bit more about Turtle. It looks like if I output files such as the following, I would be able to a) make SPARQL queries on the data (which I am not yet able to do), and b) work with these files in a more table-oriented way by discarding lines starting with @, removing the 4th column and ignoring the prefixes (fruit: and property: in columns 1 and 2).

@prefix fruit: <http://example.com/fruits> .
@prefix property: <http://example.com/mypipeline/properties>.
fruit:apple property:count  2   .
fruit:banana    property:count  1   .
fruit:orange    property:count  3   .

I am mostly talking to myself, but I am still wondering if it would be worth the effort.
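The table-oriented route (b) described above could look roughly like this in base R. It is a sketch under assumptions rather than a Turtle parser: it assumes one triple per line, tab-separated fields with the final "." in its own column, and prefixed names without colons in the local part. The column names subject, predicate and object are again illustrative, and the sample lines are inlined instead of being read from a file.

```r
# Sample lines mirroring the Turtle-like example (assumption: the fields of
# the data lines are separated by tabs).
turtle_lines <- c(
  "@prefix fruit: <http://example.com/fruits> .",
  "@prefix property: <http://example.com/mypipeline/properties>.",
  "fruit:apple\tproperty:count\t2\t.",
  "fruit:banana\tproperty:count\t1\t.",
  "fruit:orange\tproperty:count\t3\t."
)

# 1. Discard the directive lines starting with "@".
body <- turtle_lines[!grepl("^@", turtle_lines)]

# 2. Parse the remaining lines as a four-column table and drop the 4th
#    column, which only holds the terminating ".".
fields <- read.table(text = body, sep = "\t", header = FALSE,
                     colClasses = "character", quote = "", comment.char = "")
triples <- fields[, 1:3]
names(triples) <- c("subject", "predicate", "object")

# 3. Ignore the prefixes (everything up to the first colon) in columns 1 and 2.
triples$subject   <- sub("^[^:]*:", "", triples$subject)
triples$predicate <- sub("^[^:]*:", "", triples$predicate)

print(triples)
```

The result has the same three-column shape as the plain tab-separated triples, so it could be reshaped into a wide table in the same way. Anything beyond this restricted layout (multi-line statements, quoted literals, full IRIs) would need a real RDF library.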