Hi everyone,
This is an announcement for Gallia, a new Scala library for data manipulation that maintains a schema throughout transformations.
Of particular interest to bioinformaticians is this example processing a VCF file (namely ClinVar's), which involves a two-step process:
- Generically parsing the VCF file (ClinVar-agnostic)
- Actually processing the ClinVar-related data elements
This particular example basically turns VCF rows such as:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1049066 706774 G A . . AF_EXAC=0.00007;AF_TGP=0.00040;ALLELEID=694996;CLNDISDB=MONDO:MONDO:0014052,MedGen:C3808739,OMIM:615120;CLNDN=Myasthenic_syndrome,_congenital,_8;CLNHGVS=NC_000001.11:g.1049066G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=AGRN:375790;MC=SO:0001627|intron_variant;ORIGIN=1;RS=201995572
into a nested structure like:
{
  "chromosome": "1",
  "position": 1049066,
  "_id": "706774",
  "ref": "G",
  "alt": "A",
  "clinvar_allele_id": "694996",
  "HGVS_expression": "NC_000001.11:g.1049066G>A",
  "variation_review_status": "criteria_provided,_single_submitter",
  "clinical_significance": "Benign",
  "allele_origin": [ "germline" ],
  "disease": [
    { "preferred_name": "Myasthenic_syndrome,_congenital,_8",
      "terms": [
        { "database": "MONDO",
          "id": "MONDO:0014052" },
        { "database": "MedGen",
          "id": "C3808739" },
        { "database": "OMIM",
          "id": "615120" } ] } ],
  "genes": [
    { "symbol": "AGRN",
      "entrez": "375790" } ],
  "molecular_consequences": [
    { "term": "SO:0001627",
      "name": "intron_variant" } ],
  "variant_type": {
    "name": "single_nucleotide_variant",
    "term": "SO:0001483" },
  "AF": {
    "EXAC": 0.00007,
    "1KGP": 0.00040 }
}
This type of structure is easier to exploit in a NoSQL store, for example to load into MongoDB or to offer faceted search via Elasticsearch.
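To make the two steps a bit more concrete, here is a rough sketch in plain Scala. It does not use Gallia's API and is not how the library does it internally. All names (VcfSketch, parseRow, parseInfo, diseaseTerms, genes) are made up for illustration. It assumes tab-separated columns and a single disease per CLNDISDB value, and ignores most of the INFO keys.

```scala
// Illustrative sketch only: plain Scala, NOT Gallia's API. All names are hypothetical.
object VcfSketch {

  // Step 1 (ClinVar-agnostic): the fixed VCF columns plus INFO as key/value pairs.
  final case class VcfRow(
      chrom: String, pos: Int, id: String, ref: String, alt: String,
      info: Map[String, String])

  // "A=1;B=2;FLAG" -> Map(A -> 1, B -> 2, FLAG -> "")
  def parseInfo(info: String): Map[String, String] =
    info.split(';').iterator
      .filter(_.nonEmpty)
      .map { entry =>
        entry.split("=", 2) match {
          case Array(key, value) => key -> value
          case _                 => entry -> "" // flag-style entry without a value
        }
      }
      .toMap

  // Assumes a tab-separated data row with the 8 standard columns.
  def parseRow(line: String): VcfRow = {
    val cols = line.split('\t')
    require(cols.length >= 8, s"expected 8 tab-separated columns, got ${cols.length}")
    VcfRow(cols(0), cols(1).toInt, cols(2), cols(3), cols(4), parseInfo(cols(7)))
  }

  // Step 2 (ClinVar-specific): give structure to individual INFO values.
  final case class Term(database: String, id: String)
  final case class Gene(symbol: String, entrez: String)

  // Splits on the first ':' only, so "MONDO:MONDO:0014052" keeps "MONDO:0014052" as its id.
  def diseaseTerms(clndisdb: String): List[Term] =
    clndisdb.split(',').toList
      .map(_.split(":", 2))
      .collect { case Array(db, id) => Term(db, id) }

  // GENEINFO entries look like "AGRN:375790"; several genes are '|'-separated.
  def genes(geneinfo: String): List[Gene] =
    geneinfo.split('|').toList
      .map(_.split(":", 2))
      .collect { case Array(symbol, entrez) => Gene(symbol, entrez) }

  def main(args: Array[String]): Unit = {
    // A shortened version of the row above, rebuilt with explicit tabs.
    val line = Seq("1", "1049066", "706774", "G", "A", ".", ".",
      "CLNDISDB=MONDO:MONDO:0014052,MedGen:C3808739,OMIM:615120;GENEINFO=AGRN:375790;CLNSIG=Benign"
    ).mkString("\t")

    val row = parseRow(line)
    println(row.info("CLNSIG"))
    println(diseaseTerms(row.info("CLNDISDB")))
    println(genes(row.info("GENEINFO")))
  }
}
```

The point of keeping the two steps apart is that step 1 would work on any VCF file, while step 2 is the only part that knows what CLNDISDB or GENEINFO mean.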
Note that the library is domain-agnostic, and as such has no knowledge of what a VCF file is. It processes the file using the same mechanisms it would use for a dataset about finance or astronomy.
Another example that may be of interest on this forum is the reprocessing of the dbNSFP data, which produces a similarly re-nested, more type-aware structure (see example input and output).
The library also abstracts Spark RDDs for cases where scalability matters, though Spark is an optional dependency. Spark RDD processing can be invoked this way (though this part of the library is still rather bare).
For more information on Gallia, see the original announcement on the Scala mailing list, and the main article describing the library on GitHub.
I would love to hear whether people here think this is an effort worth pursuing!
Anthony (@anthony_cros)