
Tool: Introducing Gallia: a Scala library for data manipulation


4.3 years ago

cros.anthony ▴ 30

Hi everyone,

This is an announcement for Gallia, a new Scala library for data manipulation that maintains a schema throughout transformations.

Of particular interest to bioinformaticians is this example processing a VCF file (namely ClinVar's), which involves a two-step process:

Generically parsing the VCF file (clinvar-agnostic)
Actually processing the clinvar-related data elements

This particular example basically turns VCF rows such as

#CHROM  POS      ID      REF  ALT  QUAL  FILTER  INFO
1       1049066  706774  G    A    .     .       AF_EXAC=0.00007;AF_TGP=0.00040;ALLELEID=694996;CLNDISDB=MONDO:MONDO:0014052,MedGen:C3808739,OMIM:615120;CLNDN=Myasthenic_syndrome,_congenital,_8;CLNHGVS=NC_000001.11:g.1049066G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=AGRN:375790;MC=SO:0001627|intron_variant;ORIGIN=1;RS=201995572

into a nested structure like:

  {
    "chromosome": "1",
    "position": 1049066,
    "_id": "706774",
    "ref": "G",
    "alt": "A",
    "clinvar_allele_id": "694996",
    "HGVS_expression": "NC_000001.11:g.1049066G>A",
    "variation_review_status": "criteria_provided,_single_submitter",
    "clinical_significance": "Benign",
    "allele_origin": [ "germline" ],
    "disease": [
      { "preferred_name": "Myasthenic_syndrome,_congenital,_8",
        "terms": [
          { "database": "MONDO",
            "id": "MONDO:0014052" },
          { "database": "MedGen",
            "id": "C3808739" },
          { "database": "OMIM",
            "id": "615120" } ] } ],
    "genes": [
      { "symbol": "AGRN",
        "entrez": "375790" } ],
    "molecular_consequences": [
      { "term": "SO:0001627",
        "name": "intron_variant" } ],
    "variant_type": {
      "name": "single_nucleotide_variant",
      "term": "SO:0001483" },
    "AF": {
      "EXAC": 0.00007,
      "1KGP": 0.00040 }
  }

This type of structure is more readily exploitable in a NoSQL store, for example to load into MongoDB or to offer faceted search via Elasticsearch.

Note that the library is domain-agnostic, and as such has no knowledge of what a VCF file is. It processes the file using the same mechanisms it would use for a dataset about finance or astronomy.
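To make the two steps above concrete, here is a minimal sketch in plain Scala. It deliberately does not use Gallia's API (which is not shown in this post); all names (`ClinvarRowSketch`, `parseInfo`, `parseDiseases`, ...) are made up for illustration. It also assumes that multiple diseases or genes are separated by `|`, and it keeps the raw `AF_` suffixes rather than renaming `TGP` to `1KGP`.

```scala
object ClinvarRowSketch {
  final case class Term(database: String, id: String)
  final case class Disease(preferredName: String, terms: List[Term])
  final case class Gene(symbol: String, entrez: String)

  // Split on the first occurrence of `sep` only, so that
  // "MONDO:MONDO:0014052" yields ("MONDO", "MONDO:0014052").
  private def splitOnce(s: String, sep: Char): (String, String) = {
    val i = s.indexOf(sep)
    if (i < 0) (s, "") else (s.substring(0, i), s.substring(i + 1))
  }

  // Step 1 (generic, ClinVar-agnostic): INFO column -> key/value pairs.
  def parseInfo(info: String): Map[String, String] =
    info.split(';').iterator.filter(_.nonEmpty).map(splitOnce(_, '=')).toMap

  // Step 2 (ClinVar-specific): pair each disease name with its cross-references.
  // Assumption: '|' separates diseases; ',' separates terms within CLNDISDB.
  // CLNDN is not split on ',' because names themselves contain commas.
  def parseDiseases(info: Map[String, String]): List[Disease] = {
    val names = info.get("CLNDN").toList.flatMap(_.split('|').toList)
    val xrefs = info.get("CLNDISDB").toList.flatMap(_.split('|').toList)
    names.zip(xrefs).map { case (name, raw) =>
      val terms = raw.split(',').toList.map { t =>
        val (database, id) = splitOnce(t, ':')
        Term(database, id)
      }
      Disease(name, terms)
    }
  }

  // GENEINFO looks like "AGRN:375790" (symbol:entrez).
  def parseGenes(info: Map[String, String]): List[Gene] =
    info.get("GENEINFO").toList.flatMap(_.split('|').toList).map { raw =>
      val (symbol, entrez) = splitOnce(raw, ':')
      Gene(symbol, entrez)
    }

  // Collect every AF_* entry under a single nested "AF" map.
  def parseFrequencies(info: Map[String, String]): Map[String, Double] =
    info.collect { case (k, v) if k.startsWith("AF_") => k.stripPrefix("AF_") -> v.toDouble }

  def main(args: Array[String]): Unit = {
    // Part of the INFO value from the row shown above.
    val info = parseInfo(
      "AF_EXAC=0.00007;AF_TGP=0.00040;CLNDISDB=MONDO:MONDO:0014052,MedGen:C3808739,OMIM:615120;" +
        "CLNDN=Myasthenic_syndrome,_congenital,_8;GENEINFO=AGRN:375790")
    println(parseDiseases(info))
    println(parseGenes(info))
    println(parseFrequencies(info))
  }
}
```

The point of the split is that `parseInfo` would work on any VCF file, while the remaining functions encode ClinVar-specific knowledge. A schema-aware library would additionally track the resulting field names and types, which this sketch does not attempt.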

Another example that may be of interest on this forum is the reprocessing of the dbNSFP data, which yields a similarly re-nested, more type-aware structure (see example input and output).

The library can also run on top of Spark RDDs when scalability matters, though Spark remains an optional dependency. Spark RDD processing can be invoked this way (though this part of the library is still rather bare).

For more information on Gallia, see the original announcement on the Scala mailing list, and the main article describing the library on GitHub.

I would love to hear whether people here think this is an effort worth pursuing!

Anthony (@anthony_cros)


scala etl json vcf spark • 1.6k views


score 0 · Answer 1 · 2021-03-05

Quick update:

Examples:

I added more examples of Gallia usage (e.g. Word Count, ...). Of particular interest to this forum would be the bioinformatics sub-section.

Two of the new examples stand out:

Re-structuring SnpEff's convoluted "ANN" value: https://github.com/galliaproject/gallia-snpeff#description
Providing a nested version of the Homo_Sapiens GeneMania data: https://github.com/galliaproject/gallia-genemania#description; resulting data is CC-BY-4.0-licensed (example output document)
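For readers unfamiliar with why SnpEff's "ANN" value benefits from re-structuring: it packs several annotations into one string, separated by commas, each made of pipe-separated positional fields, with multiple effects joined by `&`. The following plain-Scala sketch (not the code from the linked repository, and not Gallia's API) unpacks a few of those fields. The case class, the function name, and the input values are illustrative placeholders.

```scala
object AnnSketch {
  // Only a subset of the 16 positional ANN fields is kept here.
  final case class Annotation(
      allele: String,
      effects: List[String],
      impact: String,
      geneName: String,
      featureId: String,
      hgvsC: String,
      hgvsP: Option[String])

  def parseAnn(ann: String): List[Annotation] =
    ann.split(',').toList.map { entry =>
      // limit = -1 keeps trailing empty fields, which ANN entries usually have
      val fields = entry.split("\\|", -1)
      def field(i: Int): String = if (i < fields.length) fields(i) else ""
      Annotation(
        allele    = field(0),
        effects   = field(1).split('&').toList, // several effects can be combined
        impact    = field(2),
        geneName  = field(3),
        featureId = field(6),
        hgvsC     = field(9),
        hgvsP     = Some(field(10)).filter(_.nonEmpty)) // empty for non-coding effects
    }

  def main(args: Array[String]): Unit = {
    // Illustrative input; GENE1, ID1 and TX1 are placeholders.
    val ann =
      "T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|" +
        "protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||," +
        "T|splice_region_variant&intron_variant|LOW|GENE1|ID1|transcript|TX1|" +
        "protein_coding|2/5|c.100-3C>T||||||"
    parseAnn(ann).foreach(println)
  }
}
```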

Note that the GeneMania example above employs Gallia's "poor man's" scaling, which basically wraps GNU sort for wide operations, as the full data processing wouldn't fit in a consumer-grade computer's memory.
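The general idea behind wrapping GNU sort can be sketched as follows. This is my own minimal illustration of the technique, not Gallia's implementation: the expensive step is delegated to `sort`, which spills to disk as needed, and the sorted output is then streamed so that only one group is held in memory at a time. It assumes tab-separated lines keyed on the first column and a `sort` binary on the PATH; all names are hypothetical.

```scala
import java.nio.file.{Files, Path}
import scala.sys.process._

object SortGroupSketch {
  private def keyOf(line: String): String = line.takeWhile(_ != '\t')
  private def valueOf(line: String): String = line.drop(keyOf(line).length + 1)

  // Delegate sorting to the external `sort` command; returns its exit code.
  def sortByFirstColumn(in: Path, out: Path): Int =
    Process(
      Seq("sort", "-t", "\t", "-k1,1", "-o", out.toString, in.toString),
      None,
      "LC_ALL" -> "C" // byte-wise ordering, independent of the machine's locale
    ).!

  // Stream over already-sorted lines, emitting one group per run of equal keys.
  def groupSorted(lines: Iterator[String]): Iterator[(String, List[String])] =
    new Iterator[(String, List[String])] {
      private val buffered = lines.buffered
      def hasNext: Boolean = buffered.hasNext
      def next(): (String, List[String]) = {
        val key    = keyOf(buffered.head)
        val values = List.newBuilder[String]
        while (buffered.hasNext && keyOf(buffered.head) == key)
          values += valueOf(buffered.next())
        key -> values.result()
      }
    }

  def main(args: Array[String]): Unit = {
    val in  = Files.createTempFile("unsorted", ".tsv")
    val out = Files.createTempFile("sorted", ".tsv")
    Files.write(in, "b\t1\na\t2\nb\t3\n".getBytes("UTF-8"))

    val exitCode = sortByFirstColumn(in, out)
    require(exitCode == 0, s"sort failed with exit code $exitCode")

    val source = scala.io.Source.fromFile(out.toFile, "UTF-8")
    try groupSorted(source.getLines()).foreach(println)
    finally source.close()
  }
}
```

The trade-off is speed for memory: sorting on disk is slower than an in-memory group-by, but it lets a group-by or join run on data larger than RAM without a cluster.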

Codebase:

The code has been upgraded to Scala 2.13, with cross-compilation for 2.12.

License:

I have kicked off the process of adopting BSL as a license; the terms are still being worked out.

Contact:

Some people have reached out to me directly for questions and feedback - which is great - but don’t hesitate to provide input for others to see!