A developer-friendly JSON exchange format for ontologies

OWL2 ontologies can be rendered using a number of alternate concrete forms / syntaxes:

  • Manchester Syntax
  • Functional Syntax
  • OWL-XML
  • RDF/XML
  • RDF/Turtle

All of the above are official W3C recommendations. If you aren’t that familiar with these formats and the differences between them, the W3C OWL Primer is an excellent starting point. While all of the above are semantically equivalent ways to serialize OWL (with the exception of Manchester, which cannot represent some axiom types), there are big pragmatic differences in the choice of serialization. For most developers, the most important differentiating factor is support for their language of choice.

Currently, the only language I am aware of with complete support for all serializations is Java, in the form of the OWLAPI. This means that most heavy-duty ontology applications use Java or a JVM language (see previous posts for some examples of JVM frameworks that wrap the OWLAPI).

Almost all programming languages have support for RDF parsers, which is one reason why the default serialization for OWL is usually an RDF one. In theory this makes OWL more accessible. However, RDF can be a very low-level way to process ontologies. For certain kinds of operations, such as traversing a simple subClassOf hierarchy, it can be perfectly fine. However, even commonly encountered constructs such as “X SubClassOf part-of some Y” are very awkward to handle, involving blank nodes (see the translation here). When it comes to something like axiom annotations (common in OBO ontologies), things quickly get cumbersome. It must be said, though, that using an RDF parser is always better than processing an RDF/XML file using an XML parser. That is two levels of abstraction too low. Never do this! You will go to OWL hell. At least you will not be in the lowest circle – this is reserved for people who parse RDF/XML using an ad-hoc Perl regexp parser.
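To make the blank-node awkwardness concrete, here is a minimal sketch in plain Python (no RDF library involved) of what “X SubClassOf part-of some Y” looks like once it has been flattened into triples, and what it takes to get the simple edge back out. The triple pattern follows the standard OWL-to-RDF mapping; the ex: names, the blank node label and the helper existential_edges are all made up for illustration.

```python
# Triples for two axioms, written as (subject, predicate, object) tuples:
#   ex:X SubClassOf ex:Z
#   ex:X SubClassOf ex:part_of some ex:Y
# "_:b0" is a blank node standing for the anonymous restriction class.
triples = [
    ("ex:X", "rdfs:subClassOf", "ex:Z"),
    ("ex:X", "rdfs:subClassOf", "_:b0"),
    ("_:b0", "rdf:type", "owl:Restriction"),
    ("_:b0", "owl:onProperty", "ex:part_of"),
    ("_:b0", "owl:someValuesFrom", "ex:Y"),
]


def existential_edges(triples):
    """Recover (subclass, property, filler) from restriction blank nodes."""
    # Index the properties of every subject so blank nodes can be looked up.
    # (Later values overwrite earlier ones; enough for the blank nodes here.)
    props = {}
    for s, p, o in triples:
        props.setdefault(s, {})[p] = o

    edges = []
    for s, p, o in triples:
        # Only subClassOf triples that point at a blank node are of interest.
        if p != "rdfs:subClassOf" or not o.startswith("_:"):
            continue
        node = props.get(o, {})
        if (node.get("rdf:type") == "owl:Restriction"
                and "owl:onProperty" in node
                and "owl:someValuesFrom" in node):
            edges.append((s, node["owl:onProperty"], node["owl:someValuesFrom"]))
    return edges


print(existential_edges(triples))
```

One axiom has become four triples plus a join through an anonymous node, and this sketch does not even attempt nested expressions or axiom annotations.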

Even in JVM languages, an OWL-level abstraction can be less than ideal for some of the operations people want to do on a biological ontology. These operations include:

  • Construct and traverse a graph constructed from SubClassOf axioms between either pairs of named classes, or named-class to existential restriction pairs
  • Create an index of classes based on a subset of lexical properties, such as labels and synonyms
  • Generate a simple term info page for showing in a web application, with common fields like definition prominently shown, with full attribution for all axioms
  • Extract some subset of the ontology

It can be quite involved doing even these simple operations using the OWLAPI. This is not to criticize the OWLAPI – it is an API for OWL, and OWL is in large part a syntax for writing set-theoretic expressions constraining a world of models. This is a bit of a cognitive mismatch for a hierarchy of lexical objects, or a graph-based organization of concepts, which is the standard abstraction for ontologies in bioinformatics.

There are some libraries that provide useful convenience abstractions – this was one of the goals of OWLTools, as well as The Brain. I usually recommend a library such as one of these for bioinformaticians wishing to process OWL files, but it’s not ideal for everyone. It introduces yet another layer, and still leaves out non-JVM users.

For cases where we want to query over ontologies already loaded in a database or registry, there are some good abstraction layers – SciGraph provides a bioinformatician-friendly graph level / Neo4J view over OWL ontologies. However, sometimes it’s still necessary to have a library to parse an ontology fresh off the filesystem with no need to start up a webservice and load in an ontology.

What about OBO format?

Of course, many bioinformaticians are blissfully unaware of OWL and just go straight to OBO format, a format devised by and originally for the Gene Ontology. And many of these bioinformaticians seem reasonably content to continue using this – or at least lack the activation energy to switch to OWL (despite plenty of encouragement).

One of the original criticisms of Obof was its lack of formalism, but now Obof has a defined mapping to OWL, and that mapping is implemented in the OWLAPI. Protege can load and save Obof just as if it were any other OWL serialization, which it effectively is (without the W3C blessing). It can only represent a subset of OWL, but that subset is actually a superset of what most consumers need. So what’s the problem in just having Obof as the bioinformaticians’ format, and ontologists using OWL for heavy-duty ontology lifting?

There are a few:

  • It’s ridiculously easy to create a hacky parser for some subset of Obof, but it’s surprisingly hard to get it right. Many of the parsers I have seen are implemented based on the usual bioinformatics paradigm of ignoring the specs and inferring a format based on a few examples. These have a tendency to proliferate, as it’s easier to write your own than to figure out whether someone else’s fits your needs. Even with the better ones, there are always edge cases that don’t conform to expectations. We often end up having to normalize Obof output in certain ways to avoid breaking crappy parsers.
  • The requirement to support Obof leads to cases of tails wagging the dog, whereby ontology producers will make some compromise to avoid alienating a certain subset of users
  • Obof will always support the same subset of OWL. This is probably more than what most people need, but there are frequently situations where it would be useful to have support for one extra feature – perhaps blank nodes to support one level of nesting in an expression.
  • The spec is surprisingly complicated for what was intended to be a simple format. This can lead to traps.
  • The mapping between CURIE-like IDs and semantic web URIs is awkwardly specified and leads to no end of confusion when the semantic web world wants to talk to the bio-database world. Really we should have reused something like JSON-LD contexts up front. We live and learn.
  • Really, there should be no need to write a syntax-level parser. Developers expect something layered on XML or JSON these days (more so the latter).

What about JSON-LD?

A few years ago I asked on the public-owl-dev list if there were a standard JSON serialization for OWL. This generated some interesting discussion, including a suggestion to use JSON-LD.

I still think that this is the wrong level of abstraction for many OWL ontologies. JSON-LD is great and we use it for many instance-level representations, but it suffers from the same issue that all RDF layerings of OWL face: they are too low level for certain kinds of OWL axioms. Also, JSON-LD is a bit too open-ended for some developers, as graph labels are mapped directly to JSON keys, making it hard to map.

Another suggestion on the list was to use a relatively straightforward mapping of something like functional/abstract syntax to JSON. This is a great idea and works well if you want to implement something akin to the OWL API for non-JVM languages. I still think that such a format is important for increasing uptake of OWL, and hope to see this standardized.

However, we’re still back at the basic bioinformatics use case, where an OWL-level abstraction doesn’t make so much sense. Even if we get an OWL-JSON, I think there is still a need for an “OBO-JSON”, a JSON that can represent OWL constructs, but with a mapping to structures that correspond more closely to the kinds of operations like traversing a TBox-graph that are common in life sciences applications.

A JSON graph-oriented model for ontologies

After kicking this back and forth for a while we have a proposal for a graph-oriented JSON model for OWL, tentatively called obographs. It’s available at https://github.com/geneontology/obographs

The repository contains the start of documentation on the structural model (which can be serialized as JSON or YAML), plus Java code to translate an OWL ontology to obograph JSON or YAML.

Comments are more than welcome, here or in the tracker. But first some words concerning the motivation here.

The overall goal was to make it easy to do the 99% of things that bioinformatics developers usually do, but without throwing the 1% under the bus. Although it is not yet a complete representation of OWL, the initial design allows for extension in this direction.

One consequence of this is that the central object is an existential graph (I’ll get to that term in a second). We call this subset Basic OBO Graphs, or BOGs, roughly corresponding to the OBO-Basic subset of OBO Format. The edge model is pretty much identical to every directed graph model out there: a set of nodes and a set of directed labeled edges (more on what can be attached to the edges later). Here is an example of a subset of two connected classes from Uberon:

"nodes" : [
    {
      "id" : "UBERON:0002102",
      "lbl" : "forelimb"
    }, {
      "id" : "UBERON:0002101",
      "lbl" : "limb"
    }
  ],
  "edges" : [
    {
      "subj" : "UBERON:0002102",
      "pred" : "is_a",
      "obj" : "UBERON:0002101"
    }
  ]
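A structure like this needs very little code to work with. As a rough sketch (standard library only), the following wraps the fragment above in braces so it parses as a JSON object, builds an adjacency list, and walks up the is_a edges. The function name ancestors and the choice to follow only is_a by default are my own for this illustration, not part of any obographs API.

```python
import json

# The fragment from the post, wrapped in braces to make it a valid JSON object.
doc = json.loads("""
{
  "nodes": [
    {"id": "UBERON:0002102", "lbl": "forelimb"},
    {"id": "UBERON:0002101", "lbl": "limb"}
  ],
  "edges": [
    {"subj": "UBERON:0002102", "pred": "is_a", "obj": "UBERON:0002101"}
  ]
}
""")

labels = {n["id"]: n.get("lbl") for n in doc["nodes"]}

# Adjacency list: subject id -> list of (predicate, object id)
outgoing = {}
for e in doc["edges"]:
    outgoing.setdefault(e["subj"], []).append((e["pred"], e["obj"]))


def ancestors(node_id, preds=("is_a",)):
    """Return all nodes reachable from node_id over edges whose predicate is in preds."""
    seen = set()
    stack = [node_id]
    while stack:
        current = stack.pop()
        for pred, obj in outgoing.get(current, []):
            if pred in preds and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


for a in sorted(ancestors("UBERON:0002102")):
    print(a, labels[a])
```

Passing a longer tuple of predicates (for example adding a part-of relation) would give a traversal over the wider existential graph with no other changes.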

So what do I mean by existential graph? This is the graph formed by SubClassOf axioms that connect named classes to either named classes or simple existential restrictions. Here is the mapping (shown using the YAML serialization – if we exclude certain fields like dates then JSON is a straightforward subset, so we can use YAML for illustrative purposes):

Class: C
  SubClassOf: D

==>

edges:
 - subj: C
   pred: is_a
   obj: D

Class: C
  SubClassOf: P some D

==>

edges:
 - subj: C
   pred: P
   obj: D

These two constructs correspond to is_a and relationship tags in Obof. This is generally sufficient as far as logical axioms go for many applications. The assumption here is that these axioms are complete to form a non-redundant existential graph.

What about the other logical axiom and construct types in OWL? Crucially, rather than following the path of a direct RDF mapping and trying to cram all axiom types into a very abstract graph, we introduce new objects for increasingly exotic axiom types – supporting the 1% without making life difficult for the 99%. For example, AllValuesFrom expressions are allowed, but these don’t get placed in the main graph, as typically these do not get operated on in the same way in most applications.
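The split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the converter in the repository: the (sub, sup) tuple encoding of axioms is invented for the example, and the leftover axioms are simply set aside in a list, since where obographs actually stores them is not covered here.

```python
def to_edges(axioms):
    """Split SubClassOf axioms into existential-graph edges and everything else.

    Each axiom is a hypothetical (sub, sup) pair, where sup is either a class
    name (str) or a tuple such as ("some", P, D) or ("only", P, D).
    """
    edges, other = [], []
    for sub, sup in axioms:
        if isinstance(sup, str):
            # Class: C  SubClassOf: D
            edges.append({"subj": sub, "pred": "is_a", "obj": sup})
        elif sup[0] == "some" and isinstance(sup[2], str):
            # Class: C  SubClassOf: P some D
            edges.append({"subj": sub, "pred": sup[1], "obj": sup[2]})
        else:
            # Anything else (e.g. AllValuesFrom) stays out of the main graph.
            other.append((sub, sup))
    return edges, other


axioms = [
    ("C", "D"),
    ("C", ("some", "P", "D")),
    ("C", ("only", "P", "D")),
]
edges, other = to_edges(axioms)
print(edges)
print(other)
```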

What about non-logical axioms? We use an object called Meta to represent any set of OWL annotations associated with an edge, node or graph. Here is an example (again in YAML):

  - id: "http://purl.obolibrary.org/obo/GO_0044464"
    meta:
      definition:
        val: "Any constituent part of a cell, the basic structural and functional\
          \ unit of all organisms."
        xrefs:
        - "GOC:jl"
      subsets:
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goantislim_grouping"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gosubset_prok"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goslim_pir"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gocheck_do_not_annotate"
      xrefs:
      - val: "NIF_Subcellular:sao628508602"
      synonyms:
      - pred: "hasExactSynonym"
        val: "cellular subcomponent"
        xrefs:
        - "NIF_Subcellular:sao628508602"
      - pred: "hasRelatedSynonym"
        val: "protoplast"
        xrefs:
        - "GOC:mah"
    type: "CLASS"
    lbl: "cell part"
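Because labels and synonyms sit in predictable places, the "index of classes by lexical properties" use case from earlier becomes a short loop. Here is a sketch using a Python dict that mirrors part of the YAML above; the function name lexical_index, the lower-casing, and the default of indexing only exact synonyms are choices made for this example.

```python
node = {
    "id": "http://purl.obolibrary.org/obo/GO_0044464",
    "lbl": "cell part",
    "type": "CLASS",
    "meta": {
        "synonyms": [
            {"pred": "hasExactSynonym", "val": "cellular subcomponent",
             "xrefs": ["NIF_Subcellular:sao628508602"]},
            {"pred": "hasRelatedSynonym", "val": "protoplast",
             "xrefs": ["GOC:mah"]},
        ],
    },
}


def lexical_index(nodes, synonym_preds=("hasExactSynonym",)):
    """Map lower-cased labels and selected synonyms to sets of node ids."""
    index = {}
    for n in nodes:
        terms = []
        if n.get("lbl"):
            terms.append(n["lbl"])
        # meta and synonyms are optional, so default to empty containers.
        for syn in n.get("meta", {}).get("synonyms", []):
            if syn.get("pred") in synonym_preds:
                terms.append(syn["val"])
        for term in terms:
            index.setdefault(term.lower(), set()).add(n["id"])
    return index


index = lexical_index([node])
print(sorted(index))
```

Widening synonym_preds to include hasRelatedSynonym would add "protoplast" to the index as well.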

 

Meta objects can also be attached to edges (corresponding to OWL axiom annotations), or at the level of a graph (corresponding to ontology annotations). Oh, but we avoid the term annotation, as that always trips up people not coming from a deep semweb/OWL background.

As can be seen, commonly used OBO annotation properties get their own top-level tag within a meta object, but other annotations go into a generic object.

BOGs and ExOGs

What about the 1%? Additional fields can be used, turning the BOG into an ExOG (Expressive OBO graph).

Here is an example of a construct that is commonly used in OBOs, primarily used for the purposes of maintaining an ontology, but increasingly used for doing more advanced discovery-based inference:

Class: C
EquivalentTo: G1 and ... and Gn and (P1 some D1) and ... and (Pm some Dm)

Here all variables refer to named entities (C, Gi and Di are classes; Pi are object properties).

We translate to:

 nodes: ...
 edges: ...
 logicalDefinitionAxioms:
  - definedClassId: C
    genusIds: [G1, ..., Gn]
    restrictions:
    - propertyId: P1 
      fillerId: D1
    - ...
    - propertyId: Pm 
      fillerId: Dm

Note that the above transform is not expressive enough to capture all equivalence axioms. Again, the idea is to have a simple construct for the common case, and fall through to more generic constructs otherwise.
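To show that the flattened form keeps everything needed for the common case, here is a small sketch that turns one logicalDefinitionAxioms entry back into the Manchester-style text it came from. The field names are the ones shown above; the function render_logical_definition is hypothetical and only handles this genus-plus-restrictions pattern.

```python
def render_logical_definition(lda):
    """Render one logicalDefinitionAxioms entry as a Manchester-style string."""
    # Genus classes come first, followed by one "(P some D)" per restriction.
    parts = list(lda.get("genusIds", []))
    for r in lda.get("restrictions", []):
        parts.append("({} some {})".format(r["propertyId"], r["fillerId"]))
    return "Class: {}\nEquivalentTo: {}".format(
        lda["definedClassId"], " and ".join(parts)
    )


lda = {
    "definedClassId": "C",
    "genusIds": ["G1"],
    "restrictions": [
        {"propertyId": "P1", "fillerId": "D1"},
        {"propertyId": "P2", "fillerId": "D2"},
    ],
}
print(render_logical_definition(lda))
```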

Identifiers and URIs

Currently all the examples in the repo use complete URIs, but this is in progress. The idea is that the IDs commonly used in bioinformatics databases (e.g. GO:0008150) can be supported, but the mapping to URIs can be made formal and unambiguous through the use of an explicit JSON-LD context, and one or more default contexts. See the prefixcommons project for more on this. See also the prefixes section of the ROBOT docs.
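As a rough picture of what an explicit prefix mapping buys you, the sketch below expands and contracts identifiers against a small prefix table. The table is a hand-written stand-in for a JSON-LD context (only the OBO PURL pattern seen in the examples above is assumed), and expand and contract are illustrative helpers, not functions from prefixcommons or ROBOT.

```python
# Hypothetical prefix map in the style of a JSON-LD context.
PREFIXES = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}


def expand(curie, prefixes=PREFIXES):
    """Expand PREFIX:LOCAL to a URI; return the input unchanged if no prefix matches."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie


def contract(uri, prefixes=PREFIXES):
    """Contract a URI to PREFIX:LOCAL using the longest matching namespace."""
    best = None
    for prefix, ns in prefixes.items():
        if uri.startswith(ns) and (best is None or len(ns) > len(prefixes[best])):
            best = prefix
    if best is None:
        return uri
    return best + ":" + uri[len(prefixes[best]):]


print(expand("GO:0008150"))
print(contract("http://purl.obolibrary.org/obo/GO_0044464"))
```

With the mapping written down in one place, the same document can be read with either short IDs or full URIs without guessing.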

Documentation and formal specification

There is as yet no formal specification. We are still exploring possible shapes for the serialization. However, the documentation and examples provided should be sufficient for developers to grok things fairly quickly, and for OWL folks to get a sense of where we are going. Here are some things potentially useful for now:

Tools

The GitHub repo also houses a reference implementation in Java, plus an OWL to JSON converter script (the reverse is not yet implemented). The Java implementation can be used as an object model in its own right, but the main goal here is to make a serialization that is easy to use from any language.

Even without a dedicated API, operations are easy with most languages. For example, in Python, to create a mapping of ids to labels:

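import json

# Load an obographs JSON document (a top-level object with a "graphs" list)
with open('foo.json', 'r') as f:
    gdoc = json.load(f)

# Map each node id to its label (None if the node has no label)
lmap = {}
for g in gdoc['graphs']:
    for n in g['nodes']:
        lmap[n['id']] = n.get('lbl')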

Admittedly this particular operation is relatively easy with rdflib, but other operations become more awkward (not to mention the disappointingly slow performance of rdflib).

There are a number of applications that already accept obographs. The central graph representation (the BOG) corresponds to a bbop-graph. This is the existential graph representation we have been using internally in GO and Monarch. The SciGraph API sends back bbop-graph objects as default.

Some additional new pieces of software supporting obographs:

  • noctua-reasoner – a JavaScript reasoner supporting a subset of OWL-RL, intended for client-side reasoning in browsers
  • obographviz – generation of dot files (and pngs etc) from obographs, allowing many of the same customizations as blipkit

Status

At this stage I am interested in comments from a wider community, both in the bioinformatics world, and in the semweb world.

Hopefully the former will find it useful, and it will help wean people off of oboformat (to help this, ontology release tools like ROBOT and OWLTools already support obograph output or soon will, and we can include a JSON file for every OBO Library ontology as part of the central OBO build).

And hopefully the latter will not be offended too much by the need to add yet another format into the mix. It may even be useful to some parts of the OWL/semweb community outside bioinformatics.

 
