Debugging Ontologies using OWL Reasoning. Part 2: Unintentional Entailed Equivalence

This is part in a series on pragmatic techniques for debugging ontologies. This follows from part 1, which covered the basics of debugging using disjointness axioms using Protege and ROBOT.
In the previous part I outlined basic reasoner-based debugging using Protege and ROBOT. The goal was to detect and debug incoherent ontologies.

One potential problem that can arise is the inference of equivalence between two classes, where the equivalence is unintentional. The following example ontology from the previous post illustrates this:


ObjectProperty: part_of
Class: PNS
Class: Nerve SubClassOf: part_of some PNS
Class: PeripheralNerve EquivalentTo: Nerve and part_of some PNS

In this case PeripheralNerve and Nerve are entailed to be mutually equivalent. You can see this in Protege, as the two classes are grouped together with an equivalence symbol linking them:

Screen Shot 2018-09-03 at 5.19.47 PM

As the explanation shows, the two classes are equivalent because (1) PNs are defined as any nerve in the PNS, and (2) nerve is asserted to be in the PNS.

We assume here that this is not the intent of the ontology developer; we assume they created distinct classes with distinct names as they believe them to be distinct. (Note that some ontologies such as SWEET employ equivalence axioms to denote two distinct terms that mean the same thing, but for this article we assume OBO-style ontology development).

When the ontology developer sees inferences like this, they will likely want to take some corrective action:

  • Under one scenario, the inference reveals to the ontology developer that in fact nerve and peripheral nerve are the same concept, and thus the two classes should be merged, with the label from one being retained as the synonym of the other.
  • Under the other scenario, the ontology developer realizes the concept they have been calling ‘Nerve’ encompasses more general neuron projection bundles found in the CNS; here they may decide to rename the concept (e.g. neuron projection bundle) and to eliminate or broaden the part_of axiom.

So far so good. But the challenge here is that an ontology with entailed equivalencies between pairs of classes is formally coherent: all classes are satisfiable, and there are no inconsistencies. It will not be caught by a pipeline that detects incoherencies such as unsatisfiable classes. This means you may end up accidentally releasing an ontology that has potentially serious biological problems. It also means we can’t use the same technique described in part 1 to make a debug module.

Formally we can state this as there being no unique class assumption in OWL. By creating two classes, c1 and c2, you are not saying that there is something that differentiates these, even if it is your intention that they are different.

Within the OBO ecosystem we generally strive to avoid equivalent named classes (principle of orthogonality). There are known cases where equivalent classes join two ontologies (for example, GO cell and CL cell), in general when we find additional entailed pairs of equivalent classes not originally asserted, it’s a problem. I would hypothesize this is frequently true of non-OBO ontologies too.

Detecting unintended equivalencies with ROBOT

For the reasons stated above, ROBOT has configurable behavior for when it encounters equivalent classes. This can be controlled via the –equivalent-classes-allowed (shorthand: “-e”) option on the reason command. There are 3 options here:

  • none: any entailed equivalence axiom between two named classes will result in an error being thrown
  • all: permit all equivalence axioms, entailed or asserted
  • asserted-only: permit entailed equivalence axioms only if they match an asserted equivalence axiom, otherwise throw an error

If you are unsure of what to do it’s always a good idea to start stringent and pass ‘none’. If it turns out you need to maintain asserted equivalencies (for example, the GO/CL ‘cell’ case), then you can switch to ‘asserted-only’.

The ‘all’ option is generally too permissive for most OBO ontologies. However, for some use cases this may be selected. For example, if your ontology imports multiple non-orthogonal ontologies plus bridging axioms and you are using reasoning to find new equivalence mappings.

For example, on our peripheral nerve ontology, if we run

robot reason -e asserted-only -r elk -i pn.omn

We will get:


ERROR org.obolibrary.robot.ReasonOperation - Only equivalent classes that have been asserted are allowed. Inferred equivalencies are forbidden.
ERROR org.obolibrary.robot.ReasonOperation - Equivalence: <http://example.org/Nerve&gt; == <http://example.org/PeripheralNerve&gt;

ROBOT will also exit with a non-zero exist code, ensuring that your release pipeline fails fast, preventing accidental release of broken ontologies.

Debugging false equivalence

This satisfies the requirement that potentially false equivalence can be detected, but how does the ontology developer debug this?

A typical Standard Operating Procedure might be:

  • IF robot fails with unsatisfiable classes
    • Open ontology in Protege and switch on Elk
    • Go to Inferred Classification
    • Navigate to Nothing
    • For each class under Nothing
      • Select the “?” to get explanations
  • IF robot fails with equivalence class pairs
    • Open ontology in Protege and switch on Elk
    • For each class reported by ROBOT
      • Navigate to class
      • Observe the inferred equivalence axiom (in yellow) and select ?

There are two problems with this SOP, one pragmatic and the other a matter of taste.

The pragmatic issue is that there is a Protege explanation workbench bug that sometimes renders Protege unable to show explanations for equivalence axioms in reasoners such as Elk (see this ticket). This is fairly serious for large ontologies (although for our simple example or for midsize ontologies use of HermiT may be perfectly feasible).

But even in the case where this bug is fixed or circumvented, the SOP above is suboptimal in my opinion. One reason is that it is simply more complicated: in contrast to the SOP for dealing with incoherent classes, it’s necessary to look at reports coming from outside Protege, perform additional seach and lookup. The more fundamental reason is the fact that the ontology is formally coherent even though it is defying my expectations to follow the unique class assumption. It is more elegant if we can directly encode my unique class assumption, and have the ontology be entailed to be incoherent when this is violated. That way we don’t have to bolt on additional SOP instructions or additional ad-hoc programmatic operations.

And crucially, it means the same ‘logic core dump’ operation described in the previous post can be used in exactly the same way.

Approach: SubClassOf means ProperSubClassOf

My approach here is to make explicit the assumption: every time an ontology developer asserts a SubClassOf axiom, they actually mean ProperSubClassOf.

To see exactly what this means, it helps to think in terms of Venn diagrams (Venn diagrams are my go-to strategy for explaining even the basics of OWL semantics). The OWL2 direct semantics are set-theoretic, with every class interpreted as a set, so this is a valid approach. When drawing Venn diagrams, sets are circles, and one circle being enclosed by another denotes subsetting. If circles overlap, this indicates set overlap, and if no overlap is shown the sets are assumed disjoint (have no members in common).

Let’s look at what happens when an ontology developer makes a SubClassOf link between PN and N. They may believe they are saying something like this:

Screen Shot 2018-09-03 at 5.12.16 PM

i.e. implicitly indicating that there are some nerves that are not peripheral nerves.

But in fact the OWL SubClassOf operator is interpreted set-theoretically as subset-or-equal-to (i.e. ) which can be visually depicted as:

Screen Shot 2018-09-03 at 5.13.03 PM

In this case our ontology developer wants to exclude the latter as a possibility (even if we end up with a model in which these two are equivalent, the ontology developer needs to arise at this conclusion by having the incoherencies in their own internal model revealed).

To make this explicit, there needs to be an additional class declared that (1) is disjoint from PN and (2) is a subtype of Nerve. We can think of this as a ProperSubClassOf axiom, which can be depicted visually as:

Screen Shot 2018-09-03 at 5.13.44 PM

If we encode this on our test ontology:


ObjectProperty: part_of
Class: PNS
Class: Nerve SubClassOf: part_of some PNS
Class: PeripheralNerve EquivalentTo: Nerve and part_of some PNS
Class: OtherNerve SubClassOf: Nerve DisjointWith: PeripheralNerve
Class: OtherNerve SubClassOf: Nerve DisjointWith: PeripheralNerve

We can see that the ontology is inferred to be incoherent. There is no need for an additional post-hoc check: the generic incoherence detection mechanism of ROBOT does not need any special behavior, and the ontology editor sees all problematic classes in red, and can navigate to all problems by looking under owl:Nothing:

Screen Shot 2018-09-03 at 5.14.43 PM

Of course, we don’t want to manually assert this all the time, and litter our ontology with dreaded “OtherFoo” classes. If we can make the assumption that all asserted SubClassOfs are intended to be ProperSubClassOfs, then we can just do this procedurally as part of the ontology validation pipeline.

One way to do this is to inject a sibling for every class-parent pair and assert that the siblings are disjoint.

The following SPARQL will generate the disjoint siblings (if you don’t know SPARQL don’t worry, this can all be hidden for you):


prefix xsd: <http://www.w3.org/2001/XMLSchema#&gt;
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#&gt;
prefix owl: <http://www.w3.org/2002/07/owl#&gt;
CONSTRUCT {
?sibClass a owl:Class ;
owl:disjointWith ?c ;
rdfs:subClassOf ?p ;
rdfs:label ?sibLabel
}
WHERE {
?c rdfs:subClassOf ?p .
FILTER(isIRI(?c))
FILTER(isIRI(?p))
FILTER NOT EXISTS { ?c owl:deprecated "true"^^xsd:boolean }
OPTIONAL {
?c rdfs:label ?clabel
BIND(concat("DISJOINT-SIB-OF ", ?clabel) AS ?sibLabel)
}
BIND (UUID() as ?sibClass)
}

Note that we exclude deprecated/obsolete classes. The generated disjoint siblings are given a random UUID, and the label DISJOINT-SIB-OF X. You could also opt for the simpler “Other X” as in the above example, it doesn’t matter, only the ontology developer sees this, and only when debugging.

This can be encoded in a workflow, such that the axioms are injected as part of a test procedure. You likely do not want these axioms to leak out into the release version and confuse people.

Future versions of ROBOT may include a convenience function for doing this, but fow now you can do this in your Makefile:


SRC = pn.omn
disjoint_sibs.owl: $(SRC)
robot relax -i $< query --format ttl -c construct-disjoint-siblings.sparql $@
test.owl: $(SRC) disjoint_sibs.owl
robot merge -i $< -i disjoint_sibs.owl -o $@

Advertisements

New version of Ontology Development Kit – now with Docker support

This is an update to a previous post, creating an ontology project.

Version 1.1.2 of the ODK is available on GitHub.

The Ontology Development Kit (ODK; formerly ontology-starter-kit) provides a way of creating an ontology project ready for pushing to GitHub, with a number of features in place:

  • A Makefile that specifies your release workflow, including building imports, creating reports and running tests
  • Continuous integration: A .travis.yml file that configures Travis-CI to check any Pull Requests using ROBOT
  • A standard directory layout that makes working with different projects easier and more predictable
  • Standardized documentation and additional file artifacts
  • A procedure for releasing your ontologies using the GitHub release mechanism

The overall aim is to borrow as much from modern software engineering practice as possible and apply to the ontology development lifecycle.

The basic idea is fairly simple: a template folder contains a canonical repository layout, this is copied into a target area, with template variables substituted for user-supplied ones.

Some recent improvements include:

I will focus here on the adoption of Docker within the ODK. Most users of the ODK don’t really need to know much about Docker – just that they have to install it, and it runs their ontology workflow inside a container. This has multiple advantages – ontology developers don’t need to install a suite of semi-independent tools, and execution of workflows becomes more predictable and easier to debug, since the environment is standardized. I will provide a bit more detail here for people who are interested.

What is Docker?

From Wikipedia: Docker is a program that performs operating-system-level virtualization also known as containerizationDocker can run containers on your machine, where each container bundles its own tools and environments.
Docker architecture

Docker containers: from Docker 101

A common use case for Docker is deploying services. In this particular case we’re not deploying a service but are instead using Docker as a means of providing and controlling a standard environment with various command line tools.

The ODK Docker container

The ODK docker container, known as odkfull is available from obolibrary organization on Dockerhub. It comes bundled with a number of tools, as of the latest release:

  • A standard unix environment, including GNU Make
  • ROBOT v1.1.0  (Java)
  • Dead Simple OWL Design Patterns (DOSDP) tools v0.9.0 (Scala)
  • Associated python tooling for DOSDPs (Python3)
  • OWLTools (for older workflows) (Java)
  • The odk seed script (perl)

There are a few different scenarios in which an odkfull container is executed

  • As a one-time run when setting up a repository using seed-via-docker.sh (which wraps a script that does the actual work)
  • After initial setup and pushing to GitHub, ontology developers may wish to execute parts of the workflow locally – for example, extracting an import module after adding new seeds for external ontology classes
  • Travis-CI uses the same container used by ontology developers
  • Embedding within a larger pipeline

Setting up a repo

Typing

./seed-via-docker.sh

Will initiate the process of making a new repo, depositing the results in the target/ folder. This is all done within a container. The seed process will generate a workflow in the form of a Makefile, and then run that workflow, all in the container. The final step of pushing the repo to GitHub is currently done by the user directly in their own environment, rather than from within the container.

Running parts of the workflow

Note that repos built from the odk will include a one-line script in the src/ontology folder* called “run.sh”. This is a simple wrapper for running the docker container. (if you built your repo from an earlier odk, then you can simply copy this script).

Now, instead of typing

make test

The ontology developer can now type

./run.sh make test

The former requires the user has all the relevant tooling installed (which at least requires Xcode on OS-X, which not all curators have). The latter will only require Docker.

Travis execution

Note that the .travis.yml file generated will be configured to run the travis job in an odkfull container. If you generated your repo using an earlier odk, you can manually adapt your existing travis file.

Is this the right approach?

Docker may seem like quite heavyweight for something like running an ontology pipeline. Before deciding on this path, we did some tests on some volunteers in GO who were not familiar with Docker. These editors had a need to rebuild import files frequently, and having everyone install their own tools has not worked out so well in the past. Preliminary results seem to indicate the editors are happy with this approach.

It may be the case that in future more can be triggered directly from within Protege. Or some ontology environments such as Tawny-OWL are powerful enough to do everything from one tool chain.  But for now the reality is that many ontology workflows require a fairly heterogeneous set of tools to operate, and there is frequently a need to use multiple languages, which complicates the install environment. Docker provides a nice way to unify this.

We’ll put this into practice at ICBO this week, in the Phenotype Ontology and OBO workshops.

Acknowledgments

Thanks to the many developers and testers: David Osumi-Sutherland, Nico Matentzoglu, Jim Balhoff, Eric Douglass, Marie-Angelique Laporte, Rebecca Tauber, James Overton, Nicole Vasilevsky, Pier Luigi Buttigieg, Kim Rutherford, Sofia Robb, Damion Dooley, Citlalli Mejía Almonte, Melissa Haendel, David Hill, Matthew Lange.

More help and testers wanted! See: https://github.com/INCATools/ontology-development-kit/issues

Footnotes

* Footnote: the Makefile has always lived in the src/ontology folder, but the build process requires the whole repo, so the run.sh wrapper maps two levels up. It looks a little odd, but it works. In future if there is demand we may switch the Makefile to being in the root folder.

Debugging Ontologies using OWL Reasoning. Part 1: Basics and Disjoint Classes axioms

This is the first part in a series on pragmatic techniques for debugging ontologies.

All software developers are familiar with the concept of debugging, a process for finding faults in a program. The term ‘bug’ has been used in engineering since the 19th century, and was used by Grace Hopper to describe a literal bug gumming up the works of the Mark II computer. Since then, debugging and debugging tools have become ubiquitous in computing, and the modern software developer is fortunate enough to have a large array of tools and techniques at their disposal. These include unit tests, assertions and interactive debuggers.

original bug
The original bug

Ontology development has many parallels with software development, so it’s reasonable to assume that debugging techniques from software can be carried over to ontologies. I’ve previously written about use of continuous integration in ontology development, and it is now standard to use Travis to check pull requests on ontologies. Of course, there are important differences between software and ontology development. Unlike typical computer programs, ontologies are not executed, so the concept of an interactive debugger stepping through an execution sequence doesn’t quite translate to ontologies. However, there are still a wealth of tooling options for ontology developers, many of which are under-used.

There is a great deal of excellent academic material on the topic of ontology debugging; see for example the 2013 and 2014 proceedings of the excellently named Workshop on Debugging Ontologies and Ontology Mappings (WoDOOM), or the seminal Debugging OWL Ontologies. However, many ontology developers may not be aware of some of the more basic ‘blue collar’ techniques in use for ontology debugging.

Using OWL Reasoning and disjointness axioms to debug ontologies

In my own experience one of the most effective means of finding problems in ontologies is through the use of OWL reasoning. Reasoning is frequently used for automated classification, and this is supported in tools such as ROBOT through the reason command. In addition to classification, reasoning can also be used to debug an ontology, usually by inferring if the ontology is incoherent. The term ‘incoherent’ isn’t a value judgment here; it’s a technical term for an ontology that is either inconsistent or contains unsatisfiable classes, as described in this article by Robert Stevens, Uli Sattler and Phillip Lord.

A reasoner will not find bugs without some help from you, the ontology developer.

Screen Shot 2018-08-02 at 5.22.05 PM

You have to impart some of your own knowledge of the domain into the ontology in order for incoherency to be detected. This is usually done by adding axioms that constrain the space of what is possible. The ontogenesis article has a nice example using red blood cells and the ‘only’ construct. I will give another example using the DisjointClasses axiom type; in my experience, working on large inter-related ontologies disjointness axioms are one of the most effective ways of finding bugs (and has the added advantage of being within the profile of OWL understood by Elk).

Let’s take the following example, a slice of an anatomical ontology dealing with cranial nerves. The underlying challenge here is the fact that the second cranial nerve (the optic nerve) is not in fact a nerve as it is part of the central nervous system (CNS), whereas true nerves as part of the peripheral nervous system (PNS). This seeming inconsistency has plagued different anatomy ontologies.

Ontology: <http://example.org>
Prefix: : <http://example.org/>
ObjectProperty: part_of
Class: CNS
Class: PNS
Class: StructureOfPNS EquivalentTo: part_of some PNS
Class: StructureOfCNS EquivalentTo: part_of some CNS
DisjointClasses: StructureOfPNS, StructureOfCNS
Class: Nerve SubClassOf: part_of some PNS
Class: CranialNerve SubClassOf: Nerve
Class: CranialNerveII SubClassOf: CranialNerve, part_of some CNS

cns-pns-disjoint

You may have noted this example uses slightly artificial classes of the form “Structure of X”. These are not strictly necessary, we’ll return to this when we discuss Generic Class Inclusion (GCI) axioms in a future part.

If we load this into Protege, switch on the reasoner, we will see that CranialNerveII shows up red, indicating it is unsatisfiable, rendering the ontology incoherent. We can easily find all unsatisfiable classes under the ‘Nothing’ builtin class on the inferred hierarchy view. Clicking on the ‘?’ button will make Protege show an explanation, such as the following:

Screen Shot 2018-08-02 at 5.28.59 PM

This shows all the axioms that lead to the conclusion that CranialNerveII is unsatisfiable. At least one of these axioms must be wrong (for example, the assumption that all cranial nerves are nerves may be terminologically justified, but could be wrong here; or perhaps it is the assumption that CN II is actually a cranial nerve; or we may simply want to relax the constraint and allow spatial overlap between peripheral and central nervous system parts). The ontology developer can then set about fixing the ontology until it is coherent.

Detecting incoherencies as part of a workflow

Protege provides a nice way of finding ontology incoherencies, and of debugging them by examining explanations. However, it is still possible to accidentally release an incoherent ontology, since the ontology editor is not compelled to check for unsatisfiabilities in Protege prior to saving. It may even be possible for an incoherency to be inadvertently introduced through changes to an upstream dependency, for example, by rebuilding an import module.

Luckily, if you are using ROBOT to manage your release process, then it should be all but impossible for you to accidentally release an incoherent ontology. This is because the robot reason command will throw an error if the ontology is incoherent. If you are using robot as part of a Makefile-based workflow (as configured by the ontology starter kit) then this will block progression to the next step, as ROBOT returns with a non-zero exit code when performing a reasoner operation on an incoherent ontology. Similarly, if you are using Travis-CI to vet pull requests or check the current ontology state, then the travis build will automatically fail if an incoherency is encountered.

robot-workflow

ROBOT reason flow diagram. Exiting with system code 0 indicates success, non-zero failure.

Running robot reason on our example ontology yields:

$ robot reason -r ELK -i cranial.omn
ERROR org.obolibrary.robot.ReasonerHelper - There are 1 unsatisfiable classes in the ontology.
ERROR org.obolibrary.robot.ReasonerHelper -     unsatisfiable: http://example.org/CranialNerveII

Generating debug modules – incoherent SLME

Large ontologies can strain the limits of the laptop computers usually used to develop ontologies. It can be useful to make something analogous to a ‘core dump’ in software debugging — a standalone minimal component that can be used to reproduce the bug. This is a module extract (using a normal technique like SLME) seeded by all unsatisfiable classes (there may be multiple). This provides sufficient axioms to generate all explanations, plus additional context.

I use the term ‘unsatisfiable module’ for this artefact. This can be done using the robot reason command with the “–debug-unsatisfiable” option.

In our Makefiles we often have a target like this:

debug.owl: my-ont.owl
        robot reason -i  $< -r ELK -D $@

If the ontology is incoherent then “make debug.owl” will make a small-ish standalone file that can be easily shared and quickly loaded in Protege for debugging. The ontology will be self-contained with no imports – however, if the axioms come from different ontologies in an import chain, then each axiom will be annotated with the source ontology, making it easier for you to track down the problematic import. This can be very useful for large ontologies with multiple dependencies, where there may be different versions of the same ontology in different import chains. 

Coming up

The next article will deal with the case of detecting unwanted equivalence axioms in ontologies, and future articles in the series will deal with practical tips on how best to use disjointness axioms and other constraints in your ontologies.

Further Reading

Acknowledgments

Thanks to Nico Matentzoglu for comments on a draft of this post.

A developer-friendly JSON exchange format for ontologies

OWL2 ontologies can be rendered using a number of alternate concrete forms / syntaxes:

  • Manchester Syntax
  • Functional Syntax
  • OWL-XML
  • RDF/XML
  • RDF/Turtle

All of the above are official W3 recommendations. If you aren’t that familiar with these formats and the differences between them, the W3 OWL Primer is an excellent starting point. While all of the above are semantically equivalent ways to serialize OWL (with the exception of Manchester, which cannot represent some axiom types), there are big pragmatic differences in the choice of serialization. For most developers, the most important differentiating factor is  support for their language of choice.

Currently, the only language I am aware of with complete support for all serializations is java, in the form of the OWLAPI. This means that most heavy-duty ontology applications use java or a JVM language (see previous posts for some examples of JVM frameworks that wrap the OWLAPI).

Almost all programming languages have support for RDF parsers, which is one reason why the default serialization for OWL is usually an RDF one. In theory it makes it more accessible. However, RDF can be a very low level way to process ontologies. For certain kinds of operations, such as traversing a simple subClassOf hierarchy, it can be perfectly fine. However, even commonly encountered constructs such as “X SubClassOf part-of some Y” are very awkward to handle, involving blank nodes (see the translation here). When it comes to something like axiom annotations (common in OBO ontologies), things quickly get cumbersome. It must be said though that using an RDF parser is always better than processing an RDF/XML file using an XML parser. This is two levels of abstraction too low, never do this! You will go to OWL hell. At least you will not be in the lowest circle – this is reserver for people who parse RDF/XML using an ad-hoc perl regexp parser.

Even in JVM languages, an OWL-level abstraction can be less than ideal for some of the operations people want to do on a biological ontology. These operations include:

  • construct and traverse a graph constructed from SubClassOf axioms between either pairs of named classes, or named-class to existential restriction pairs
  • create an index of classes based on a subset of lexical properties, such as labels and synonyms
  • Generate a simple term info page for showing in a web application, with common fields like definition prominently shown, with full attribution for all axioms
  • Extract some subset of the ontology

It can be quite involved doing even these simple operations using the OWLAPI. This is not to criticize the OWLAPI – it is an API for OWL, and OWL is in large part a syntax for writing set-theoretic expressions constraining a world of models. This is a bit of a cognitive mismatch for a hierarchy of lexical objects, or a graph-based organization of concepts, which is the standard  abstraction for ontologies in Bioinformatics.

There are some libraries that provide useful convenience abstractions – this was one of the goals of OWLTools, as well as The Brain. I usually recommend a library such as one of these for bioinformaticians wishing to process OWL files, but it’s not ideal for everyone. It introduces yet another layer, and still leaves out non-JVM users.

For cases where we want to query over ontologies already loaded in a database or registry, there are some good abstraction layers – SciGraph provides a bioinformatician-friendly graph level / Neo4J view over OWL ontologies. However, sometimes it’s still necessary to have a library to parse an ontology fresh off the filesystem with no need to start up a webservice and load in an ontology.

What about OBO format?

Of course, many bioinformaticians are blissfully unaware of OWL and just go straight to OBO format, a format devised by and originally for the Gene Ontology. And many of these bioinformaticians seem reasonably content to continue using this – or at least lack the activation energy to switch to OWL (despite plenty of encouragement).

One of the original criticisms of Obof was it’s lack of formalism, but now Obof has a defined mapping to OWL, and that mapping is implemented in the OWLAPI. Protege can load and save Obof just as if it were any other OWL serialization, which it effectively is (without the W3C blessing). It can only represent a subset of OWL, but that subset is actually a superset of what most consumers need. So what’s the problem in just having Obof as the bioinformaticians format, and ontologists using OWL for heavy duty ontology lifting?

There are a few:

  • It’s ridiculously easy to create a hacky parser for some subset of Obof, but it’s surprisingly hard to get it right. Many of the parsers I have seen are implemented based on the usual bioinformatics paradigm of ignoring the specs and inferring a format based on a few examples. These have a tendency to proliferate, as it’s easier to write your own that deal with figuring out of someone else’s fits yours. Even with the better ones, there are always edge cases that don’t conform to expectations. We often end up having to normalize Obof output in certain ways to avoid breaking crappy parsers.
  • The requirement to support Obof leads to cases of tails wagging the dog, whereby ontology producers will make some compromise to avoid alienating a certain subset of users
  • Obof will always support the same subset of OWL. This is probably more than what most people need, but there are frequently situations where it would be useful to have support for one extra feature – perhaps blank nodes to support one level of nesting an an expression.
  • The spec is surprisingly complicated for what was intended to be a simple format. This can lead to traps.
  • The mapping between CURIE-like IDs and semantic web URIs is awkwardly specified and leads to no end of confusion when the semantic web world wants to talk to the bio-database world. Really we should have reused something like JSON-LD contexts up front. We live and learn.
  • Really, there should be no need to write a syntax-level parser. Developers expect something layered on XML or JSON these days (more so the latter).

What about JSON-LD?

A few years ago I asked on the public-owl-dev list if there were a standard JSON serialization for OWL. This generated some interesting discussion, including a suggestion to use JSON-LD.

I still think that this is the wrong level of abstraction for many OWL ontologies. JSON-LD is great and we use it for many instance-level representations but as it suffers from the same issues that all RDF layerings of OWL face: they are too low level for certain kinds of OWL axioms. Also, JSON-LD is a bit too open-ended for some developers, as graph labels are mapped directly to JSON keys, making it hard to map.

Another suggestion on the list was to use a relatively straightforward mapping of something like functional/abstract syntax to JSON. This is a great idea and works well if you want to implement something akin to the OWL API for non-JVM languages. I still think that such a format is important for increasing uptake of OWL, and hope to see this standardized.

However, we’re still back at the basic bioinformatics use case, where an OWL-level abstraction doesn’t make so much sense. Even if we get an OWL-JSON, I think there is still a need for an “OBO-JSON”, a JSON that can represent OWL constructs, but with a mapping to structures that correspond more closely to the kinds of operations like traversing a TBox-graph that are common in life sciences applications.

A JSON graph-oriented model for ontologies

After kicking this back and forth for a while we have a proposal for a graph-oriented JSON model for OWL, tentatively called obographs. It’s available at https://github.com/geneontology/obographs

The repository contains the start of documentation on the structural model (which can be serialized as JSON or YAML), plus java code to translate an OWL ontology to obograph JSON or YAML.

Comments are more than welcome, here or in the tracker. But first some words concerning the motivation here.

The overall goals was to make it easy to do the 99% of things that bioinformatics developers usually do, but without throwing the 1% under the bus. Although it is not yet a complete representation of OWL, the initial design is allowed to extend things in this direction.

One consequence of this is that the central object is an existential graph (I’ll get to that term in a second). We call this subset Basic OBO Graphs, or BOGs, roughly corresponding to the OBO-Basic subset of OBO Format. The edge model is pretty much identical to every directed graph model out there: a set of nodes and a set of directed labeled edges (more on what can be attached to the edges later). Here is an example of a subset of two connected classes from Uberon:

"nodes" : [
    {
      "id" : "UBERON:0002102",
      "lbl" : "forelimb"
    }, {
      "id" : "UBERON:0002101",
      "lbl" : "limb"
    }
  ],
  "edges" : [
    {
      "subj" : "UBERON:0002102",
      "pred" : "is_a",
      "obj" : "UBERON:0002101"
    }
  ]

So what do I mean by existential graph? This is the graph formed by SubClassOf axioms that connect named classes to either names class or simple existential restrictions. Here is the mapping (shown using the YAML serialization – if we exclude certain fields like dates then JSON is a straightforward subset, so we can use YAML for illustrative purposes):

Class: C
  SubClassOf: D

==>

edges:
 - subj: C
   pred: is_a
   obj: D
Class: C
  SubClassOf: P some D

==>

edges:
 - subj: C
   pred: P
   obj: D

These two constructs correspond to is_a and relationship tags in Obof. This is generally sufficient as far as logical axioms go for many applications. The assumption here is that these axioms are complete to form a non-redundant existential graph.

What about the other logical axiom and construct types in OWL? Crucially, rather than following the path of a direct RDF mapping and trying to cram all axiom types into a very abstract graph, we introduce new objects for increasingly exotic axiom types – supporting the 1% without making life difficult for the 99%. For example, AllValuesFrom expressions are allowed, but these don’t get placed in the main graph, as typically these do not getoperated on in the same way in most applications.

What about non-logical axioms? We use an object called Meta to represent any set of OWL annotations associated with an edge, node or graph. Here is an example (again in YAML):

  - id: "http://purl.obolibrary.org/obo/GO_0044464"
    meta:
      definition:
        val: "Any constituent part of a cell, the basic structural and functional\
          \ unit of all organisms."
        xrefs:
        - "GOC:jl"
      subsets:
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goantislim_grouping"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gosubset_prok"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goslim_pir"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gocheck_do_not_annotate"
      xrefs:
      - val: "NIF_Subcellular:sao628508602"
      synonyms:
      - pred: "hasExactSynonym"
        val: "cellular subcomponent"
        xrefs:
        - "NIF_Subcellular:sao628508602"
      - pred: "hasRelatedSynonym"
        val: "protoplast"
        xrefs:
        - "GOC:mah"
    type: "CLASS"
    lbl: "cell part"

 

Meta objects can also be attached to edges (corresponding to OWL axiom annotations), or at the level of a graph (corresponding to ontology annotations). Oh, but we avoid the term annotation, as that always trips up people not coming from a deep semweb/OWL background.

As can be seen commonly used OBO annotation properties get their own top level tag within a meta object, but other annotations go into a generic object.

BOGs and ExOGs

What about the 1%? Additional fields can be used, turning the BOG into an ExOG (Expressive OBO graph).

Here is an example of a construct that is commonly used in OBOs, primarily used for the purposes of maintaining an ontology, but increasingly used for doing more advanced discovery-based inference:

Class: C
EquivalentTo: G1 and ... and Gn and (P1 some D1) and ... and (Pm some Dm)

Where all variables refer to named entities (C, Gi and Di are classes, Pi are Object Properties)

We translate to:

 nodes: ...
 edges: ...
 logicalDefinitionAxioms:
  - definedClassId: C
    genusIds: [G1, ..., Gn]
    restrictions:
    - propertyId: P1 
      fillerId: D1
    - ...
    - propertyId: Pm 
      fillerId: Dm

Note that the above transform is not expressive enough to capture all equivalence axioms. Again the idea is to have a simple construct for the common case, and fall-through to more generic constructs.

Identifiers and URIs

Currently all the examples in the repo use complete URIs, but this in progress. The idea is that the IDs commonly used in bioinformatics databases (e.g GO:0008150) can be supported, but the mapping to URIs can be made formal and unambiguous through the use of an explicit JSON-LD context, and one or more default contexts. See the prefixcommons project for more on this. See also the prefixes section of the ROBOT docs.

Documentation and formal specification

There is as yet no formal specification. We are still exploring possible shapes for the serialization. However, the documentation and examples provided should be sufficient for developers to grok things fairly quickly, and for OWL folks to get a sense of where we are going. Here are some things potentially useful for now:

Tools

The GitHub repo also houses a reference implementation in Java, plus an OWL to JSON converter script (reverse is not yet implemented). The java implementation can be used as an object model in its own right, but the main goal here is to make a serialization that is easy to use from any language.

Even without a dedicated API, operations are easy with most languages. For example, in python to create a mapping of ids to labels:

f = open('foo.json', 'r') 
obj = json.load(f)

lmap = {}
for g in gdoc.graphs:
  for n in g.nodes:
    lmap[n.id] = n.lbl

Admittedly this particular operation is relatively easy with rdflib, but other operations become more awkward (and not to mention the disappointing slow performance of rdflib).

There are a number of applications that already accept obographs. The central graph representation (the BOG) corresponds to a bbop-graph. This is the existential graph representation we have been using internally in GO and Monarch. The SciGraph API sends back bbop-graph objects as default.

Some additional new pieces of software supporting obographs:

  • noctua-reasoner – a javascript reasoner supporting a subset of OWL-RL, intended for client-side reasoning in browsers
  • obographviz – generation of dot files (and pngs etc) from obographs, allowing many of the same customizations as blipkit

Status

At this stage I am interested in comments from a wider community, both in the bioinformatics world, and in the semweb world.

Hopefully the former will find it useful, and will help wean people off of oboformat (to help this, ontology release tools like ROBOT and OWLTools already or will soon support obograph output, and we can include a json file for every OBO Library ontology as part of the central OBO build).

And hopefully the latter will not be offended too much by the need to add yet another format into the mix. It may even be useful to some parts of the OWL/semweb community outside bioinformatics.

 

Creating an ontology project, an update

  • In a previous post, I recommended some standard ways of managing the various portions of an ontology project using a version control system like GitHub.

Since writing that post, I’ve written a new utility that makes this task even easier. With the ontology-starter-kit you can generate all your project files and get set up for creating your first release in minutes. This script takes into account some changes since the original post two years ago:

  • Travis-CI has become the de-facto standard continuous integration system for performing unit tests on any project managed in GitHub (for more on CI see this post). The starter-kit will give you a default travis setup.
  • Managing your metadata and PURLs on the OBO Library has changed to a GitHub-based system:
  • ROBOT has emerged as a simpler way of managing many aspects of a release process, particularly managing your external imports

Getting started

To get started, clone or download cmungall/ontology-starter-kit

Currently, you will need:

  • perl
  • make
  • git (command line client)

For best results, you should also download owltools, oort and robot (in the future we’ll have a more unified system)

You can obtain all these by running the install script:

./INSTALL.sh

This should be run from within the ontology-starter-kit directory

Then, from within that directory, you can seed your ontology:

./seed-my-ontology-repo.pl  -d ro -d uberon -u obophenotype -t cnidaria-ontology cnido

 

This assumes that you are building some kind of extension to uberon, using the relation ontology (OBO Library ontology IDs must be used here), that you will be placing this in the https://github.com/obophenotype/ organization  and that the repo name in obophenotype/cnidaria-ontology, and that IDs will be of the form CNIDA:nnnnnnn

After running, the repository will be created in the target/cnidaria-ontology folder, relative to where you are. You can move this out to somewhere more convenient.

The script is chatty, and it informs of you how it is copying the template files from the template directory into the target directory. It will create your initial source setup, including a makefile, and then it will use that makefile to create an initial release, going so far as to init the git repo, add and commit files (unless overridden). It will not go as far as to create a repo for you on github, but it provides explicit instructions on what you should do next:


EXECUTING: git status
# On branch master
nothing to commit, working directory clean
NEXT STEPS:
0. Examine target/cnidaria-ontology and check it meets your expectations. If not blow it away and start again
1. Go to: https://github.com/new
2. The owner MUST be obophenotype. The Repository name MUST be cnidaria-ontology
3. Do not initialize with a README (you already have one)
4. Click Create
5. See the section under '…or push an existing repository from the command line'
E.g.:
cd target/cnidaria-ontology
git remote add origin git@github.com:obophenotype/cnido.git
git push -u origin master

Note also that it also generates a metadata directory for you, with .md and .yml files you can use for your project on obolibrary (of course, you need to request your ontology ID space first, but you can go ahead and make a pull request with these files).

Future development

The overall system may no longer be necessary in the future, if we get a complete turnkey ontology release system with capabilities similar to analogous tools in software development such as maven.

For now, the Makefile approach is most flexible, and is widely understood by many software developers, but a long standing obstacle has been the difficulty in setting up the Makefile for a new project. The starter kit provides a band-aid here.

If required, it should be possible to set up alternate templates for different styles of project layouts. Pull requests on the starter-kit repository are welcome!

 

 

owljs – a javascript library for OWL hacking

owljs ia a javascript library for doing stuff with OWL. It’s available from github:

https://github.com/cmungall/owljs

Whilst it attempts to following CommonJS, you currently have to use RingoJS  (a Rhino engine) as it makes use of JVM calls to the OWLAPI

owl plus rhino equals fun

 

Why javascript?

Why javascript you may ask? Isn’t that a hacky language run in browsers? In fact javascript is increasingly used on the server side as well as in browsers, as can be seen in the success of node.js. With Java 8 incorporating Nashorn as a scripting engine, it looks like javascript on the server side is here to stay.

Why not just use java? Java can be very verbose and is not the ideal language for writing short ontology processing scripts, especially with the OWL API.

There are a number of other languages better suited to scripting and declarative programming in general, many of which run on the JVM. This includes

  • Groovy – a popular choice for interfacing with the OWL API
  • The Armed Bear flavor of Common Lisp, as used in LSW2.
  • Clojure, a variant of lisp, as used in Phil Lord’s powerful Tawny-OWL framework.
  • Scala, a superbly elegant functional programming language used to great effect in Jim Balhoff’s beautifully elegant scowl.
  • Iron Python – a popular choice for interfacing with the Brain. And of course, Python is the de facto language for bioinformatics these days

There are also offerings in non-JVM languages such as my own posh – in addition most languages provide some kind of RDF library, but this can often be low level for working in OWL.

I decided to write a javascript library for a number of reasons. Our group already produces a lot of javascript code, most of which can be run on the server. For example, the golr libraries used in the AmiGO 2 codebase are CommonJS, as are those used for the Monarch API. Thse APIs all access ontologies through services (and can thus be run on a non-JVM javascript engine like node), and we would not make these APIs depend on a JVM. However, the ability to go the other way is useful – in a powerful ontology processing environment that offers all the features of the OWL API, being able to access all kinds of bioinformatics data through ready-made javascript APIs.

Another reason is that JSON is ubiquitous, and having your data format be a subset of the language has some major advantages.

Plus, after an initial period of ambivalence, I have grown to really like javascript. It’s just functional enough to do some cool things.

What can you do with it?

I hope to provide some specific examples later on this blog. For now, take a look at the docs on github. Major features are:

Stay tuned for more information!

 

 

 

 

The perils of managing OWL in a version control system

Background

Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like github provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.

VCS systems also integrate well with Continuous Integration systems.
For example, a CI system can be configured to run a series of checks and even publish, triggered by a git commit/push.

OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ascii-level
diffs were as human readable as possible.

OBO Format is a deprecated format – I recommend groups switch to using
one of the W3C concrete forms of OWL. However, this comes with one
caveat – if the source (editors) version of an ontology is switched
from obo to any other OWL serialization, then human-readable diffs are
lost. Additionally, the non-deterministic serialization of the
ontology results in spurious diffs that not only hamper
human-readability, but also cause bottlenecks in VCS. As an example,
releasing a version of the Uberon ontology can consume over an hour
simply performing SVN operations.

The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.

Methods

I measured the “diffability” of different OWL formats by taking a
mid-size ontology incorporating a wide range of OWL constructs
(Uberon) and measuring
size of diffs between two ontology versions in relation to the change in
the number of axioms.

Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output using
the unix diff command (“wc -l”).

This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, .. up to
2^16. This was repeated ten times.

The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.

Results

The results can be downloaded HERE, and are plotted in the following
figure.

 

Plot showing size of diffs in relation to number of axioms added/removed

Plot showing size of diffs in relation to number of axioms added/removed

As can be seen there is a marked difference between the two RDF
formats (RDF/XML and Turtle) and the dedicated OWL serializations
(Manchester and Functional), which have roughly similar diffability to
OBO format.

In fact the diff size for RDF formats is both constant and large
regardless of the size of the diff. This appears to be due to
non-determinism when serializing axiom annotations.

This analysis only considers a single ontology, and a single version of the OWL API.

Discussion and Conclusions

Based on these results, it would appear to be a huge mistake to ever
manage an RDF serialization of OWL in a VCS. Using Manchester or
Functional gives superior diffability, with the number of axiom
changed proportional to size of the diff. OBO format offers human
readability of diffs as well, but this format is limited in
expressivity.

These recommendations are consistent with the size of the file in each format.

The following numbers are for Uberon:

  • obo 11M
  • omn 28M
  • ofn 37M
  • owl 53M
  • ttl 58M

However, one issue here is that RDF-level tools may not accept a
dedicated OWL serialization such as ofn or omn. Most RDF libraries
will however, accept RDF/XML or Turtle.

The ontology manager is then faced with a quandary – cut themselves
off from a segment of the semantic web and have diffs that are
manageable (if not readable) or live with enormous spurious diffs for
the benefits of SW integration.

The best solution would appear to be to manage source versions in a
diffable format, and release in a more voluminous RDF/semweb
format. This is not so different from software management – the users
consume a compile version of the software (jars, object files, etc)
and the software is maintained as diffable source. It’s generally
considered bad practice to check in derived products into a VCS.

However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.

The approach I would recommend is therefore to adapt the RDF/XML
generator of the OWL API such that it is deterministic, or to write an
RDF roundtripper that always produces a determinstic
serialization. This should be coupled with ongoing efforts to add
human-readable class labels as comments to enhance readability of diffs.
Ideally the recommended deterministic serialization order would be formally
specified, such that different software (and different versions of the same
software) could adhere to it.

At the same time, we need to be working on analogs of maven and
package management systems in the ontology world.

 

Footnote:

Some ongoing efforts ito mavenize ontologies:

Updates: