Debugging Ontologies using OWL Reasoning. Part 2: Unintentional Entailed Equivalence

This is the second part in a series on pragmatic techniques for debugging ontologies. In part 1 I outlined basic reasoner-based debugging using disjointness axioms with Protege and ROBOT; the goal was to detect and debug incoherent ontologies.

One potential problem that can arise is the inference of equivalence between two classes, where the equivalence is unintentional. The following example ontology from the previous post illustrates this:


ObjectProperty: part_of
Class: PNS
Class: Nerve SubClassOf: part_of some PNS
Class: PeripheralNerve EquivalentTo: Nerve and part_of some PNS

In this case PeripheralNerve and Nerve are entailed to be mutually equivalent. You can see this in Protege, as the two classes are grouped together with an equivalence symbol linking them:

[Screenshot: Protege showing Nerve and PeripheralNerve grouped together under an equivalence symbol]

As the explanation shows, the two classes are equivalent because (1) peripheral nerves are defined as any nerve in the PNS, and (2) every nerve is asserted to be part of the PNS.

We assume here that this is not the intent of the ontology developer; we assume they created distinct classes with distinct names as they believe them to be distinct. (Note that some ontologies such as SWEET employ equivalence axioms to denote two distinct terms that mean the same thing, but for this article we assume OBO-style ontology development).

When the ontology developer sees inferences like this, they will likely want to take some corrective action:

  • Under one scenario, the inference reveals to the ontology developer that in fact nerve and peripheral nerve are the same concept, and thus the two classes should be merged, with the label from one being retained as the synonym of the other.
  • Under the other scenario, the ontology developer realizes the concept they have been calling ‘Nerve’ encompasses more general neuron projection bundles found in the CNS; here they may decide to rename the concept (e.g. neuron projection bundle) and to eliminate or broaden the part_of axiom.

So far so good. But the challenge here is that an ontology with entailed equivalencies between pairs of classes is formally coherent: all classes are satisfiable, and there are no inconsistencies. It will not be caught by a pipeline that detects incoherencies such as unsatisfiable classes. This means you may end up accidentally releasing an ontology that has potentially serious biological problems. It also means we can’t use the same technique described in part 1 to make a debug module.

Formally we can state this as there being no unique class assumption in OWL. By creating two classes, c1 and c2, you are not saying that there is something that differentiates these, even if it is your intention that they are different.

Within the OBO ecosystem we generally strive to avoid equivalent named classes (the principle of orthogonality). There are known cases where equivalent classes join two ontologies (for example, GO cell and CL cell), but in general, when we find additional entailed pairs of equivalent classes that were not originally asserted, it’s a problem. I would hypothesize this is frequently true of non-OBO ontologies too.

Detecting unintended equivalencies with ROBOT

For the reasons stated above, ROBOT has configurable behavior for when it encounters equivalent classes. This can be controlled via the --equivalent-classes-allowed option (shorthand: “-e”) on the reason command. There are three options:

  • none: any entailed equivalence axiom between two named classes will result in an error being thrown
  • all: permit all equivalence axioms, entailed or asserted
  • asserted-only: permit entailed equivalence axioms only if they match an asserted equivalence axiom, otherwise throw an error

If you are unsure of what to do it’s always a good idea to start stringent and pass ‘none’. If it turns out you need to maintain asserted equivalencies (for example, the GO/CL ‘cell’ case), then you can switch to ‘asserted-only’.

The ‘all’ option is generally too permissive for most OBO ontologies, but it may be appropriate for some use cases: for example, if your ontology imports multiple non-orthogonal ontologies plus bridging axioms, and you are using reasoning to find new equivalence mappings.

For example, on our peripheral nerve ontology, if we run

robot reason -e asserted-only -r elk -i pn.omn

we will get:


ERROR org.obolibrary.robot.ReasonOperation - Only equivalent classes that have been asserted are allowed. Inferred equivalencies are forbidden.
ERROR org.obolibrary.robot.ReasonOperation - Equivalence: <http://example.org/Nerve> == <http://example.org/PeripheralNerve>

ROBOT will also exit with a non-zero exit code, ensuring that your release pipeline fails fast and preventing accidental release of broken ontologies.
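In a scripted pipeline, taking advantage of this just means checking the exit status of each step. Here is a minimal, hypothetical sketch of such a guard; `true` and `false` stand in for robot invocations so the snippet is self-contained:

```python
import subprocess

def run_step(cmd):
    """Run one pipeline step; abort the whole pipeline on a non-zero exit."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise SystemExit(f"pipeline failed at: {' '.join(cmd)}")
    return result.returncode

# In a real pipeline this might be something like:
#   run_step(["robot", "reason", "-e", "asserted-only", "-r", "elk", "-i", "pn.omn"])
assert run_step(["true"]) == 0  # stand-in for a successful robot run
```

Make (and Travis) apply exactly this logic for you, which is why a failing `robot reason` step blocks the rest of the workflow.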

Debugging false equivalence

This satisfies the requirement that potentially false equivalence can be detected, but how does the ontology developer debug this?

A typical Standard Operating Procedure might be:

  • IF robot fails with unsatisfiable classes
    • Open ontology in Protege and switch on Elk
    • Go to Inferred Classification
    • Navigate to Nothing
    • For each class under Nothing
      • Select the “?” to get explanations
  • IF robot fails with equivalence class pairs
    • Open ontology in Protege and switch on Elk
    • For each class reported by ROBOT
      • Navigate to class
      • Observe the inferred equivalence axiom (in yellow) and select ?

There are two problems with this SOP, one pragmatic and the other a matter of taste.

The pragmatic issue is that there is a Protege explanation workbench bug that sometimes renders Protege unable to show explanations for equivalence axioms in reasoners such as Elk (see this ticket). This is fairly serious for large ontologies (although for our simple example or for midsize ontologies use of HermiT may be perfectly feasible).

But even in the case where this bug is fixed or circumvented, the SOP above is suboptimal in my opinion. One reason is that it is simply more complicated: in contrast to the SOP for dealing with incoherent classes, it’s necessary to look at reports coming from outside Protege and perform additional search and lookup steps. The more fundamental reason is that the ontology is formally coherent even though it defies my expectation that it follow the unique class assumption. It would be more elegant if we could directly encode my unique class assumption, and have the ontology be entailed to be incoherent when this is violated. That way we don’t have to bolt on additional SOP instructions or additional ad-hoc programmatic operations.

And crucially, it means the same ‘logic core dump’ operation described in the previous post can be used in exactly the same way.

Approach: SubClassOf means ProperSubClassOf

My approach here is to make explicit the assumption: every time an ontology developer asserts a SubClassOf axiom, they actually mean ProperSubClassOf.

To see exactly what this means, it helps to think in terms of Venn diagrams (Venn diagrams are my go-to strategy for explaining even the basics of OWL semantics). The OWL2 direct semantics are set-theoretic, with every class interpreted as a set, so this is a valid approach. When drawing Venn diagrams, sets are circles, and one circle being enclosed by another denotes subsetting. If circles overlap, this indicates set overlap, and if no overlap is shown the sets are assumed disjoint (have no members in common).

Let’s look at what happens when an ontology developer makes a SubClassOf link between PN and N. They may believe they are saying something like this:

[Figure: Venn diagram with the PN circle strictly inside the N circle]

i.e. implicitly indicating that there are some nerves that are not peripheral nerves.

But in fact the OWL SubClassOf operator is interpreted set-theoretically as subset-or-equal-to (i.e. ⊆), which can be visually depicted as:

[Figure: Venn diagram showing the two situations permitted by SubClassOf: PN strictly inside N, or PN coincident with N]

In this case our ontology developer wants to exclude the latter as a possibility (even if we end up with a model in which these two are equivalent, the ontology developer needs to arrive at this conclusion by having the incoherencies in their own internal model revealed).
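The distinction can be made concrete with ordinary Python sets, which mirror the set-theoretic semantics directly (the individual names here are invented for illustration):

```python
# Two interpretation sets for the classes (invented individuals):
nerve = {"n1", "n2"}
peripheral_nerve = {"n1", "n2"}  # a model where the two classes coincide

# OWL SubClassOf corresponds to subset-or-equal (<=):
assert peripheral_nerve <= nerve        # holds even when the sets are equal

# The intended "ProperSubClassOf" corresponds to strict subset (<):
assert not (peripheral_nerve < nerve)   # fails in the coincident model

# In a model with a non-peripheral nerve, both readings hold:
nerve = {"n1", "n2", "n3"}
assert peripheral_nerve <= nerve and peripheral_nerve < nerve
```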

To make this explicit, there needs to be an additional class declared that (1) is disjoint from PN and (2) is a subtype of Nerve. We can think of this as a ProperSubClassOf axiom, which can be depicted visually as:

[Figure: Venn diagram of ProperSubClassOf: PN and OtherNerve disjoint, both strictly inside N]

If we encode this on our test ontology:


ObjectProperty: part_of
Class: PNS
Class: Nerve SubClassOf: part_of some PNS
Class: PeripheralNerve EquivalentTo: Nerve and part_of some PNS
Class: OtherNerve SubClassOf: Nerve DisjointWith: PeripheralNerve

We can see that the ontology is inferred to be incoherent. There is no need for an additional post-hoc check: the generic incoherence detection mechanism of ROBOT does not need any special behavior, and the ontology editor sees all problematic classes in red, and can navigate to all problems by looking under owl:Nothing:

[Screenshot: Protege showing the unsatisfiable classes in red under owl:Nothing]

Of course, we don’t want to manually assert this all the time, and litter our ontology with dreaded “OtherFoo” classes. If we can make the assumption that all asserted SubClassOfs are intended to be ProperSubClassOfs, then we can just do this procedurally as part of the ontology validation pipeline.

One way to do this is to inject a sibling for every class-parent pair and assert that the siblings are disjoint.

The following SPARQL will generate the disjoint siblings (if you don’t know SPARQL don’t worry, this can all be hidden for you):


prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
?sibClass a owl:Class ;
owl:disjointWith ?c ;
rdfs:subClassOf ?p ;
rdfs:label ?sibLabel
}
WHERE {
?c rdfs:subClassOf ?p .
FILTER(isIRI(?c))
FILTER(isIRI(?p))
FILTER NOT EXISTS { ?c owl:deprecated "true"^^xsd:boolean }
OPTIONAL {
?c rdfs:label ?clabel
BIND(concat("DISJOINT-SIB-OF ", ?clabel) AS ?sibLabel)
}
BIND (UUID() as ?sibClass)
}

Note that we exclude deprecated/obsolete classes. Each generated disjoint sibling is given a random UUID and the label “DISJOINT-SIB-OF X”. You could also opt for the simpler “Other X” as in the above example; it doesn’t matter, since only the ontology developer sees these labels, and only when debugging.
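For readers who prefer code to SPARQL, here is a procedural sketch of the same injection in plain Python. The tuple-based axiom records and the input pairs are hypothetical, not a real OWL API:

```python
import uuid

def make_disjoint_siblings(subclass_pairs, labels):
    """For each asserted (child, parent) pair, mint a disjoint sibling.

    subclass_pairs: iterable of (child_iri, parent_iri) tuples.
    labels: dict mapping IRIs to human-readable labels.
    Returns a list of (axiom_type, subject, object) records.
    """
    axioms = []
    for child, parent in subclass_pairs:
        sib = f"urn:uuid:{uuid.uuid4()}"  # fresh sibling class IRI
        axioms.append(("SubClassOf", sib, parent))
        axioms.append(("DisjointWith", sib, child))
        axioms.append(("Label", sib, f"DISJOINT-SIB-OF {labels.get(child, child)}"))
    return axioms

axioms = make_disjoint_siblings(
    [("ex:PeripheralNerve", "ex:Nerve")],
    {"ex:PeripheralNerve": "peripheral nerve"},
)
assert axioms[0][0] == "SubClassOf" and axioms[0][2] == "ex:Nerve"
assert axioms[2][2] == "DISJOINT-SIB-OF peripheral nerve"
```

In practice the SPARQL CONSTRUCT above does the same job without leaving the ROBOT toolchain.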

This can be encoded in a workflow, such that the axioms are injected as part of a test procedure. You likely do not want these axioms to leak out into the release version and confuse people.

Future versions of ROBOT may include a convenience function for doing this, but for now you can do this in your Makefile:


SRC = pn.omn
disjoint_sibs.owl: $(SRC)
robot relax -i $< query --format ttl -c construct-disjoint-siblings.sparql $@
test.owl: $(SRC) disjoint_sibs.owl
robot merge -i $< -i disjoint_sibs.owl -o $@


New version of Ontology Development Kit – now with Docker support

This is an update to a previous post, creating an ontology project.

Version 1.1.2 of the ODK is available on GitHub.

The Ontology Development Kit (ODK; formerly ontology-starter-kit) provides a way of creating an ontology project ready for pushing to GitHub, with a number of features in place:

  • A Makefile that specifies your release workflow, including building imports, creating reports and running tests
  • Continuous integration: A .travis.yml file that configures Travis-CI to check any Pull Requests using ROBOT
  • A standard directory layout that makes working with different projects easier and more predictable
  • Standardized documentation and additional file artifacts
  • A procedure for releasing your ontologies using the GitHub release mechanism

The overall aim is to borrow as much from modern software engineering practice as possible and apply it to the ontology development lifecycle.

The basic idea is fairly simple: a template folder contains a canonical repository layout; this is copied into a target area, with template variables substituted for user-supplied ones.

Among the recent improvements, I will focus here on the adoption of Docker within the ODK. Most users of the ODK don’t really need to know much about Docker – just that they have to install it, and that it runs their ontology workflow inside a container. This has multiple advantages: ontology developers don’t need to install a suite of semi-independent tools, and execution of workflows becomes more predictable and easier to debug, since the environment is standardized. I will provide a bit more detail here for people who are interested.

What is Docker?

From Wikipedia: Docker is a program that performs operating-system-level virtualization, also known as containerization. Docker can run containers on your machine, where each container bundles its own tools and environments.
[Figure: Docker architecture – Docker containers, from Docker 101]

A common use case for Docker is deploying services. In this particular case we’re not deploying a service but are instead using Docker as a means of providing and controlling a standard environment with various command line tools.

The ODK Docker container

The ODK docker container, known as odkfull, is available from the obolibrary organization on Dockerhub. As of the latest release, it comes bundled with a number of tools:

  • A standard unix environment, including GNU Make
  • ROBOT v1.1.0  (Java)
  • Dead Simple OWL Design Patterns (DOSDP) tools v0.9.0 (Scala)
  • Associated python tooling for DOSDPs (Python3)
  • OWLTools (for older workflows) (Java)
  • The odk seed script (perl)

There are a few different scenarios in which an odkfull container is executed:

  • As a one-time run when setting up a repository using seed-via-docker.sh (which wraps a script that does the actual work)
  • After initial setup and pushing to GitHub, ontology developers may wish to execute parts of the workflow locally – for example, extracting an import module after adding new seeds for external ontology classes
  • Travis-CI uses the same container used by ontology developers
  • Embedding within a larger pipeline

Setting up a repo

Typing

./seed-via-docker.sh

will initiate the process of making a new repo, depositing the results in the target/ folder. This is all done within a container. The seed process will generate a workflow in the form of a Makefile, and then run that workflow, all in the container. The final step of pushing the repo to GitHub is currently done by the user directly in their own environment, rather than from within the container.

Running parts of the workflow

Note that repos built from the odk will include a one-line script in the src/ontology folder* called “run.sh”. This is a simple wrapper for running the docker container. (if you built your repo from an earlier odk, then you can simply copy this script).

Now, instead of typing

make test

the ontology developer can type

./run.sh make test

The former requires that the user have all the relevant tooling installed (which at minimum requires Xcode on OS X, which not all curators have). The latter requires only Docker.

Travis execution

Note that the .travis.yml file generated will be configured to run the travis job in an odkfull container. If you generated your repo using an earlier odk, you can manually adapt your existing travis file.

Is this the right approach?

Docker may seem quite heavyweight for something like running an ontology pipeline. Before deciding on this path, we ran some tests with volunteers in GO who were not familiar with Docker. These editors need to rebuild import files frequently, and having everyone install their own tools has not worked out well in the past. Preliminary results indicate the editors are happy with this approach.

It may be that in future more of this can be triggered directly from within Protege, or that ontology environments such as Tawny-OWL prove powerful enough to do everything from one tool chain. But for now the reality is that many ontology workflows require a fairly heterogeneous set of tools to operate, and there is frequently a need to use multiple languages, which complicates the install environment. Docker provides a nice way to unify this.

We’ll put this into practice at ICBO this week, in the Phenotype Ontology and OBO workshops.

Acknowledgments

Thanks to the many developers and testers: David Osumi-Sutherland, Nico Matentzoglu, Jim Balhoff, Eric Douglass, Marie-Angelique Laporte, Rebecca Tauber, James Overton, Nicole Vasilevsky, Pier Luigi Buttigieg, Kim Rutherford, Sofia Robb, Damion Dooley, Citlalli Mejía Almonte, Melissa Haendel, David Hill, Matthew Lange.

More help and testers wanted! See: https://github.com/INCATools/ontology-development-kit/issues

Footnotes

* Footnote: the Makefile has always lived in the src/ontology folder, but the build process requires the whole repo, so the run.sh wrapper maps two levels up. It looks a little odd, but it works. In future if there is demand we may switch the Makefile to being in the root folder.

Debugging Ontologies using OWL Reasoning. Part 1: Basics and Disjoint Classes axioms

This is the first part in a series on pragmatic techniques for debugging ontologies. See also part 2

All software developers are familiar with the concept of debugging, a process for finding faults in a program. The term ‘bug’ has been used in engineering since the 19th century, and was used by Grace Hopper to describe a literal bug gumming up the works of the Mark II computer. Since then, debugging and debugging tools have become ubiquitous in computing, and the modern software developer is fortunate enough to have a large array of tools and techniques at their disposal. These include unit tests, assertions and interactive debuggers.

[Photo: the original bug]

Ontology development has many parallels with software development, so it’s reasonable to assume that debugging techniques from software can be carried over to ontologies. I’ve previously written about use of continuous integration in ontology development, and it is now standard to use Travis to check pull requests on ontologies. Of course, there are important differences between software and ontology development. Unlike typical computer programs, ontologies are not executed, so the concept of an interactive debugger stepping through an execution sequence doesn’t quite translate to ontologies. However, there are still a wealth of tooling options for ontology developers, many of which are under-used.

There is a great deal of excellent academic material on the topic of ontology debugging; see for example the 2013 and 2014 proceedings of the excellently named Workshop on Debugging Ontologies and Ontology Mappings (WoDOOM), or the seminal Debugging OWL Ontologies. However, many ontology developers may not be aware of some of the more basic ‘blue collar’ techniques in use for ontology debugging.

Using OWL Reasoning and disjointness axioms to debug ontologies

In my own experience one of the most effective means of finding problems in ontologies is through the use of OWL reasoning. Reasoning is frequently used for automated classification, and this is supported in tools such as ROBOT through the reason command. In addition to classification, reasoning can also be used to debug an ontology, usually by inferring if the ontology is incoherent. The term ‘incoherent’ isn’t a value judgment here; it’s a technical term for an ontology that is either inconsistent or contains unsatisfiable classes, as described in this article by Robert Stevens, Uli Sattler and Phillip Lord.

A reasoner will not find bugs without some help from you, the ontology developer.


You have to impart some of your own knowledge of the domain into the ontology in order for incoherency to be detected. This is usually done by adding axioms that constrain the space of what is possible. The ontogenesis article has a nice example using red blood cells and the ‘only’ construct. I will give another example using the DisjointClasses axiom type; in my experience working on large inter-related ontologies, disjointness axioms are one of the most effective ways of finding bugs (and they have the added advantage of being within the profile of OWL understood by Elk).

Let’s take the following example, a slice of an anatomical ontology dealing with cranial nerves. The underlying challenge here is the fact that the second cranial nerve (the optic nerve) is not in fact a nerve, as it is part of the central nervous system (CNS), whereas true nerves are part of the peripheral nervous system (PNS). This seeming inconsistency has plagued different anatomy ontologies.

Ontology: <http://example.org>
Prefix: : <http://example.org/>
ObjectProperty: part_of
Class: CNS
Class: PNS
Class: StructureOfPNS EquivalentTo: part_of some PNS
Class: StructureOfCNS EquivalentTo: part_of some CNS
DisjointClasses: StructureOfPNS, StructureOfCNS
Class: Nerve SubClassOf: part_of some PNS
Class: CranialNerve SubClassOf: Nerve
Class: CranialNerveII SubClassOf: CranialNerve, part_of some CNS

[Figure: depiction of the CNS/PNS disjointness example]

You may have noted this example uses slightly artificial classes of the form “Structure of X”. These are not strictly necessary; we’ll return to this when we discuss General Concept Inclusion (GCI) axioms in a future part.

If we load this into Protege and switch on the reasoner, we will see that CranialNerveII shows up red, indicating it is unsatisfiable and rendering the ontology incoherent. We can easily find all unsatisfiable classes under the ‘Nothing’ builtin class in the inferred hierarchy view. Clicking on the ‘?’ button will make Protege show an explanation, such as the following:

[Screenshot: Protege explanation for the unsatisfiability of CranialNerveII]

This shows all the axioms that lead to the conclusion that CranialNerveII is unsatisfiable. At least one of these axioms must be wrong (for example, the assumption that all cranial nerves are nerves may be terminologically justified, but could be wrong here; or perhaps it is the assumption that CN II is actually a cranial nerve; or we may simply want to relax the constraint and allow spatial overlap between peripheral and central nervous system parts). The ontology developer can then set about fixing the ontology until it is coherent.

Detecting incoherencies as part of a workflow

Protege provides a nice way of finding ontology incoherencies, and of debugging them by examining explanations. However, it is still possible to accidentally release an incoherent ontology, since the ontology editor is not compelled to check for unsatisfiabilities in Protege prior to saving. It may even be possible for an incoherency to be inadvertently introduced through changes to an upstream dependency, for example, by rebuilding an import module.

Luckily, if you are using ROBOT to manage your release process, then it should be all but impossible for you to accidentally release an incoherent ontology. This is because the robot reason command will throw an error if the ontology is incoherent. If you are using robot as part of a Makefile-based workflow (as configured by the ontology starter kit) then this will block progression to the next step, as ROBOT returns with a non-zero exit code when performing a reasoner operation on an incoherent ontology. Similarly, if you are using Travis-CI to vet pull requests or check the current ontology state, then the travis build will automatically fail if an incoherency is encountered.

[Figure: ROBOT reason flow diagram. Exiting with system code 0 indicates success, non-zero failure.]

Running robot reason on our example ontology yields:

$ robot reason -r ELK -i cranial.omn
ERROR org.obolibrary.robot.ReasonerHelper - There are 1 unsatisfiable classes in the ontology.
ERROR org.obolibrary.robot.ReasonerHelper -     unsatisfiable: http://example.org/CranialNerveII

Generating debug modules – incoherent SLME

Large ontologies can strain the limits of the laptop computers usually used to develop ontologies. It can be useful to make something analogous to a ‘core dump’ in software debugging — a standalone minimal component that can be used to reproduce the bug. This is a module extract (using a normal technique like SLME) seeded by all unsatisfiable classes (there may be multiple). This provides sufficient axioms to generate all explanations, plus additional context.

I use the term ‘unsatisfiable module’ for this artefact. This can be done using the robot reason command with the “--debug-unsatisfiable” option.

In our Makefiles we often have a target like this:

debug.owl: my-ont.owl
        robot reason -i  $< -r ELK -D $@

If the ontology is incoherent then “make debug.owl” will make a small-ish standalone file that can be easily shared and quickly loaded in Protege for debugging. The ontology will be self-contained with no imports – however, if the axioms come from different ontologies in an import chain, then each axiom will be annotated with the source ontology, making it easier for you to track down the problematic import. This can be very useful for large ontologies with multiple dependencies, where there may be different versions of the same ontology in different import chains. 

Coming up

The next article will deal with the case of detecting unwanted equivalence axioms in ontologies, and future articles in the series will deal with practical tips on how best to use disjointness axioms and other constraints in your ontologies.

Carry on reading: Part 2, Unintentional Entailed Equivalence

Acknowledgments

Thanks to Nico Matentzoglu for comments on a draft of this post.

A developer-friendly JSON exchange format for ontologies

OWL2 ontologies can be rendered using a number of alternate concrete forms / syntaxes:

  • Manchester Syntax
  • Functional Syntax
  • OWL-XML
  • RDF/XML
  • RDF/Turtle

All of the above are official W3 recommendations. If you aren’t that familiar with these formats and the differences between them, the W3 OWL Primer is an excellent starting point. While all of the above are semantically equivalent ways to serialize OWL (with the exception of Manchester, which cannot represent some axiom types), there are big pragmatic differences in the choice of serialization. For most developers, the most important differentiating factor is  support for their language of choice.

Currently, the only language I am aware of with complete support for all serializations is Java, in the form of the OWLAPI. This means that most heavy-duty ontology applications use Java or a JVM language (see previous posts for some examples of JVM frameworks that wrap the OWLAPI).

Almost all programming languages have support for RDF parsers, which is one reason why the default serialization for OWL is usually an RDF one. In theory this makes it more accessible. However, RDF can be a very low-level way to process ontologies. For certain kinds of operations, such as traversing a simple subClassOf hierarchy, it can be perfectly fine. However, even commonly encountered constructs such as “X SubClassOf part-of some Y” are very awkward to handle, involving blank nodes (see the translation here). When it comes to something like axiom annotations (common in OBO ontologies), things quickly get cumbersome. It must be said, though, that using an RDF parser is always better than processing an RDF/XML file using an XML parser. That is two levels of abstraction too low – never do this! You will go to OWL hell. At least you will not be in the lowest circle – that is reserved for people who parse RDF/XML using an ad-hoc Perl regexp parser.

Even in JVM languages, an OWL-level abstraction can be less than ideal for some of the operations people want to do on a biological ontology. These operations include:

  • construct and traverse a graph constructed from SubClassOf axioms between either pairs of named classes, or named-class to existential restriction pairs
  • create an index of classes based on a subset of lexical properties, such as labels and synonyms
  • Generate a simple term info page for showing in a web application, with common fields like definition prominently shown, with full attribution for all axioms
  • Extract some subset of the ontology
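As a flavour of how simple some of these operations are at the right abstraction level, the lexical-index operation listed above is a few lines over plain dictionaries (the class records here are invented, not a real ontology API):

```python
# Invented class records: id, label, and synonyms.
classes = [
    {"id": "X:1", "label": "nerve", "synonyms": ["neural tree organ"]},
    {"id": "X:2", "label": "peripheral nerve", "synonyms": []},
]

# Build a lexical index: every label or synonym maps to class ids.
index = {}
for c in classes:
    for name in [c["label"], *c["synonyms"]]:
        index.setdefault(name.lower(), []).append(c["id"])

assert index["nerve"] == ["X:1"]
assert index["neural tree organ"] == ["X:1"]
```

Doing the equivalent over raw OWL axioms means walking annotation assertions and annotated literals, which is considerably more ceremony.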

It can be quite involved doing even these simple operations using the OWLAPI. This is not to criticize the OWLAPI – it is an API for OWL, and OWL is in large part a syntax for writing set-theoretic expressions constraining a world of models. This is a bit of a cognitive mismatch for a hierarchy of lexical objects, or a graph-based organization of concepts, which is the standard  abstraction for ontologies in Bioinformatics.

There are some libraries that provide useful convenience abstractions – this was one of the goals of OWLTools, as well as The Brain. I usually recommend a library such as one of these for bioinformaticians wishing to process OWL files, but it’s not ideal for everyone. It introduces yet another layer, and still leaves out non-JVM users.

For cases where we want to query over ontologies already loaded in a database or registry, there are some good abstraction layers – SciGraph provides a bioinformatician-friendly graph level / Neo4J view over OWL ontologies. However, sometimes it’s still necessary to have a library to parse an ontology fresh off the filesystem with no need to start up a webservice and load in an ontology.

What about OBO format?

Of course, many bioinformaticians are blissfully unaware of OWL and just go straight to OBO format, a format devised by and originally for the Gene Ontology. And many of these bioinformaticians seem reasonably content to continue using this – or at least lack the activation energy to switch to OWL (despite plenty of encouragement).

One of the original criticisms of Obof was its lack of formalism, but now Obof has a defined mapping to OWL, and that mapping is implemented in the OWLAPI. Protege can load and save Obof just as if it were any other OWL serialization, which it effectively is (without the W3C blessing). It can only represent a subset of OWL, but that subset is actually a superset of what most consumers need. So what’s the problem in just having Obof as the bioinformaticians’ format, with ontologists using OWL for the heavy-duty ontology lifting?

There are a few:

  • It’s ridiculously easy to create a hacky parser for some subset of Obof, but it’s surprisingly hard to get it right. Many of the parsers I have seen are implemented based on the usual bioinformatics paradigm of ignoring the specs and inferring a format from a few examples. These have a tendency to proliferate, as it’s easier to write your own than to figure out whether someone else’s fits your needs. Even with the better ones, there are always edge cases that don’t conform to expectations. We often end up having to normalize Obof output in certain ways to avoid breaking crappy parsers.
  • The requirement to support Obof leads to cases of tails wagging the dog, whereby ontology producers will make some compromise to avoid alienating a certain subset of users
  • Obof will always support the same fixed subset of OWL. This is probably more than what most people need, but there are frequently situations where it would be useful to have support for one extra feature – perhaps blank nodes to support one level of nesting in an expression.
  • The spec is surprisingly complicated for what was intended to be a simple format. This can lead to traps.
  • The mapping between CURIE-like IDs and semantic web URIs is awkwardly specified and leads to no end of confusion when the semantic web world wants to talk to the bio-database world. Really we should have reused something like JSON-LD contexts up front. We live and learn.
  • Really, there should be no need to write a syntax-level parser. Developers expect something layered on XML or JSON these days (more so the latter).

What about JSON-LD?

A few years ago I asked on the public-owl-dev list if there were a standard JSON serialization for OWL. This generated some interesting discussion, including a suggestion to use JSON-LD.

I still think that this is the wrong level of abstraction for many OWL ontologies. JSON-LD is great, and we use it for many instance-level representations, but it suffers from the same issues that all RDF layerings of OWL face: they are too low-level for certain kinds of OWL axioms. Also, JSON-LD is a bit too open-ended for some developers, as graph labels are mapped directly to JSON keys, which can make the resulting structures hard to work with.

Another suggestion on the list was to use a relatively straightforward mapping of something like functional/abstract syntax to JSON. This is a great idea and works well if you want to implement something akin to the OWL API for non-JVM languages. I still think that such a format is important for increasing uptake of OWL, and hope to see this standardized.

However, we’re still back at the basic bioinformatics use case, where an OWL-level abstraction doesn’t make so much sense. Even if we get an OWL-JSON, I think there is still a need for an “OBO-JSON”, a JSON that can represent OWL constructs, but with a mapping to structures that correspond more closely to the kinds of operations like traversing a TBox-graph that are common in life sciences applications.

A JSON graph-oriented model for ontologies

After kicking this back and forth for a while we have a proposal for a graph-oriented JSON model for OWL, tentatively called obographs. It’s available at https://github.com/geneontology/obographs

The repository contains the start of documentation on the structural model (which can be serialized as JSON or YAML), plus java code to translate an OWL ontology to obograph JSON or YAML.

Comments are more than welcome, here or in the tracker. But first some words concerning the motivation here.

The overall goal was to make it easy to do the 99% of things that bioinformatics developers usually do, without throwing the 1% under the bus. Although it is not yet a complete representation of OWL, the initial design allows for extension in this direction.

One consequence of this is that the central object is an existential graph (I’ll get to that term in a second). We call this subset Basic OBO Graphs, or BOGs, roughly corresponding to the OBO-Basic subset of OBO Format. The edge model is pretty much identical to every directed graph model out there: a set of nodes and a set of directed labeled edges (more on what can be attached to the edges later). Here is an example of a subset of two connected classes from Uberon:

"nodes" : [
    {
      "id" : "UBERON:0002102",
      "lbl" : "forelimb"
    }, {
      "id" : "UBERON:0002101",
      "lbl" : "limb"
    }
  ],
  "edges" : [
    {
      "subj" : "UBERON:0002102",
      "pred" : "is_a",
      "obj" : "UBERON:0002101"
    }
  ]

So what do I mean by existential graph? This is the graph formed by SubClassOf axioms that connect named classes to either named classes or simple existential restrictions. Here is the mapping (shown using the YAML serialization – if we exclude certain fields like dates then JSON is a straightforward subset, so we can use YAML for illustrative purposes):

Class: C
  SubClassOf: D

==>

edges:
 - subj: C
   pred: is_a
   obj: D

Class: C
  SubClassOf: P some D

==>

edges:
 - subj: C
   pred: P
   obj: D

These two constructs correspond to is_a and relationship tags in Obof. This is generally sufficient as far as logical axioms go for many applications. The assumption here is that these axioms are complete to form a non-redundant existential graph.
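As a sketch, the translation of these two axiom patterns into BOG edges could look roughly like this (a hypothetical helper for illustration, not the reference implementation, which is in Java; the tuple encoding of axioms is an assumption made for the example):

```python
# Hypothetical sketch: map simple SubClassOf axioms (encoded as tuples)
# to BOG edge dicts, following the two patterns described above.
def axiom_to_edge(axiom):
    if axiom[0] == "SubClassOf" and len(axiom) == 3:
        _, sub, sup = axiom
        if isinstance(sup, str):
            # Class: C SubClassOf: D  ==>  is_a edge
            return {"subj": sub, "pred": "is_a", "obj": sup}
        if sup[0] == "some":
            # Class: C SubClassOf: P some D  ==>  edge labeled with P
            _, prop, filler = sup
            return {"subj": sub, "pred": prop, "obj": filler}
    return None  # anything else falls outside the BOG subset

edges = [axiom_to_edge(a) for a in [
    ("SubClassOf", "UBERON:0002102", "UBERON:0002101"),
    ("SubClassOf", "Nerve", ("some", "part_of", "PNS")),
]]
```

Anything that returns None here would be routed to the more expressive constructs discussed below, rather than forced into the graph.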

What about the other logical axiom and construct types in OWL? Crucially, rather than following the path of a direct RDF mapping and trying to cram all axiom types into a very abstract graph, we introduce new objects for increasingly exotic axiom types – supporting the 1% without making life difficult for the 99%. For example, AllValuesFrom expressions are allowed, but these don’t get placed in the main graph, as they typically aren’t operated on in the same way in most applications.

What about non-logical axioms? We use an object called Meta to represent any set of OWL annotations associated with an edge, node or graph. Here is an example (again in YAML):

  - id: "http://purl.obolibrary.org/obo/GO_0044464"
    meta:
      definition:
        val: "Any constituent part of a cell, the basic structural and functional\
          \ unit of all organisms."
        xrefs:
        - "GOC:jl"
      subsets:
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goantislim_grouping"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gosubset_prok"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goslim_pir"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gocheck_do_not_annotate"
      xrefs:
      - val: "NIF_Subcellular:sao628508602"
      synonyms:
      - pred: "hasExactSynonym"
        val: "cellular subcomponent"
        xrefs:
        - "NIF_Subcellular:sao628508602"
      - pred: "hasRelatedSynonym"
        val: "protoplast"
        xrefs:
        - "GOC:mah"
    type: "CLASS"
    lbl: "cell part"


Meta objects can also be attached to edges (corresponding to OWL axiom annotations), or at the level of a graph (corresponding to ontology annotations). Oh, but we avoid the term annotation, as that always trips up people not coming from a deep semweb/OWL background.

As can be seen, commonly used OBO annotation properties get their own top-level tag within a meta object, while other annotations go into a generic object.
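To give a feel for how this shape is consumed, here is a sketch of pulling the exact synonyms out of a node's meta object, using the field names from the example above (the node is inlined as JSON for self-containment):

```python
import json

# A trimmed-down node following the obograph meta shape shown above.
node = json.loads("""
{
  "id": "http://purl.obolibrary.org/obo/GO_0044464",
  "lbl": "cell part",
  "meta": {
    "synonyms": [
      {"pred": "hasExactSynonym", "val": "cellular subcomponent"},
      {"pred": "hasRelatedSynonym", "val": "protoplast"}
    ]
  }
}
""")

# Collect only the exact synonyms, tolerating nodes with no meta block.
exact = [s["val"] for s in node.get("meta", {}).get("synonyms", [])
         if s["pred"] == "hasExactSynonym"]
```

Because the common annotation properties are first-class fields rather than generic triples, this kind of extraction needs no vocabulary lookup.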

BOGs and ExOGs

What about the 1%? Additional fields can be used, turning the BOG into an ExOG (Expressive OBO graph).

Here is an example of a construct that is commonly used in OBOs, primarily used for the purposes of maintaining an ontology, but increasingly used for doing more advanced discovery-based inference:

Class: C
EquivalentTo: G1 and ... and Gn and (P1 some D1) and ... and (Pm some Dm)

Where all variables refer to named entities (C, Gi and Di are classes, Pi are Object Properties)

We translate to:

 nodes: ...
 edges: ...
 logicalDefinitionAxioms:
  - definedClassId: C
    genusIds: [G1, ..., Gn]
    restrictions:
    - propertyId: P1 
      fillerId: D1
    - ...
    - propertyId: Pm 
      fillerId: Dm

Note that the above transform is not expressive enough to capture all equivalence axioms. Again the idea is to have a simple construct for the common case, and fall-through to more generic constructs.
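A sketch of going the other way – rendering a logicalDefinitionAxioms entry back into a Manchester-style EquivalentTo axiom under the genus-differentia pattern above (the rendering function is hypothetical, and CURIE labels are used in place of full IDs):

```python
# Hypothetical sketch: reconstruct an EquivalentTo expression from a
# logicalDefinitionAxioms entry of the shape shown above.
def render_logical_def(ldef):
    parts = list(ldef["genusIds"])
    parts += ["%s some %s" % (r["propertyId"], r["fillerId"])
              for r in ldef["restrictions"]]
    return "%s EquivalentTo: %s" % (ldef["definedClassId"], " and ".join(parts))

s = render_logical_def({
    "definedClassId": "PeripheralNerve",
    "genusIds": ["Nerve"],
    "restrictions": [{"propertyId": "part_of", "fillerId": "PNS"}],
})
```

Equivalence axioms that don't fit this genus-plus-existential-restrictions shape would be carried by a more generic construct instead.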

Identifiers and URIs

Currently all the examples in the repo use complete URIs, but this is in progress. The idea is that the IDs commonly used in bioinformatics databases (e.g. GO:0008150) can be supported, but the mapping to URIs can be made formal and unambiguous through the use of an explicit JSON-LD context, and one or more default contexts. See the prefixcommons project for more on this. See also the prefixes section of the ROBOT docs.
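The CURIE-to-URI mapping itself is mechanically simple once a context is in hand; here is a minimal sketch with a hand-written prefix map (in practice the context would come from prefixcommons or a published JSON-LD context, not be hard-coded like this):

```python
# Illustrative JSON-LD-style context mapping prefixes to URI bases.
context = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

def expand(curie, ctx):
    """Expand a CURIE like GO:0008150 to a full URI; pass through unknowns."""
    prefix, local = curie.split(":", 1)
    return ctx[prefix] + local if prefix in ctx else curie

uri = expand("GO:0008150", context)
```

The point of making the context explicit is that the expansion is then unambiguous: two consumers sharing the context will always derive the same URIs.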

Documentation and formal specification

There is as yet no formal specification. We are still exploring possible shapes for the serialization. However, the documentation and examples provided should be sufficient for developers to grok things fairly quickly, and for OWL folks to get a sense of where we are going. Here are some things potentially useful for now:

Tools

The GitHub repo also houses a reference implementation in Java, plus an OWL to JSON converter script (reverse is not yet implemented). The java implementation can be used as an object model in its own right, but the main goal here is to make a serialization that is easy to use from any language.

Even without a dedicated API, operations are easy with most languages. For example, in python to create a mapping of ids to labels:

import json

with open('foo.json') as f:
    gdoc = json.load(f)

lmap = {}
for g in gdoc['graphs']:
    for n in g['nodes']:
        lmap[n['id']] = n.get('lbl')

Admittedly this particular operation is relatively easy with rdflib, but other operations become more awkward (not to mention the disappointingly slow performance of rdflib).

There are a number of applications that already accept obographs. The central graph representation (the BOG) corresponds to a bbop-graph. This is the existential graph representation we have been using internally in GO and Monarch. The SciGraph API sends back bbop-graph objects by default.

Some additional new pieces of software supporting obographs:

  • noctua-reasoner – a javascript reasoner supporting a subset of OWL-RL, intended for client-side reasoning in browsers
  • obographviz – generation of dot files (and pngs etc) from obographs, allowing many of the same customizations as blipkit

Status

At this stage I am interested in comments from a wider community, both in the bioinformatics world, and in the semweb world.

Hopefully the former will find it useful, and will help wean people off of oboformat (to help this, ontology release tools like ROBOT and OWLTools already or will soon support obograph output, and we can include a json file for every OBO Library ontology as part of the central OBO build).

And hopefully the latter will not be offended too much by the need to add yet another format into the mix. It may even be useful to some parts of the OWL/semweb community outside bioinformatics.


Creating an ontology project, an update

In a previous post, I recommended some standard ways of managing the various portions of an ontology project using a version control system like GitHub.

Since writing that post, I’ve written a new utility that makes this task even easier. With the ontology-starter-kit you can generate all your project files and get set up for creating your first release in minutes. This script takes into account some changes since the original post two years ago:

  • Travis-CI has become the de-facto standard continuous integration system for performing unit tests on any project managed in GitHub (for more on CI see this post). The starter-kit will give you a default travis setup.
  • Managing your metadata and PURLs on the OBO Library has changed to a GitHub-based system.
  • ROBOT has emerged as a simpler way of managing many aspects of a release process, particularly managing your external imports.

Getting started

To get started, clone or download cmungall/ontology-starter-kit

Currently, you will need:

  • perl
  • make
  • git (command line client)

For best results, you should also download owltools, oort and robot (in the future we’ll have a more unified system)

You can obtain all these by running the install script:

./INSTALL.sh

This should be run from within the ontology-starter-kit directory

Then, from within that directory, you can seed your ontology:

./seed-my-ontology-repo.pl  -d ro -d uberon -u obophenotype -t cnidaria-ontology cnido


This assumes that you are building some kind of extension to Uberon, using the Relation Ontology (OBO Library ontology IDs must be used here), that you will be placing this in the https://github.com/obophenotype/ organization, that the repo name is cnidaria-ontology, and that IDs will be of the form CNIDA:nnnnnnn

After running, the repository will be created in the target/cnidaria-ontology folder, relative to where you are. You can move this out to somewhere more convenient.

The script is chatty, informing you how it copies the template files from the template directory into the target directory. It will create your initial source setup, including a makefile, and then use that makefile to create an initial release, going so far as to init the git repo and add and commit files (unless overridden). It will not go as far as creating a repo for you on GitHub, but it provides explicit instructions on what you should do next:


EXECUTING: git status
# On branch master
nothing to commit, working directory clean
NEXT STEPS:
0. Examine target/cnidaria-ontology and check it meets your expectations. If not blow it away and start again
1. Go to: https://github.com/new
2. The owner MUST be obophenotype. The Repository name MUST be cnidaria-ontology
3. Do not initialize with a README (you already have one)
4. Click Create
5. See the section under '…or push an existing repository from the command line'
E.g.:
cd target/cnidaria-ontology
git remote add origin git@github.com:obophenotype/cnido.git
git push -u origin master

Note also that it also generates a metadata directory for you, with .md and .yml files you can use for your project on obolibrary (of course, you need to request your ontology ID space first, but you can go ahead and make a pull request with these files).

Future development

The overall system may no longer be necessary in the future, if we get a complete turnkey ontology release system with capabilities similar to analogous tools in software development such as maven.

For now, the Makefile approach is most flexible, and is widely understood by many software developers, but a long standing obstacle has been the difficulty in setting up the Makefile for a new project. The starter kit provides a band-aid here.

If required, it should be possible to set up alternate templates for different styles of project layouts. Pull requests on the starter-kit repository are welcome!


Introduction to Protege and OWL for the Planteome project

As a part of the Planteome project, we develop common reference ontologies and applications for plant biology.

Planteome logo

As an initial phase of this work, we are transitioning from editing standalone ontologies in OBO-Edit to integrated ontologies using increased OWL axiomatization and reasoning. In November we held a workshop that brought together plant trait experts from across the world, and developed a plan for integrating multiple species-specific ontologies with a reference trait ontology.

As part of the workshop, we took a tour of some of the fundamentals of OWL, hybrid obo/owl editing using Protege 5, and using reasoners and template-based systems to automate large portions of ontology development.

I based the material on an earlier tutorial prepared for the Gene Ontology editors, it’s available on the Planteome GitHub Repository at:

https://github.com/Planteome/protege-tutorial

A lightweight ontology registry system

For a number of years, I have been one of the maintainers of the registry that underpins the list of ontologies at the Open Biological Ontologies Foundry/Library (http://obofoundry.org). I also built some of the infrastructure that creates nightly builds of each ontology, verifying it and providing versions in both obo format and owl.

The original system grew organically and was driven by an ultra-simple file called “ontologies.txt“, stored on Google Code. This grew to be supplemented by a collection of other files for maintaining the list of issue trackers, together with additional metadata to maintain the central OBO builds. The imminent demise of Google Code and the general creakiness and inflexibility of the old system prompted the search for a new solution. I wanted something that would make it much easier for ontology providers to update their information, but at the same time allow the central OBO group the ability to vet and correct entries. We needed something more sophisticated than a flat key-value list, yet not overly complex. We also wanted something compatible with semantic web standards (i.e. an RDF file with a description of every ontology in it, using standard vocabularies and ontologies for the properties and classes). We also wanted it to look a bit nicer than the old site, which was looking decidedly 2000-and-late.

Screen Shot 2015-08-26 at 7.57.26 PM

The legacy OBOFoundry site, looking dated and missing key information

What are some of the options here?

  • A centralized wiki, with a page for each ontology, and groups updating their entry on the wiki
  • Each group embeds the metadata about the ontology in a website they maintain. This is then periodically harvested by the central registry. Options for embedding the metadata include microdata and RDFa
  • Each group maintains their own metadata in or alongside their ontology in rdf/owl, and this is periodically harvested
  • Piggy back off of an existing registry, e.g. BioPortal
  • A bespoke registry system, designed from the ground up, with its own relational database underpinning it, etc

These are all good solutions in the appropriate context, but none fitted our requirements precisely. Wikis are best for unstructured or loosely structured narrative text, but attempts to embed structured information inside wikis have been less than satisfactory. The microdata/RDFa approach is interesting, but not practical for us. Microdata is inherently limited in terms of extensibility, and RDFa is complex for many users. Additionally, it requires both that groups produce their own web sites (many rely on the OBO Foundry to do this for them), and that we both harvest the metadata and relinquish control. As mentioned previously, it is useful for the OBO repository administrators to have certain fields filled in centrally (sometimes for policy reasons, sometimes technical). The same concerns underpin the fully decentralized approach, in which every group maintains the metadata directly as part of the ontology, and we harvest this.

Existing registries are built for their own requirements. A bespoke registry system is attractive in many ways, as this can be highly customized, but this can be expensive and we lacked the resources for this.

Solution: GitHub pages and “YAML-LD”

I initially prototyped a solution making use of the GitHub pages framework, driven by YAML files. This can be considered a kind of bespoke system, contradicting what I said above. But rather than roll the entire framework, the system is really just some templates glueing together some existing systems. GitHub support for social coding and YAML helped a lot. The system was very quick to develop and it soon morphed into the actual system to replace the old OBO site.

YAML

YAML is a markup language that superficially resembles the tag-value stanza format we were previously using, but crucially allows for nesting. Here is an example of a snippet of YAML for a cephalopod ontology:

id: ceph
title: Cephalopod Ontology
contact:
  email: cjmungall@lbl.gov
  label: Chris Mungall
description: An anatomical and developmental ontology for cephalopods
taxon:
  id: NCBITaxon:6605
  label: Cephalopod

Note that certain tags have ‘objects’ as their fields, e.g. contact and taxon.

We stick to the subset of YAML that can be represented in JSON, and we can thus define a JSON-LD context, allowing for a direct translation to RDF, which is nice. This part is still being finalized, but the basic idea is that keys like ‘title’ will be mapped to dc:title, and the taxon CURIE will be expanded to the full PURL for cephalopoda.
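As a sketch of what that translation looks like, here is a minimal hand-rolled version of the idea, mapping a registry entry (shown as the equivalent Python dict) to RDF-style triples. The context and subject PURL pattern are illustrative assumptions; the real mapping would be defined by a published JSON-LD context:

```python
# Registry entry, equivalent to the YAML snippet shown above (subset).
entry = {
    "id": "ceph",
    "title": "Cephalopod Ontology",
    "description": "An anatomical and developmental ontology for cephalopods",
}

# Illustrative JSON-LD-style context: YAML keys -> property URIs.
context = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "description": "http://purl.org/dc/elements/1.1/description",
}

# Derive the subject URI from the ontology id (OBO PURL convention).
subject = "http://purl.obolibrary.org/obo/%s.owl" % entry["id"]
triples = [(subject, context[k], v) for k, v in entry.items() if k in context]
```

Keys without a context mapping (like the raw id here) simply don't produce triples, which keeps the YAML free to carry registry-internal fields alongside the published metadata.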

The basic idea is to manage each ontologies metadata as a separate YAML file in a GitHub repository. GitHub features nice builtin YAML rendering, and files can be edited via the GitHub web-interface, which is YAML-aware.

The list of metadata files is here. Note that these are markdown files (the .md stands for markdown, not metadata). YAML can actually be embedded in Markdown, so each file is a mini-webpage for the ontology with the metadata embedded right in there. This is in some ways similar to the microdata/RDFa approach but IMHO much more elegant.

GitHub Pages

Each markdown file is rendered attractively through the GitHub interface – for example, here is the md file for the environment ontology, rendered using the builtin GitHub md renderer. Note the yaml block contains structured data and the rest of the file can contain any mixture of markdown and HTML which is rendered on the page. We can do better than this using GitHub pages. Using a simple static site generator and templating system (Jekyll/liquid) we can render each page using our own CSS with our own format. For example here is ENVO again, but rendered using Jekyll. Note that we aren’t even running our own webserver here, this is all a service provided for us, in keeping with our desire to keep things lightweight and resource-light.

Screen Shot 2015-08-26 at 10.43.30 PM

The entire system consists of a few HTML templates plus a single python script that derives an uber-metadata file that powers the central table (visible on the front page).

Distributed editing

Where the system really shines is the distributed and social editing model. All of this comes for free when hosted on GitHub (in theory GitLab or some other sites should work). Anyone can come along and fork the OBOFoundry.github.io github repository into their own userspace and make edits – they can even do this without leaving their web browser (see the Edit button on the bottom left of every OBO ontology page).

What’s to stop some vandal trashing the registry? Crucially, any edits made by a non-owner remains in their own fork until they issue a Pull Request. After that, someone from OBO will come along and either merge in the pull request, or close it (giving a reason why they did not merge of course). The version control system maintains a full audit trail of this, premature merges can be rolled back, etc.

The task of the OBO team is made easier thanks to Travis-CI, a Continuous Integration system integrated into GitHub. I configured the OBOFoundry github site with a Travis configuration file that instructs Travis to check every pushed commit using an automated test suite – this ensures that people editing their yaml files don’t make syntax errors, or omit crucial metadata.
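The kind of check involved is straightforward; a minimal sketch of a required-fields test in the spirit of that suite might look like this (the field list is illustrative, not the actual OBO test suite):

```python
# Illustrative required-metadata check, in the spirit of the Travis
# test suite described above; the real suite also validates YAML syntax.
REQUIRED = {"id", "title", "contact", "description"}

def validate(entry):
    """Return a sorted list of missing required fields (empty if valid)."""
    missing = REQUIRED - set(entry)
    return sorted(missing)

errors = validate({"id": "ceph", "title": "Cephalopod Ontology"})
```

Running something like this on every pushed commit means a malformed pull request fails visibly before any human reviewer looks at it.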

github merge page with travis check

Screenshot of GitHub pull request, showing a passed Travis check

I have previously written about the use of Continuous Integration in ontology development – although CI was developed primarily for software engineering products, it works surprisingly well for ontologies and for metadata. This is perhaps not surprising if we consider these engineered artefacts in the way software is.

The whole end-to-end process is documented in this FAQ entry on the site.

The system has been working extremely well and is popular among the ontology groups that contribute their expertise to OBO – before official launch of the new site, we had 31 closed pull requests. Whereas previously a member of the OBO team would have to coordinate with the ontology provider to enter the metadata (a time consuming process prone to errors and backlogs), now the provider has the ability to enter information themselves, with the benefit of validation from Travis and the OBO team.

Other features

The new site has many other improvements over the last one. It’s now possible to distinguish between the ontology sensu the umbrella entity vs individual ontology products or editions. For example, the various editions of Uberon (basic, core, composite metazoan) can each be individually registered and validated. There are also a growing number of properties that can be associated with the ontology, from a twitter handle to logos to custom browsers. Hopefully some of these features will be useful to the OBO community. Of course, the overall look could still be massively improved by someone with some web design chops (it’s very bland generic bootstrap at the moment). But this isn’t really the point of this post, which is more about the application of a certain set of technologies to allow a balance between centralization and distributed editing that suits the needs of the OBO Foundry. Leveraging existing services like GitHub pages, Travis and the GitHub fork-and-pull-request model allows us to get more mileage for less effort.

The future of metadata

The new OBO site was inspired in many ways by the system developed by my colleague Jorrit Poelen for the Global Biotic Interactions database (GloBI), in which simple JSON metadata files describing each interaction dataset are provided in individual GitHub repositories. A central system periodically harvests these into a large searchable index, where different datasets are integrated. This is not so different from common practice among software developers, who provide metadata for their project in the form of pom.xml files and package.json files (not out of their love of metadata, but more because this provides a useful service or is necessary for working in a wider ecosystem, and integrating with other software components). As James Malone points out, it makes far more sense to simply pull this existing metadata than to force developers to register in a monolithic, rigid, centralized registry. If there are incentives for providers of any kind of information artefacts (software, ontologies, datasets) to provide richer metadata at source in large already-existing open repositories such as GitHub, then it does away with the need to build separately funded large monolithic registries. The new OBO system and the GloBI approach are demonstrating some of these incentives for ontologies and datasets. The current OBO system still has a large centralized aspect, due in part to the nature of the OBO Foundry, but in future may become more distributed.


Chris Mungall