Proposed strategy for semantics in RDF* and Property Graphs

Graph databases such as Neo4J are gaining in popularity. These are in many ways comparable to RDF databases (triplestores), but I will highlight three differences:

  1. The underlying data model in most graph databases is a Property Graph (PG). This means that information can be directly attached to edges. In RDF this can only be done indirectly via reification, reification-like models, or named graphs.
  2. RDF is based on open standards and comes with a standard query language (SPARQL), whereas a unified set of standards has yet to emerge for PGs.
  3. RDF has a formal semantics, and languages such as OWL can be layered on providing more expressive semantics.

RDF* (and its accompanying query language SPARQL*) is an attempt to bring PGs into RDF, thus providing an answer for points 1-2. More info can be found in this post by Olaf Hartig.

You can find more info in that post and in related docs, but briefly, RDF* adds syntax for attaching properties directly onto edges, e.g.

<<:bob foaf:friendOf :alice>> ex:certainty 0.9 .

This has a natural visual cognate:

[Fig 1: the :bob friend-of :alice edge, with the certainty value attached directly to the edge]

We can easily imagine building this out into a large graph of friend-of connections, or connecting other kinds of nodes, and keeping additional useful information on the edges.
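The edge properties are also directly queryable in SPARQL*. For instance, a query like the following would match on the annotated edges; a sketch, assuming the prefixes from the example above:

SELECT ?x ?y ?c
WHERE {
  <<?x foaf:friendOf ?y>> ex:certainty ?c .
  FILTER(?c > 0.8)
}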

But what about the 3rd item, semantics?

What about semantics?

For many in both linked data/RDF and in graph database/PG camps, this is perceived as a minor concern. In fact you can often find RDF people whinging about OWL being too complex or some such. The “semantic web” has even been rebranded as “linked data”. But in fact, in the life sciences many of us have found OWL to be incredibly useful, and being able to clearly state what your graphs mean has clear advantages.

OK, but then why not just use what we have already? OWL-DL already has a mapping to RDF, and any document in RDF is automatically an RDF* document, so problem solved?

Not quite. There are two issues with continuing the status quo in the world of RDF* and PGs:

  1. The mapping of OWL to RDF can be incredibly verbose and leads to unintuitive graphs that inhibit effective computation.
  2. OWL is not the only fruit. It is great for the use cases it was designed for, but there are other modes of inference and other frameworks beyond first-order logic that people care about.

Issues with existing OWL to RDF mapping

Let’s face it, the existing mapping is pretty ugly. This is especially true for life-science ontologies that are typically conceived of as relational graphs, where edges are formally SubClassOf-SomeValuesFrom axioms. See the post on obo json for more discussion of this. The basic idea here is that in OWL, object properties connect individuals (e.g. my left thumb is connected to my left hand via part-of). In contrast, classes are not connected directly via object properties; rather, they are related via subClassOf and class expressions. It is not meaningful in OWL to say “finger (class) part_of hand (class)”. Instead we seek to say “all instances of finger are part_of some x, where x is an instance of a hand”. In Manchester Syntax this has the compact form

Finger SubClassOf Part_of some Hand

This is translated to RDF as

Finger rdfs:subClassOf [
  a owl:Restriction ;
  owl:onProperty :part_of ;
  owl:someValuesFrom :Hand
] .

As an example, consider 3 classes in an anatomy ontology, finger, hand, and forelimb, all connected via part-ofs (i.e. every finger is part of some hand, and every hand is part of some forelimb). This looks sensible when we use a native OWL syntax, but when we encode as RDF we get a monstrosity:


Fig 2: (A) two axioms written in Manchester Syntax describing the anatomical relationships between three structures; (B) the corresponding RDF following the official OWL to RDF mapping, with 4 triples per existential axiom and the introduction of two blank nodes; (C) how the axioms are conceptualized by ontology developers and domain experts, and how most browsers render them. The disconnect between B and C is an enduring source of confusion for many.

This ugliness was not the result of some kind of perverse decision by the designers of the OWL specs, it’s a necessary consequence of the existing stack which bottoms out at triples as the atomic semantic unit.

In fact, in practice many people employ some kind of simplification, bypassing the official mapping and storing the edges as simple triples, even though this is semantically invalid. We can see this for example in how Wikidata loads OBOs into its triplestore. This can cause confusion: for example, WD stores reciprocal inverse axioms (e.g. part-of, has-part) even though these are meaningless when collapsed to simple triples.

I would argue there is an implicit contract when we say we are using a graph-based formalism: the structures in our model should correspond to the kinds of graphs we draw on whiteboards when representing an ontology or knowledge graph, and to the kinds of graphs that are useful for computation. The current mapping violates that implicit contract, usually causing a lot of confusion.

It has pragmatic implications too. Writing a SPARQL query that traverses a graph like the one in (B), following certain edge types but not others (one of the most common uses of ontologies in bioinformatics), is a horrendous task!
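For example, just fetching the direct part-of parents of each class under the official mapping means navigating through the blank node of every restriction; a sketch, eliding prefix declarations:

SELECT ?part ?whole
WHERE {
  # each existential axiom is spread over four triples and a blank node
  ?part rdfs:subClassOf ?r .
  ?r a owl:Restriction ;
     owl:onProperty :part_of ;
     owl:someValuesFrom ?whole .
}

And that is the easy case: chaining such edges requires property paths like rdfs:subClassOf/owl:someValuesFrom, which offer no way to check owl:onProperty along the path, so restricting traversal to certain edge types is effectively impossible.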

OWL is not the only knowledge representation language

The other reason not to stick with the status quo for semantics for RDF* and PGs is that we may want to go beyond OWL.

OWL is fantastic for the things it was designed for. In the life sciences, it is vital for automatic classification and semantic validation of large ontologies (see half of the posts in this blog site). It is incredibly useful for checking the biological validity of complex instance graphs against our encoded knowledge of the world.

However, not everything we want to say in a Knowledge Graph (KG) can be stated directly in OWL. OWL-DL is based on a fragment of first order logic (FOL); there are certainly things not in that fragment that are useful, but often we have to go outside strict FOL altogether. Much of biological knowledge is contextual and probabilistic. A lot of what we want to say is quantitative rather than qualitative.

For example, when relating a disease to a phenotype (both of which are conventionally modeled as classes, and thus not directly linkable via a property in OWL), it is usually false to say “every person with this disease has this phenotype“. We can invent all kinds of fudges for this – BFO has the concept of a disposition, but this is just a hack for not being able to state probabilistic or quantitative knowledge more directly.

A proposed path forward for semantics in Property Graphs and RDF*

RDF* provides us with an astoundingly obvious way to encode at least some fragment of OWL in a more intuitive way that preserves the graph-like natural encoding of knowledge. Rather than introduce additional blank nodes as in the current OWL to RDF mapping, we simply push the semantics onto the edge label!

Here is an example of how this might look in RDF* for the axioms in the figure above:

<<:finger :part-of :hand>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
<<:hand :part-of :forelimb>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .

I am assuming the existence of a vocabulary called owlstar here – more on that in a moment.

In any native visualization of RDF* this will end up looking like Fig 2C, with the semantics adorning the edges where they belong. For example:


Proposed owlstar mapping of an OWL subclass restriction. This is clearly simpler than the corresponding graph fragment in 2B. While the edge properties (in square brackets) may be too abstract to show an end user (or even a bioinformatician performing graph-theoretic operations), the core edge is meaningful and corresponds to how an anatomist or ordinary person might think of the relationship.
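For contrast with the query over the official mapping above, the same traversal under this proposal collapses to a single SPARQL* pattern (a sketch):

SELECT ?part ?whole
WHERE {
  <<?part :part-of ?whole>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
}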

Maybe this is all pretty obvious, and many people loading bio-ontologies into either Neo4j or RDF end up treating edges as edges anyway. You can see the mapping we use in our SciGraph Neo4J OWL Loader, which is used by both the Monarch Initiative and NIF Standard projects. The OLS Neo4J representation is similar. Pretty much anyone who has loaded the GO into a graph database has done the same thing, ignoring the OWL to RDF mapping. The same goes for the current wave of Knowledge Graph embedding-based machine learning approaches, which typically embed a simpler graphical representation.

So problem solved? Unfortunately, everyone is doing this differently, essentially throwing out OWL altogether. We lack a standard way to map OWL into Property Graphs, so everyone invents their own. The same is true for people using RDF stores: people often have their own custom OWL mapping that is less verbose. In some cases this is semantically dubious, as is the case for the Wikidata mapping.

The simple thing is for everyone to rally around a common standard mapping, and RDF* seems a good foundation. Even if you are using plain RDF, you could follow this standard and choose to map edge properties to reified nodes, or to named graphs, or to the Wikidata model. And if you are using a graph database like Neo4J, there is a straightforward mapping to edge properties.
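For example, under standard RDF reification the first owlstar edge above might become something like this (a sketch):

_:ax1 a rdf:Statement ;
    rdf:subject :finger ;
    rdf:predicate :part-of ;
    rdf:object :hand ;
    owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .

The semantics survive the round trip; the cost is the usual reification bloat.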

I will call this mapping OWL*, and it may look something like this:

RDF* → OWL interpretation:

<<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
→ ?c SubClassOf ?p some ?d

<<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfQCR ; owlstar:cardinality ?n .
→ ?c SubClassOf ?p exactly ?n ?d

<<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom ; owlstar:subjectContextProperty ?cp ; owlstar:subjectContextFiller ?cf .
→ (?c and ?cp some ?cf) SubClassOf ?p some ?d

Note that the core of each of these mappings is a single edge/triple between class c, class d, and an edge label p. The first row is a standard existential restriction, common to many ontologies. The second row is for statements such as ‘hand has part 5 fingers’, which is still essentially a link between a hand concept and a finger concept. The third is for a GCI, an advanced OWL construct that turns out to be quite intuitive and useful at the graph level, where we are essentially contextualizing the statement, e.g. in developmentally normal adult humans (context), hand has-part 5 finger.
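Concretely, that combined example might be written like this (a sketch; :HomoSapiens stands in for a more precise ‘developmentally normal adult human’ class):

# (hand and part-of some HomoSapiens) SubClassOf has-part exactly 5 finger
<<:hand :has-part :finger>> owlstar:hasInterpretation owlstar:SubClassOfQCR ;
    owlstar:cardinality 5 ;
    owlstar:subjectContextProperty :part-of ;
    owlstar:subjectContextFiller :HomoSapiens .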

When it comes to a complete encoding of all of OWL there may be decisions to be made as to when to introduce blank nodes versus cramming as much as possible into edge properties (e.g. for logical definitions), but even having a standard way of encoding subclass plus quantified restrictions would be a huge boon.

Bonus: Explicit deferral of semantics where required

Many biological relationships expressed in natural language in forms such as “Lmo-2 binds to Elf-2” or “crocodiles eat wildebeest” can cause formal logical modelers a great deal of trouble. See for example “‘Lmo-2 interacts with Elf-2’: On the Meaning of Common Statements in Biomedical Literature” (also slides), which lays out the different ways these seemingly straightforward statements about classes can be modeled. This is a very impressive and rigorous work (I will have more to say on how this aligns with GO-CAM in a future post), and it ends with an impressive Wall of Logic:


Dense logical axioms proposed by Schulz & Jansen for representing biological interactions

This is all well and good, but when it comes to storing the biological knowledge in a database, the majority of developers are going to expect to see this:


protein interaction represented as a single edge connecting two nodes, as represented in every protein interaction database

And this is not due to some kind of semantic laziness on their part: representing biological interactions using this graphical formalism (whether we are representing molecular interactions or ecological interactions) allows us to take advantage of powerful graph-theoretic algorithms to analyze data that are frankly much more useful than what we can do with a dense FOL representation.

I am sure this fact is not lost on the authors of the paper who might even regard this as somewhat trivial, but the point is that right now we don’t have a standard way of serializing more complex semantic expressions into the right graphs. Instead we have two siloed groups, one from a formal perspective producing complex graphs with precise semantics, and the other producing useful graphs with no encoding of semantics.

RDF* gives us the perfect foundation for being able to directly represent the intuitive biological statement in a way that is immediately computationally useful, and to adorn the edges with additional triples that more precisely state the desired semantics, whether it is using the Schulz FOL or something simpler (for example, a simple some-some statement is logically valid, if inferentially weak here).
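For example, the single edge every interaction database expects could be kept as-is, with the chosen reading attached as an edge property; a sketch, where owlstar:SomeSomeRelation is a hypothetical term for the weak some-some reading:

# some instance of Lmo-2 binds some instance of Elf-2
<<:Lmo-2 :binds :Elf-2>> owlstar:hasInterpretation owlstar:SomeSomeRelation .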

Beyond FOL

There is no reason to have a single standard for specifying semantics for RDF* and PGs. As hinted at in the initial example, there could be a vocabulary or series of vocabularies for making probabilistic assertions, either as simple assignments of probabilities or frequencies, e.g.

<<:RhinovirusInfection :has-symptom :RunnyNose>> probstar:hasFrequency 0.75 .

or more complex statements involving conditional probabilities between multiple nodes (e.g. probability of symptom given disease and age of patient), allowing encoding of ontological Bayesian networks and Markov networks.
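A sketch of what such a conditional statement might look like, using hypothetical probstar terms, with the conditioning context captured on a node:

<<:RhinovirusInfection :has-symptom :RunnyNose>>
    probstar:hasConditionalFrequency [
        probstar:frequency 0.9 ;
        probstar:givenContext :Child
    ] .
# read as: P(runny nose | rhinovirus infection, patient is a child) = 0.9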

We could also represent contextual knowledge, using a ‘that’ construct borrowed from IKL:

<<:clark_kent owl:sameAs :superman>> a ikl:that ; :believed-by :lois_lane .

which could be visually represented as:


Lois Lane believes Clark Kent is Superman. Here an edge has a link to another node rather than simply literals. Note that while possible in RDF*, in some graph databases such as Neo4j, edge properties cannot point directly to nodes, only indirectly through key properties. In other hypergraph-based graph DBs a direct link is possible.

Proposed Approach

What I propose is a series of lightweight vocabularies such as my proposed OWL*, accompanied by mapping tables such as the one above. I am not sure if W3C is the correct approach, or something more bottom-up. These would work directly in concert with RDF*, and extensions could easily be provided to work with the various ways to PG-ify RDF, e.g. reification, the Wikidata model, or named graphs.

The same standard could work for any PG database such as Neo4J. Of course, here we have the challenge of how best to encode IRIs in a framework that does not natively support them, but this is an orthogonal problem.

All of this would be non-invasive and unobtrusive to people already working with these technologies, as the underlying structures used to encode knowledge would likely not change, beyond additional adornment of edges. A perfect stealth standard!

It would help to have some basic tooling around this. I think the following would be straightforward and potentially very useful:

  • Implementation of the OWL* mapping of existing OWL documents to RDF* in tooling – maybe the OWLAPI, although we are increasingly looking to Python for our tooling (stay tuned to hear more on funowl).
  • This could also directly bypass RDF* and go directly to some PG representation, e.g. networkx in Python, or stored directly into Neo4J
  • Some kind of RDF* to Neo4J and SPARQL* to OpenCypher converters [which I assume will happen independently of anything proposed here]
  • An OWL-RL* reasoner that could demonstrate simple yet powerful and useful queries, e.g. property chaining in Wikidata

A rough sketch of this approach was posted on public-owl-dev to not much fanfare, but, umm, this may not be the right forum for this.

Glossing over the details

For a post about semantics, I am glossing over the semantics a bit, at least from a formal computer science perspective. Yes, of course, there are some difficult details to be worked out regarding the extent to which existing RDF semantics can be layered on, and how to make these proposed layers compatible. I’m omitting details here to try and give as simple an overview as possible. And it also has to be said, one has to be pragmatic here. People are already making property graphs and RDF graphs conforming to the simple structures I’m describing here. Just look at Wikidata and how it handles (or rather, ignores) OWL. I’m just the messenger here, not some semantic anarchist trying to blow things up. Rather than worrying about whether such and such a fragment of FOL is decidable (which, let’s face it, is not that useful a property in practice), let’s instead focus on coming up with pragmatic standards that are compatible with the way people are already using technology!

 

 

 

 

 

 


Debugging Ontologies using OWL Reasoning. Part 2: Unintentional Entailed Equivalence

This is part 2 in a series on pragmatic techniques for debugging ontologies. It follows from part 1, in which I outlined basic reasoner-based debugging using Protege and ROBOT, with the goal of detecting and debugging incoherent ontologies.

One potential problem that can arise is the inference of equivalence between two classes, where the equivalence is unintentional. The following example ontology from the previous post illustrates this:


ObjectProperty: part_of
Class: PNS
Class: Nerve SubClassOf: part_of some PNS
Class: PeripheralNerve EquivalentTo: Nerve and part_of some PNS

In this case PeripheralNerve and Nerve are entailed to be mutually equivalent. You can see this in Protege, as the two classes are grouped together with an equivalence symbol linking them:

[Screenshot: Protege showing Nerve and PeripheralNerve grouped together as equivalent]

As the explanation shows, the two classes are equivalent because (1) PNs are defined as any nerve in the PNS, and (2) nerve is asserted to be in the PNS.

We assume here that this is not the intent of the ontology developer; we assume they created distinct classes with distinct names as they believe them to be distinct. (Note that some ontologies such as SWEET employ equivalence axioms to denote two distinct terms that mean the same thing, but for this article we assume OBO-style ontology development).

When the ontology developer sees inferences like this, they will likely want to take some corrective action:

  • Under one scenario, the inference reveals to the ontology developer that nerve and peripheral nerve are in fact the same concept, and thus the two classes should be merged, with the label from one being retained as a synonym of the other.
  • Under the other scenario, the ontology developer realizes the concept they have been calling ‘Nerve’ encompasses more general neuron projection bundles found in the CNS; here they may decide to rename the concept (e.g. neuron projection bundle) and to eliminate or broaden the part_of axiom.

So far so good. But the challenge here is that an ontology with entailed equivalencies between pairs of classes is formally coherent: all classes are satisfiable, and there are no inconsistencies. It will not be caught by a pipeline that detects incoherencies such as unsatisfiable classes. This means you may end up accidentally releasing an ontology that has potentially serious biological problems. It also means we can’t use the same technique described in part 1 to make a debug module.

Formally we can state this as there being no unique class assumption in OWL. By creating two classes, c1 and c2, you are not saying that there is something that differentiates these, even if it is your intention that they are different.

Within the OBO ecosystem we generally strive to avoid equivalent named classes (the principle of orthogonality). There are known cases where equivalent classes join two ontologies (for example, GO cell and CL cell), but in general, when we find additional entailed pairs of equivalent classes not originally asserted, it’s a problem. I would hypothesize this is frequently true of non-OBO ontologies too.

Detecting unintended equivalencies with ROBOT

For the reasons stated above, ROBOT has configurable behavior for when it encounters equivalent classes. This can be controlled via the --equivalent-classes-allowed (shorthand: “-e”) option on the reason command. There are 3 options here:

  • none: any entailed equivalence axiom between two named classes will result in an error being thrown
  • all: permit all equivalence axioms, entailed or asserted
  • asserted-only: permit entailed equivalence axioms only if they match an asserted equivalence axiom, otherwise throw an error

If you are unsure of what to do it’s always a good idea to start stringent and pass ‘none’. If it turns out you need to maintain asserted equivalencies (for example, the GO/CL ‘cell’ case), then you can switch to ‘asserted-only’.

The ‘all’ option is generally too permissive for most OBO ontologies. However, for some use cases this may be selected. For example, if your ontology imports multiple non-orthogonal ontologies plus bridging axioms and you are using reasoning to find new equivalence mappings.

For example, on our peripheral nerve ontology, if we run

robot reason -e asserted-only -r elk -i pn.omn

We will get:


ERROR org.obolibrary.robot.ReasonOperation - Only equivalent classes that have been asserted are allowed. Inferred equivalencies are forbidden.
ERROR org.obolibrary.robot.ReasonOperation - Equivalence: <http://example.org/Nerve> == <http://example.org/PeripheralNerve>

ROBOT will also exit with a non-zero exit code, ensuring that your release pipeline fails fast, preventing accidental release of broken ontologies.

Debugging false equivalence

This satisfies the requirement that potentially false equivalence can be detected, but how does the ontology developer debug this?

A typical Standard Operating Procedure might be:

  • IF robot fails with unsatisfiable classes
    • Open ontology in Protege and switch on Elk
    • Go to Inferred Classification
    • Navigate to Nothing
    • For each class under Nothing
      • Select the “?” to get explanations
  • IF robot fails with equivalence class pairs
    • Open ontology in Protege and switch on Elk
    • For each class reported by ROBOT
      • Navigate to class
      • Observe the inferred equivalence axiom (in yellow) and select ?

There are two problems with this SOP, one pragmatic and the other a matter of taste.

The pragmatic issue is that there is a Protege explanation workbench bug that sometimes renders Protege unable to show explanations for equivalence axioms in reasoners such as Elk (see this ticket). This is fairly serious for large ontologies (although for our simple example or for midsize ontologies use of HermiT may be perfectly feasible).

But even in the case where this bug is fixed or circumvented, the SOP above is suboptimal in my opinion. One reason is that it is simply more complicated: in contrast to the SOP for dealing with incoherent classes, it is necessary to look at reports coming from outside Protege and perform additional search and lookup steps. The more fundamental reason is that the ontology is formally coherent even though it defies my expectation that it follow the unique class assumption. It is more elegant if we can directly encode the unique class assumption, and have the ontology be entailed to be incoherent when this is violated. That way we don’t have to bolt on additional SOP instructions or additional ad-hoc programmatic operations.

And crucially, it means the same ‘logic core dump’ operation described in the previous post can be used in exactly the same way.

Approach: SubClassOf means ProperSubClassOf

My approach here is to make explicit the assumption: every time an ontology developer asserts a SubClassOf axiom, they actually mean ProperSubClassOf.

To see exactly what this means, it helps to think in terms of Venn diagrams (Venn diagrams are my go-to strategy for explaining even the basics of OWL semantics). The OWL2 direct semantics are set-theoretic, with every class interpreted as a set, so this is a valid approach. When drawing Venn diagrams, sets are circles, and one circle being enclosed by another denotes subsetting. If circles overlap, this indicates set overlap, and if no overlap is shown the sets are assumed disjoint (have no members in common).

Let’s look at what happens when an ontology developer makes a SubClassOf link between PN and N. They may believe they are saying something like this:

[Venn diagram: PN drawn as a proper subset of Nerve]

i.e. implicitly indicating that there are some nerves that are not peripheral nerves.

But in fact the OWL SubClassOf operator is interpreted set-theoretically as subset-or-equal-to (i.e. ⊆), which can be visually depicted as:

[Venn diagram: PN drawn as a subset of Nerve, possibly coinciding with it]

In this case our ontology developer wants to exclude the latter as a possibility (even if we end up with a model in which these two are equivalent, the ontology developer needs to arrive at this conclusion by having the incoherencies in their own internal model revealed).

To make this explicit, there needs to be an additional class declared that (1) is disjoint from PN and (2) is a subtype of Nerve. We can think of this as a ProperSubClassOf axiom, which can be depicted visually as:

[Venn diagram: Nerve containing two disjoint subsets, PN and its sibling]

If we encode this on our test ontology:


ObjectProperty: part_of
Class: PNS
Class: Nerve SubClassOf: part_of some PNS
Class: PeripheralNerve EquivalentTo: Nerve and part_of some PNS
Class: OtherNerve SubClassOf: Nerve DisjointWith: PeripheralNerve

We can see that the ontology is inferred to be incoherent. There is no need for an additional post-hoc check: the generic incoherence detection mechanism of ROBOT does not need any special behavior, and the ontology editor sees all problematic classes in red, and can navigate to all problems by looking under owl:Nothing:

[Screenshot: Protege showing the problematic classes in red under owl:Nothing]

Of course, we don’t want to manually assert this all the time, and litter our ontology with dreaded “OtherFoo” classes. If we can make the assumption that all asserted SubClassOfs are intended to be ProperSubClassOfs, then we can just do this procedurally as part of the ontology validation pipeline.

One way to do this is to inject a sibling for every class-parent pair and assert that the siblings are disjoint.

The following SPARQL will generate the disjoint siblings (if you don’t know SPARQL don’t worry, this can all be hidden for you):


prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
  ?sibClass a owl:Class ;
    owl:disjointWith ?c ;
    rdfs:subClassOf ?p ;
    rdfs:label ?sibLabel
}
WHERE {
  ?c rdfs:subClassOf ?p .
  FILTER(isIRI(?c))
  FILTER(isIRI(?p))
  FILTER NOT EXISTS { ?c owl:deprecated "true"^^xsd:boolean }
  OPTIONAL {
    ?c rdfs:label ?clabel .
    BIND(concat("DISJOINT-SIB-OF ", ?clabel) AS ?sibLabel)
  }
  BIND(UUID() AS ?sibClass)
}

Note that we exclude deprecated/obsolete classes. The generated disjoint siblings are given a random UUID, and the label DISJOINT-SIB-OF X. You could also opt for the simpler “Other X” as in the above example; it doesn’t matter, as only the ontology developer sees this, and only when debugging.

This can be encoded in a workflow, such that the axioms are injected as part of a test procedure. You likely do not want these axioms to leak out into the release version and confuse people.

Future versions of ROBOT may include a convenience function for doing this, but for now you can do this in your Makefile:


SRC = pn.omn

disjoint_sibs.owl: $(SRC)
	robot relax -i $< query --format ttl -c construct-disjoint-siblings.sparql $@

test.owl: $(SRC) disjoint_sibs.owl
	robot merge -i $< -i disjoint_sibs.owl -o $@

Debugging Ontologies using OWL Reasoning. Part 1: Basics and Disjoint Classes axioms

This is the first part in a series on pragmatic techniques for debugging ontologies. See also part 2

All software developers are familiar with the concept of debugging, a process for finding faults in a program. The term ‘bug’ has been used in engineering since the 19th century, and was used by Grace Hopper to describe a literal bug gumming up the works of the Mark II computer. Since then, debugging and debugging tools have become ubiquitous in computing, and the modern software developer is fortunate enough to have a large array of tools and techniques at their disposal. These include unit tests, assertions and interactive debuggers.

The original bug

Ontology development has many parallels with software development, so it’s reasonable to assume that debugging techniques from software can be carried over to ontologies. I’ve previously written about use of continuous integration in ontology development, and it is now standard to use Travis to check pull requests on ontologies. Of course, there are important differences between software and ontology development. Unlike typical computer programs, ontologies are not executed, so the concept of an interactive debugger stepping through an execution sequence doesn’t quite translate to ontologies. However, there are still a wealth of tooling options for ontology developers, many of which are under-used.

There is a great deal of excellent academic material on the topic of ontology debugging; see for example the 2013 and 2014 proceedings of the excellently named Workshop on Debugging Ontologies and Ontology Mappings (WoDOOM), or the seminal Debugging OWL Ontologies. However, many ontology developers may not be aware of some of the more basic ‘blue collar’ techniques in use for ontology debugging.

Using OWL Reasoning and disjointness axioms to debug ontologies

In my own experience one of the most effective means of finding problems in ontologies is through the use of OWL reasoning. Reasoning is frequently used for automated classification, and this is supported in tools such as ROBOT through the reason command. In addition to classification, reasoning can also be used to debug an ontology, usually by inferring if the ontology is incoherent. The term ‘incoherent’ isn’t a value judgment here; it’s a technical term for an ontology that is either inconsistent or contains unsatisfiable classes, as described in this article by Robert Stevens, Uli Sattler and Phillip Lord.

A reasoner will not find bugs without some help from you, the ontology developer.


You have to impart some of your own knowledge of the domain into the ontology in order for incoherency to be detected. This is usually done by adding axioms that constrain the space of what is possible. The Ontogenesis article has a nice example using red blood cells and the ‘only’ construct. I will give another example using the DisjointClasses axiom type; in my experience working on large inter-related ontologies, disjointness axioms are one of the most effective ways of finding bugs (and they have the added advantage of being within the profile of OWL understood by Elk).

Let’s take the following example, a slice of an anatomical ontology dealing with cranial nerves. The underlying challenge here is the fact that the second cranial nerve (the optic nerve) is not in fact a nerve, as it is part of the central nervous system (CNS), whereas true nerves are part of the peripheral nervous system (PNS). This seeming inconsistency has plagued different anatomy ontologies.

Ontology: <http://example.org>
Prefix: : <http://example.org/>
ObjectProperty: part_of
Class: CNS
Class: PNS
Class: StructureOfPNS EquivalentTo: part_of some PNS
Class: StructureOfCNS EquivalentTo: part_of some CNS
DisjointClasses: StructureOfPNS, StructureOfCNS
Class: Nerve SubClassOf: part_of some PNS
Class: CranialNerve SubClassOf: Nerve
Class: CranialNerveII SubClassOf: CranialNerve, part_of some CNS


You may have noted this example uses slightly artificial classes of the form “Structure of X”. These are not strictly necessary; we’ll return to this when we discuss General Class Inclusion (GCI) axioms in a future part.

If we load this into Protege and switch on the reasoner, we will see that CranialNerveII shows up red, indicating it is unsatisfiable and rendering the ontology incoherent. We can easily find all unsatisfiable classes under the ‘Nothing’ builtin class on the inferred hierarchy view. Clicking on the ‘?’ button will make Protege show an explanation, such as the following:

[Screenshot: Protege explanation for the unsatisfiability of CranialNerveII]

This shows all the axioms that lead to the conclusion that CranialNerveII is unsatisfiable. At least one of these axioms must be wrong (for example, the assumption that all cranial nerves are nerves may be terminologically justified, but could be wrong here; or perhaps it is the assumption that CN II is actually a cranial nerve; or we may simply want to relax the constraint and allow spatial overlap between peripheral and central nervous system parts). The ontology developer can then set about fixing the ontology until it is coherent.

Detecting incoherencies as part of a workflow

Protege provides a nice way of finding ontology incoherencies, and of debugging them by examining explanations. However, it is still possible to accidentally release an incoherent ontology, since the ontology editor is not compelled to check for unsatisfiabilities in Protege prior to saving. It may even be possible for an incoherency to be inadvertently introduced through changes to an upstream dependency, for example, by rebuilding an import module.

Luckily, if you are using ROBOT to manage your release process, then it should be all but impossible for you to accidentally release an incoherent ontology. This is because the robot reason command will throw an error if the ontology is incoherent. If you are using robot as part of a Makefile-based workflow (as configured by the ontology starter kit) then this will block progression to the next step, as ROBOT returns with a non-zero exit code when performing a reasoner operation on an incoherent ontology. Similarly, if you are using Travis-CI to vet pull requests or check the current ontology state, then the travis build will automatically fail if an incoherency is encountered.


ROBOT reason flow diagram. Exiting with system code 0 indicates success, non-zero failure.

Running robot reason on our example ontology yields:

$ robot reason -r ELK -i cranial.omn
ERROR org.obolibrary.robot.ReasonerHelper - There are 1 unsatisfiable classes in the ontology.
ERROR org.obolibrary.robot.ReasonerHelper -     unsatisfiable: http://example.org/CranialNerveII

Generating debug modules – incoherent SLME

Large ontologies can strain the limits of the laptop computers usually used to develop ontologies. It can be useful to make something analogous to a ‘core dump’ in software debugging — a standalone minimal component that can be used to reproduce the bug. This is a module extract (using a standard technique like SLME) seeded by all unsatisfiable classes (there may be multiple). This provides sufficient axioms to generate all explanations, plus additional context.

I use the term ‘unsatisfiable module’ for this artefact. This can be done using the robot reason command with the “--debug-unsatisfiable” option.

In our Makefiles we often have a target like this:

debug.owl: my-ont.owl
        robot reason -i  $< -r ELK -D $@

If the ontology is incoherent then “make debug.owl” will make a small-ish standalone file that can be easily shared and quickly loaded in Protege for debugging. The ontology will be self-contained with no imports – however, if the axioms come from different ontologies in an import chain, then each axiom will be annotated with the source ontology, making it easier for you to track down the problematic import. This can be very useful for large ontologies with multiple dependencies, where there may be different versions of the same ontology in different import chains. 

Coming up

The next article will deal with the case of detecting unwanted equivalence axioms in ontologies, and future articles in the series will deal with practical tips on how best to use disjointness axioms and other constraints in your ontologies.

Carry on reading: Part 2, Unintentional Entailed Equivalence


Acknowledgments

Thanks to Nico Matentzoglu for comments on a draft of this post.

Introduction to Protege and OWL for the Planteome project

As a part of the Planteome project, we develop common reference ontologies and applications for plant biology.

[Image: Planteome logo]

As an initial phase of this work, we are transitioning from editing standalone ontologies in OBO-Edit to integrated ontologies using increased OWL axiomatization and reasoning. In November we held a workshop that brought together plant trait experts from across the world, and developed a plan for integrating multiple species-specific ontologies with a reference trait ontology.

As part of the workshop, we took a tour of some of the fundamentals of OWL, hybrid obo/owl editing using Protege 5, and using reasoners and template-based systems to automate large portions of ontology development.

I based the material on an earlier tutorial prepared for the Gene Ontology editors; it’s available on the Planteome GitHub Repository at:

https://github.com/Planteome/protege-tutorial

GO annotation origami: Folding and unfolding class expressions

With the introduction of Gene Association Format (GAF) v2, curators are no longer restricted to pre-composed GO terms – they can use a limited form of anonymous OWL Class Expressions of the form:

GO_Class AND (Rel_1 some V_1) AND (Rel_2 some V_2)

The set of relationships is specified in column 16 of the GAF file.

However, many tools are not capable of using class expressions – they discard the additional information, leaving only the pre-composed GO_Class.

Using OWLTools it is possible to translate a GAF-v2 set of associations and an ontology O to an equivalent GAF-v1 set of associations plus an analysis ontology O-ext. The analysis ontology O-ext contains the set of anonymous class expressions folded into named classes, together with equivalence axioms, and pre-reasoned into a hierarchy using Elk.

See http://code.google.com/p/owltools/wiki/AnnotationExtensionFolding

For example, given a GO annotation of a gene ‘geneA’:

gene: geneA
annotation_class:  GO:0006915 ! apoptosis
annotation_extension: occurs_in(CL:0000700) ! dopaminergic neuron

The folding process will generate a class with a non-stable URI, automatic label and equivalence axiom:

Class: GO/TEMP_nnnn
  Annotations: label "apoptosis and occurs_in some dopaminergic neuron"
  EquivalentTo: 'apoptosis' and occurs_in some 'dopaminergic neuron'
  SubClassOf: 'neuron apoptosis'

This class will automatically be placed in the hierarchy using the reasoner (e.g. under ‘neuron apoptosis’). For the reasoning step to achieve optimal results, the go-plus-dev.owl version should be used (see new GO documentation). A variant of this step is to perform folding to find a more specific subclass than the one used for direct annotation.

The reverse operation – unfolding – is also possible. For optimal results, this relies on Equivalent Classes axioms declared in the ontology, so make sure to use the go-plus-dev.owl. Here an annotation to a pre-composed complex term (e.g. neuron apoptosis) is replaced by an annotation to a simpler GO term (e.g. apoptosis) with column 16 filled in (e.g. occurs_in(neuron)).
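For example, unfolding would rewrite an annotation as follows (a sketch; the IDs for ‘neuron apoptosis’ and ‘neuron’ are shown for illustration and should be checked against the current GO and CL):

gene: geneA
annotation_class:  GO:0051402 ! neuron apoptosis

becomes

gene: geneA
annotation_class:  GO:0006915 ! apoptosis
annotation_extension: occurs_in(CL:0000540) ! neuron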

The folding operation allows legacy tools to take some advantage of GO annotation extensions by generating an ‘analysis ontology’ (care must be taken in how this is presented to the user, if at all). Ideally more tools will use OWL as the underlying ontology model and be able to handle c16 annotations directly, ultimately requiring less pre-coordination in the GO.

 

Querying for connections between the GO and FMA

Can we query for connections between FMA and GO? This should be
possible by using a combination of

  • GO
  • Uberon
  • FMA
  • Axioms linking GO and Uberon (x-metazoan-anatomy)
  • Axioms linking FMA and Uberon (uberon-to-fma)

This may seem like more components than is necessary. However,
remember that GO is a multi-species ontology, and “heart development”
in GO covers not only vertebrate hearts, but also (perhaps
controversially) drosophila “hearts”. In contrast, the FMA class for
“heart” represents a canonical adult human heart. This is why we have
to go via Uberon, which covers similar taxonomic territory to GO. The
uberon class called “heart” covers all hearts.

GO to metazoan anatomical structures

http://purl.obolibrary.org/obo/go/extensions/x-metazoan-anatomy.owl contains axioms of the form:


'heart morphogenesis' EquivalentTo 'anatomical structure morphogenesis' and
'results in morphogenesis of' some uberon:heart

(note that sub-properties of ‘results in developmental progression of’
are used here)

Generic metazoan anatomy to FMA

http://purl.obolibrary.org/obo/uberon/bridge/uberon-bridge-to-fma.owl contains axioms of the form:


fma:heart EquivalentTo uberon:heart and part_of some 'Homo sapiens'

GO to FMA

Note that there is no existential dependence between go ‘heart development’ and fma:heart. This is as it should be – if there were no human hearts then there would still be heart development processes. This issue is touched on in Chimezie Ogbuji‘s presentation at DILS 2012.

This lack of existential dependence has consequences for querying
connections. An OWL query for:

?p SubClassOf 'results in developmental progression of' some ?u

will return GO-Uberon connections only.

We must perform a join in order to get what we want:

?p SubClassOf 'results in developmental progression of' some ?u,
?a SubClassOf ?u,
?a part_of some 'Homo sapiens'

Actually executing this query is not straightforward. Ideally we would
have a way of using OWL syntax, such as the above. To get complete
results, either EL++ or RL reasoning is required. In the next post I’ll present some possible options for issuing this query.
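To give a flavor of what is involved, here is roughly what a pure SPARQL attempt over the asserted triples looks like (a sketch: prefix declarations are elided, the prefixed names for the relations are illustrative, and this still misses anything that requires subclass or sub-property reasoning):

SELECT ?p ?a
WHERE {
  # ?p EquivalentTo ... and 'results in morphogenesis of' some ?u
  ?p owl:equivalentClass/owl:intersectionOf/rdf:rest*/rdf:first ?r1 .
  ?r1 owl:onProperty ro:results_in_morphogenesis_of ;
      owl:someValuesFrom ?u .
  # ?a EquivalentTo ?u and part_of some 'Homo sapiens'
  ?a owl:equivalentClass/owl:intersectionOf/rdf:rest*/rdf:first ?u, ?r2 .
  ?r2 owl:onProperty :part_of ;
      owl:someValuesFrom taxon:Homo_sapiens .
}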

Elk disjoint hack

Elk is a blindingly fast EL++ reasoner. Unfortunately, it doesn’t yet support the full EL++ profile – in particular it lacks disjointness axioms. This is unfortunate, as these kinds of axioms are incredibly useful for integrity checking. See the methods section of the Uberon paper for some details on how partwise disjointness axioms were created.

However, Elk does support intersection and equivalence. This means we should be able to perform a translation:

DisjointClasses(x1, x2, …, xn) ⇒
EquivalentClasses(owl:Nothing IntersectionOf(xi xj)) for all i<j<=n

I asked about this on the Elk mail list – see Satisfiability checking and DisjointClasses axioms

The problem is that whilst Elk supports intersection and equivalence, it doesn’t support Nothing. This means that there may be corner cases in which it doesn’t work.

Proper disjointness support may be coming in the next version of Elk, but it’s been a few months, so I decided to go ahead and implement the above translation in OWLTools (also available in Oort).

If we have an ontology such as foo.owl:

Ontology: <http://example.org/x.owl>

Class: :reasoner
Class: :animal
  DisjointWith: :reasoner

Class: :elk
  SubClassOf: :reasoner, :animal

We can translate it using owltools:

owltools foo.owl --translate-disjoints-to-equivalents -o file://`pwd`/foo-x.owl

Remember, ordering of arguments is significant in owltools – make sure you translate *after* the ontology is loaded.
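The translated foo-x.owl should then contain something like this (a sketch in Manchester syntax):

Class: :elk
  SubClassOf: :reasoner, :animal

Class: owl:Nothing
  EquivalentTo: :reasoner and :animal

Since :elk is a subclass of both conjuncts, Elk infers it to be a subclass of the class equated with owl:Nothing, which Protege surfaces as unsatisfiability.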

And then load this into Protege and reason over it using Elk. As expected, “elk” is unsatisfiable.

You can also do the checking directly in owltools:

owltools foo.owl --translate-disjoints-to-equivalents --run-reasoner -r elk -u

The “-u” option will check for unsatisfiable classes and exit with a nonzero code if any are found, allowing this to be used within a CI system like Jenkins (see this previous post).

You can also use this transform within Oort (command line version only):

ontology-release-runner --translate-disjoints-to-equivalents --reasoner elk foo.owl

Remember, there are corner cases where this translation will not work. Nevertheless, this can be useful as part of an “early warning” system, backed up by slower guaranteed checks running in the background with HermiT or some other reasoner.

Perhaps the ontologies I work with have a simpler structure, but so far I have found this strategy to be successful, identifying subtle part-disjointness problems, and not giving any false positives. There don’t appear to be any scalability problems, with Elk being its usual zippy self even when uberon is loaded with ncbitaxon/taxslim and taxon constraints translated into Nothing-axioms (~3000 disjointness axioms).