Proposed strategy for semantics in RDF* and Property Graphs

Update 2020-09-12: I created a GitHub repo that concretizes part of this proposal: https://github.com/cmungall/owlstar

Graph databases such as Neo4J are gaining in popularity. These are in many ways comparable to RDF databases (triplestores), but I will highlight three differences:

  1. The underlying data model in most graph databases is a Property Graph (PG). This means that information can be directly attached to edges. In RDF this can only be done indirectly, via reification, reification-like models, or named graphs.
  2. RDF is based on open standards and comes with a standard query language (SPARQL), whereas a unified set of standards has yet to emerge for PGs.
  3. RDF has a formal semantics, and languages such as OWL can be layered on providing more expressive semantics.

RDF* (and its accompanying query language SPARQL*) is an attempt to bring PGs into RDF, thus providing an answer for points 1-2. More info can be found in this post by Olaf Hartig.

You can find more info in that post and in related docs, but briefly: RDF* adds syntax for attaching properties directly onto edges, e.g.

<<:bob foaf:friendOf :alice>> ex:certainty 0.9 .

This has a natural visual cognate:

[Figure: the :bob friendOf :alice triple drawn as an edge, with the ex:certainty value attached directly to the edge]

We can easily imagine building this out into a large graph of friend-of connections, or connecting other kinds of nodes, and keeping additional useful information on the edges.
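This carries over to querying: SPARQL* lets us match and filter on edge annotations directly. A minimal sketch (prefix declarations omitted):

SELECT ?x ?y ?conf WHERE {
  <<?x foaf:friendOf ?y>> ex:certainty ?conf .
  FILTER(?conf > 0.8)
}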

But what about the 3rd item, semantics?

What about semantics?

For many in both linked data/RDF and in graph database/PG camps, this is perceived as a minor concern. In fact you can often find RDF people whinging about OWL being too complex or some such. The “semantic web” has even been rebranded as “linked data”. But in fact, in the life sciences many of us have found OWL to be incredibly useful, and being able to clearly state what your graphs mean has clear advantages.

OK, but then why not just use what we have already? OWL-DL already has a mapping to RDF, and any document in RDF is automatically an RDF* document, so problem solved?

Not quite. There are two issues with continuing the status quo in the world of RDF* and PGs:

  1. The mapping of OWL to RDF can be incredibly verbose and leads to unintuitive graphs that inhibit effective computation.
  2. OWL is not the only fruit. It is great for the use cases it was designed for, but there are other modes of inference and other frameworks beyond first-order logic that people care about.

Issues with existing OWL to RDF mapping

Let’s face it, the existing mapping is pretty ugly. This is especially true for life-science ontologies, which are typically construed as relational graphs, where edges are formally SubClassOf-SomeValuesFrom axioms. See the post on obo json for more discussion of this. The basic idea here is that in OWL, object properties connect individuals (e.g. my left thumb is connected to my left hand via part-of). In contrast, classes are not connected directly via object properties; rather, they are related via subClassOf and class expressions. It is not meaningful in OWL to say “finger (class) part_of hand (class)”. Instead we seek to say “all instances of finger are part_of some x, where x is an instance of a hand”. In Manchester Syntax this has the compact form

Finger SubClassOf Part_of some Hand

This is translated to RDF as

:Finger rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty :part_of ;
    owl:someValuesFrom :Hand
] .

As an example, consider 3 classes in an anatomy ontology, finger, hand, and forelimb, all connected via part-ofs (i.e. every finger is part of some hand, and every hand is part of some forelimb). This looks sensible when we use a native OWL syntax, but when we encode it as RDF we get a monstrosity:


Fig2 (A) two axioms written in Manchester Syntax describing anatomical relationship between three structures (B) corresponding RDF following official OWL to RDF mapping, with 4 triples per existential axiom, and the introduction of two blank nodes (C) How the axioms are conceptualized by ontology developers, domain experts and how most browsers render them. The disconnect between B and C is an enduring source of confusion among many.

This ugliness was not the result of some kind of perverse decision by the designers of the OWL specs, it’s a necessary consequence of the existing stack which bottoms out at triples as the atomic semantic unit.

In fact, in practice many people employ some kind of simplification, bypassing the official mapping and storing the edges as simple triples, even though this is semantically invalid. We can see this for example in how Wikidata loads OBOs into its triplestore. This can cause confusion: for example, WD stores reciprocal inverse axioms (e.g. part-of, has-part) even though inverses are meaningless when collapsed to simple triples.

I would argue there is an implicit contract when we say we are using a graph-based formalism: the structures in our model should correspond to the kinds of graphs we draw on whiteboards when representing an ontology or knowledge graph, and to the kinds of graphs that are useful for computation. The current mapping violates that implicit contract, usually causing a lot of confusion.

It has pragmatic implications too. Writing a SPARQL query that traverses a graph like the one in (B), following certain edge types but not others (one of the most common uses of ontologies in bioinformatics), is a horrendous task!
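To see why, here is what even a single hop of that traversal looks like over the mapped graph (a sketch; prefixes omitted). Worse, traversing such edges transitively, while following part_of restrictions but not other properties, is not expressible as a simple property path at all:

SELECT ?part ?whole WHERE {
  # one hop through the blank-node restriction structure of Fig 2B
  ?part rdfs:subClassOf ?r .
  ?r a owl:Restriction ;
     owl:onProperty :part_of ;
     owl:someValuesFrom ?whole .
}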

OWL is not the only knowledge representation language

The other reason not to stick with the status quo for semantics for RDF* and PGs is that we may want to go beyond OWL.

OWL is fantastic for the things it was designed for. In the life sciences, it is vital for automatic classification and semantic validation of large ontologies (see half of the posts in this blog site). It is incredibly useful for checking the biological validity of complex instance graphs against our encoded knowledge of the world.

However, not everything we want to say in a Knowledge Graph (KG) can be stated directly in OWL. OWL-DL is based on a fragment of first order logic (FOL); there are certainly things not in that fragment that are useful, but often we have to go outside strict FOL altogether. Much of biological knowledge is contextual and probabilistic. A lot of what we want to say is quantitative rather than qualitative.

For example, when relating a disease to a phenotype (both of which are conventionally modeled as classes, and thus not directly linkable via a property in OWL), it is usually false to say “every person with this disease has this phenotype”. We can invent all kinds of fudges for this – BFO has the concept of a disposition, but this is just a hack for not being able to state probabilistic or quantitative knowledge more directly.

A proposed path forward for semantics in Property Graphs and RDF*

RDF* provides us with an astoundingly obvious way to encode at least some fragment of OWL in a more intuitive way that preserves the natural graph-like encoding of knowledge. Rather than introduce additional blank nodes as in the current OWL to RDF mapping, we simply push the semantics onto the edge label!

Here is an example of how this might look for the axioms in the figure above, in RDF*:

<<:finger :part-of :hand>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
<<:hand :part-of :forelimb>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .

I am assuming the existence of a vocabulary called owlstar here – more on that in a moment.

In any native visualization of RDF* this will end up looking like Fig 2C, with the semantics adorning the edges where they belong. For example:


Proposed owlstar mapping of an OWL subclass restriction. This is clearly simpler than the corresponding graph fragment in 2B. While the edge properties (in square brackets) may be too abstract to show an end user (or even a bioinformatician performing graph-theoretic operations), the core edge is meaningful and corresponds to how an anatomist or ordinary person might think of the relationship.

Maybe this is all pretty obvious, and many people loading bio-ontologies into either Neo4j or RDF end up treating edges as edges anyway. You can see the mapping we use in our SciGraph Neo4J OWL Loader, which is used by both Monarch Initiative and NIF Standard projects. The OLS Neo4J representation is similar. Pretty much anyone who has loaded the GO into a graph database has done the same thing, ignoring the OWL to RDF mapping. The same goes for the current wave of Knowledge Graph embedding based machine learning approaches, which typically embed a simpler graphical representation.

So problem solved? Unfortunately, everyone is doing this differently, essentially throwing out OWL altogether. We lack a standard way to map OWL into Property Graphs, so everyone invents their own. The same is true for people using RDF stores: people often have their own custom OWL mapping that is less verbose. In some cases this is semantically dubious, as is the case for the Wikidata mapping.

The simple thing would be for everyone to rally around a common standard mapping, and RDF* seems a good foundation. Even if you are using plain RDF, you could follow this standard and choose to map edge properties to reified nodes, or to named graphs, or to the Wikidata model. And if you are using a graph database like Neo4j, there is a straightforward mapping to edge properties.
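For example, in Neo4j the finger edge from earlier might be loaded like this (a sketch in Cypher; the node label and property keys here are hypothetical choices, not part of any standard):

// create the core edge, with the owlstar interpretation kept as an edge property
CREATE (f:Class {iri: ':finger'})-[:part_of {interpretation: 'SubClassOfSomeValuesFrom'}]->(h:Class {iri: ':hand'})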

I will call this mapping OWL*, and it may look something like this:

RDF* pattern ⇒ OWL interpretation:

<<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
  ⇒ ?c SubClassOf ?p some ?d

<<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfQCR ; owlstar:cardinality ?n .
  ⇒ ?c SubClassOf ?p exactly ?n ?d

<<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom ; owlstar:subjectContextProperty ?cp ; owlstar:subjectContextFiller ?cf .
  ⇒ (?c and ?cp some ?cf) SubClassOf ?p some ?d

Note that the core of each of these mappings is a single edge/triple between class c, class d, and an edge label p. The first row is a standard existential restriction, common to many ontologies. The second row is for statements such as ‘hand has part 5 fingers’, which is still essentially a link between a hand concept and a finger concept. The third is for a GCI, an advanced OWL concept which turns out to be quite intuitive and useful at the graph level, where we are essentially contextualizing the statement: e.g. in developmentally normal adult humans (context), hand has-part 5 fingers.
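Combining the second and third rows, that example might be written as follows (a sketch; the context class and the specific property names here are hypothetical):

<<:hand :has-part :finger>> owlstar:hasInterpretation owlstar:SubClassOfQCR ;
    owlstar:cardinality 5 ;
    owlstar:subjectContextProperty :part-of ;
    owlstar:subjectContextFiller :DevelopmentallyNormalAdultHuman .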

When it comes to a complete encoding of all of OWL, there may be decisions to be made as to when to introduce blank nodes vs. cramming as much as possible into edge properties (e.g. for logical definitions), but even having a standard way of encoding subclass plus quantified restrictions would be a huge boon.

Bonus: Explicit deferral of semantics where required

Many biological relationships expressed in natural language in forms such as “Lmo-2 binds to Elf-2” or “crocodiles eat wildebeest” can cause formal logical modelers a great deal of trouble. See for example “Lmo-2 interacts with Elf-2” – On the Meaning of Common Statements in Biomedical Literature (also slides), which lays out the different ways these seemingly straightforward statements about classes can be modeled. This is a very impressive and rigorous work (I will have more to say on how this aligns with GO-CAM in a future post), and it ends with an impressive Wall of Logic:


Dense logical axioms proposed by Schulz & Jansen for representing biological interactions

This is all well and good, but when it comes to storing the biological knowledge in a database, the majority of developers are going to expect to see this:


Protein interaction represented as a single edge connecting two nodes, as in every protein interaction database.

And this is not due to some kind of semantic laziness on their part: representing biological interactions using this graphical formalism (whether we are representing molecular interactions or ecological interactions) allows us to take advantage of powerful graph-theoretic algorithms for analyzing data, which are frankly much more useful than what we can do with a dense FOL representation.

I am sure this fact is not lost on the authors of the paper who might even regard this as somewhat trivial, but the point is that right now we don’t have a standard way of serializing more complex semantic expressions into the right graphs. Instead we have two siloed groups, one from a formal perspective producing complex graphs with precise semantics, and the other producing useful graphs with no encoding of semantics.

RDF* gives us the perfect foundation for being able to directly represent the intuitive biological statement in a way that is immediately computationally useful, and to adorn the edges with additional triples that more precisely state the desired semantics, whether it is using the Schulz FOL or something simpler (for example, a simple some-some statement is logically valid, if inferentially weak here).
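For example, the interaction edge above could be given that weak-but-valid reading explicitly (a sketch; owlstar:SomeSome is a hypothetical vocabulary term meaning “some instance of the subject is related to some instance of the object”):

<<:Lmo-2 :binds-to :Elf-2>> owlstar:hasInterpretation owlstar:SomeSome .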

Beyond FOL

There is no reason to have a single standard for specifying semantics for RDF* and PGs. As hinted in the initial example, there could be a vocabulary or series of vocabularies for making probabilistic assertions, either as simple assignments of probabilities or frequencies, e.g.

<<:RhinovirusInfection :has-symptom :RunnyNose>> probstar:hasFrequency 0.75 .

or more complex statements involving conditional probabilities between multiple nodes (e.g. probability of symptom given disease and age of patient), allowing encoding of ontological Bayesian networks and Markov networks.
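Such a statement might reify the conditioning context on the edge (purely a sketch; every probstar term here beyond hasFrequency is hypothetical):

<<:RhinovirusInfection :has-symptom :RunnyNose>> probstar:hasConditionalProbability [
    probstar:given :AgeUnder5 ;   # hypothetical context class
    probstar:value 0.9
] .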

We could also represent contextual knowledge, using a ‘that’ construct borrowed from IKL:

<<:clark_kent owl:sameAs :superman>> a ikl:that ; :believed-by :lois_lane .

which could be visually represented as:


Lois Lane believes Clark Kent is Superman. Here an edge has a link to another node rather than simply literals. Note that while this is possible in RDF*, in some graph databases such as Neo4j edge properties cannot point directly to nodes, only indirectly through key properties. In hypergraph-based graph DBs a direct link is possible.

Proposed Approach

What I propose is a series of lightweight vocabularies such as my proposed OWL*, accompanied by mapping tables such as the one above. I am not sure if W3C is the correct approach, or something more bottom-up. These would work directly in concert with RDF*, and extensions could easily be provided to work with the various ways to PG-ify RDF, e.g. reification, the Wikidata model, or named graphs.

The same standard could work for any PG database such as Neo4j. Of course, here we have the challenge of how best to encode IRIs in a framework that does not natively support them, but this is an orthogonal problem.

All of this would be non-invasive and unobtrusive to people already working with these technologies, as the underlying structures used to encode knowledge would likely not change, beyond an additional adornment of edges. A perfect stealth standard!

It would help to have some basic tooling around this. I think the following would be straightforward and potentially very useful:

  • Implementation of the OWL* mapping of existing OWL documents to RDF* in tooling – maybe the OWLAPI, although we are increasingly looking to Python for our tooling (stay tuned to hear more on funowl).
  • This could also bypass RDF* and go directly to some PG representation, e.g. networkx in Python, or store directly into Neo4j.
  • Some kind of RDF* to Neo4j and SPARQL* to OpenCypher converter [which I assume will happen independently of anything proposed here].
  • An OWL-RL* reasoner that could demonstrate simple yet powerful and useful queries, e.g. property chaining in Wikidata.

A rough sketch of this approach was posted on public-owl-dev to not much fanfare, but, umm, this may not be the right forum for this.

Glossing over the details

For a post about semantics, I am glossing over the semantics a bit, at least from a formal computer science perspective. Yes, of course there are some difficult details to be worked out regarding the extent to which existing RDF semantics can be layered on, and how to make these proposed layers compatible. I’m omitting details here to try and give as simple an overview as possible. It also has to be said that one has to be pragmatic here. People are already making property graphs and RDF graphs conforming to the simple structures I’m describing here. Just look at Wikidata and how it handles (or rather, ignores) OWL. I’m just the messenger here, not some semantic anarchist trying to blow things up. Rather than worrying about whether such and such a fragment of FOL is decidable (which, let’s face it, is not that useful a property in practice), let’s instead focus on coming up with pragmatic standards that are compatible with the way people are already using technology!

Debugging Ontologies using OWL Reasoning. Part 2: Unintentional Entailed Equivalence

This is the second part in a series on pragmatic techniques for debugging ontologies. It follows from part 1, which covered the basics of debugging with disjointness axioms using Protege and ROBOT.
In the previous part I outlined basic reasoner-based debugging using Protege and ROBOT. The goal was to detect and debug incoherent ontologies.

One potential problem that can arise is the inference of equivalence between two classes, where the equivalence is unintentional. The following example ontology from the previous post illustrates this:


ObjectProperty: part_of
Class: PNS
Class: Nerve SubClassOf: part_of some PNS
Class: PeripheralNerve EquivalentTo: Nerve and part_of some PNS

In this case PeripheralNerve and Nerve are entailed to be mutually equivalent. You can see this in Protege, as the two classes are grouped together with an equivalence symbol linking them:

[Screenshot: Protege inferred hierarchy, with Nerve and PeripheralNerve grouped together under an equivalence symbol]

As the explanation shows, the two classes are equivalent because (1) PNs are defined as any nerve in the PNS, and (2) nerve is asserted to be in the PNS.

We assume here that this is not the intent of the ontology developer; we assume they created distinct classes with distinct names as they believe them to be distinct. (Note that some ontologies such as SWEET employ equivalence axioms to denote two distinct terms that mean the same thing, but for this article we assume OBO-style ontology development).

When the ontology developer sees inferences like this, they will likely want to take some corrective action:

  • Under one scenario, the inference reveals to the ontology developer that in fact nerve and peripheral nerve are the same concept, and thus the two classes should be merged, with the label from one being retained as the synonym of the other.
  • Under the other scenario, the ontology developer realizes the concept they have been calling ‘Nerve’ encompasses more general neuron projection bundles found in the CNS; here they may decide to rename the concept (e.g. neuron projection bundle) and to eliminate or broaden the part_of axiom.

So far so good. But the challenge here is that an ontology with entailed equivalencies between pairs of classes is formally coherent: all classes are satisfiable, and there are no inconsistencies. It will not be caught by a pipeline that detects incoherencies such as unsatisfiable classes. This means you may end up accidentally releasing an ontology that has potentially serious biological problems. It also means we can’t use the same technique described in part 1 to make a debug module.

Formally we can state this as there being no unique class assumption in OWL. By creating two classes, c1 and c2, you are not saying that there is something that differentiates these, even if it is your intention that they are different.

Within the OBO ecosystem we generally strive to avoid equivalent named classes (the principle of orthogonality). There are known cases where equivalent classes join two ontologies (for example, GO cell and CL cell), but in general, when we find additional entailed pairs of equivalent classes not originally asserted, it’s a problem. I would hypothesize this is frequently true of non-OBO ontologies too.

Detecting unintended equivalencies with ROBOT

For the reasons stated above, ROBOT has configurable behavior for when it encounters equivalent classes. This can be controlled via the --equivalent-classes-allowed (shorthand: -e) option on the reason command. There are 3 options here:

  • none: any entailed equivalence axiom between two named classes will result in an error being thrown
  • all: permit all equivalence axioms, entailed or asserted
  • asserted-only: permit entailed equivalence axioms only if they match an asserted equivalence axiom, otherwise throw an error

If you are unsure of what to do it’s always a good idea to start stringent and pass ‘none’. If it turns out you need to maintain asserted equivalencies (for example, the GO/CL ‘cell’ case), then you can switch to ‘asserted-only’.

The ‘all’ option is generally too permissive for most OBO ontologies. However, for some use cases this may be selected. For example, if your ontology imports multiple non-orthogonal ontologies plus bridging axioms and you are using reasoning to find new equivalence mappings.

For example, on our peripheral nerve ontology, if we run

robot reason -e asserted-only -r elk -i pn.omn

We will get:


ERROR org.obolibrary.robot.ReasonOperation - Only equivalent classes that have been asserted are allowed. Inferred equivalencies are forbidden.
ERROR org.obolibrary.robot.ReasonOperation - Equivalence: <http://example.org/Nerve> == <http://example.org/PeripheralNerve>

ROBOT will also exit with a non-zero exit code, ensuring that your release pipeline fails fast, preventing accidental release of broken ontologies.

Debugging false equivalence

This satisfies the requirement that potentially false equivalence can be detected, but how does the ontology developer debug this?

A typical Standard Operating Procedure might be:

  • IF robot fails with unsatisfiable classes
    • Open ontology in Protege and switch on Elk
    • Go to Inferred Classification
    • Navigate to Nothing
    • For each class under Nothing
      • Select the “?” to get explanations
  • IF robot fails with equivalence class pairs
    • Open ontology in Protege and switch on Elk
    • For each class reported by ROBOT
      • Navigate to class
      • Observe the inferred equivalence axiom (in yellow) and select ?

There are two problems with this SOP, one pragmatic and the other a matter of taste.

The pragmatic issue is that there is a Protege explanation workbench bug that sometimes renders Protege unable to show explanations for equivalence axioms in reasoners such as Elk (see this ticket). This is fairly serious for large ontologies (although for our simple example or for midsize ontologies use of HermiT may be perfectly feasible).

But even in the case where this bug is fixed or circumvented, the SOP above is suboptimal in my opinion. One reason is that it is simply more complicated: in contrast to the SOP for dealing with incoherent classes, it’s necessary to look at reports coming from outside Protege and perform additional search and lookup. The more fundamental reason is that the ontology is formally coherent even though it defies my expectation that it follow the unique class assumption. It is more elegant if we can directly encode the unique class assumption, and have the ontology be entailed to be incoherent when this is violated. That way we don’t have to bolt on additional SOP instructions or ad-hoc programmatic operations.

And crucially, it means the same ‘logic core dump’ operation described in the previous post can be used in exactly the same way.

Approach: SubClassOf means ProperSubClassOf

My approach here is to make explicit the assumption: every time an ontology developer asserts a SubClassOf axiom, they actually mean ProperSubClassOf.

To see exactly what this means, it helps to think in terms of Venn diagrams (Venn diagrams are my go-to strategy for explaining even the basics of OWL semantics). The OWL2 direct semantics are set-theoretic, with every class interpreted as a set, so this is a valid approach. When drawing Venn diagrams, sets are circles, and one circle being enclosed by another denotes subsetting. If circles overlap, this indicates set overlap, and if no overlap is shown the sets are assumed disjoint (have no members in common).

Let’s look at what happens when an ontology developer makes a SubClassOf link between PN and N. They may believe they are saying something like this:

[Venn diagram: PeripheralNerve drawn as a proper subset of Nerve]

i.e. implicitly indicating that there are some nerves that are not peripheral nerves.

But in fact the OWL SubClassOf operator is interpreted set-theoretically as subset-or-equal-to (i.e. ⊆), which can be visually depicted as:

[Venn diagram: PN as a subset of Nerve, with PN = Nerve allowed as one possibility]

In this case our ontology developer wants to exclude the latter as a possibility (even if we could end up with a model in which these two are equivalent, the ontology developer needs to arrive at this conclusion by having the incoherencies in their own internal model revealed).

To make this explicit, there needs to be an additional class declared that (1) is disjoint from PN and (2) is a subtype of Nerve. We can think of this as a ProperSubClassOf axiom, which can be depicted visually as:

[Venn diagram: OtherNerve and PeripheralNerve as disjoint subsets of Nerve]

If we encode this on our test ontology:


ObjectProperty: part_of
Class: PNS
Class: Nerve SubClassOf: part_of some PNS
Class: PeripheralNerve EquivalentTo: Nerve and part_of some PNS
Class: OtherNerve SubClassOf: Nerve DisjointWith: PeripheralNerve

We can see that the ontology is inferred to be incoherent. There is no need for an additional post-hoc check: the generic incoherence detection mechanism of ROBOT does not need any special behavior, and the ontology editor sees all problematic classes in red, and can navigate to all problems by looking under owl:Nothing:

[Screenshot: Protege showing the problematic classes in red under owl:Nothing]

Of course, we don’t want to manually assert this all the time, and litter our ontology with dreaded “OtherFoo” classes. If we can make the assumption that all asserted SubClassOfs are intended to be ProperSubClassOfs, then we can just do this procedurally as part of the ontology validation pipeline.

One way to do this is to inject a sibling for every class-parent pair and assert that the siblings are disjoint.

The following SPARQL will generate the disjoint siblings (if you don’t know SPARQL don’t worry, this can all be hidden for you):


prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
  ?sibClass a owl:Class ;
    owl:disjointWith ?c ;
    rdfs:subClassOf ?p ;
    rdfs:label ?sibLabel
}
WHERE {
  ?c rdfs:subClassOf ?p .
  FILTER(isIRI(?c))
  FILTER(isIRI(?p))
  FILTER NOT EXISTS { ?c owl:deprecated "true"^^xsd:boolean }
  OPTIONAL {
    ?c rdfs:label ?clabel
    BIND(concat("DISJOINT-SIB-OF ", ?clabel) AS ?sibLabel)
  }
  BIND (UUID() as ?sibClass)
}

Note that we exclude deprecated/obsolete classes. The generated disjoint siblings are given a random UUID, and the label DISJOINT-SIB-OF X. You could also opt for the simpler “Other X” as in the above example; it doesn’t matter, since only the ontology developer sees this, and only when debugging.

This can be encoded in a workflow, such that the axioms are injected as part of a test procedure. You likely do not want these axioms to leak out into the release version and confuse people.

Future versions of ROBOT may include a convenience function for doing this, but for now you can do this in your Makefile:


SRC = pn.omn

disjoint_sibs.owl: $(SRC)
	robot relax -i $< query --format ttl -c construct-disjoint-siblings.sparql $@

test.owl: $(SRC) disjoint_sibs.owl
	robot merge -i $< -i disjoint_sibs.owl -o $@

Debugging Ontologies using OWL Reasoning. Part 1: Basics and Disjoint Classes axioms

This is the first part in a series on pragmatic techniques for debugging ontologies. See also part 2

All software developers are familiar with the concept of debugging, a process for finding faults in a program. The term ‘bug’ has been used in engineering since the 19th century, and was used by Grace Hopper to describe a literal bug gumming up the works of the Mark II computer. Since then, debugging and debugging tools have become ubiquitous in computing, and the modern software developer is fortunate enough to have a large array of tools and techniques at their disposal. These include unit tests, assertions and interactive debuggers.

The original bug

Ontology development has many parallels with software development, so it’s reasonable to assume that debugging techniques from software can be carried over to ontologies. I’ve previously written about use of continuous integration in ontology development, and it is now standard to use Travis to check pull requests on ontologies. Of course, there are important differences between software and ontology development. Unlike typical computer programs, ontologies are not executed, so the concept of an interactive debugger stepping through an execution sequence doesn’t quite translate to ontologies. However, there are still a wealth of tooling options for ontology developers, many of which are under-used.

There is a great deal of excellent academic material on the topic of ontology debugging; see for example the 2013 and 2014 proceedings of the excellently named Workshop on Debugging Ontologies and Ontology Mappings (WoDOOM), or the seminal Debugging OWL Ontologies. However, many ontology developers may not be aware of some of the more basic ‘blue collar’ techniques in use for ontology debugging.

Using OWL Reasoning and disjointness axioms to debug ontologies

In my own experience one of the most effective means of finding problems in ontologies is through the use of OWL reasoning. Reasoning is frequently used for automated classification, and this is supported in tools such as ROBOT through the reason command. In addition to classification, reasoning can also be used to debug an ontology, usually by inferring if the ontology is incoherent. The term ‘incoherent’ isn’t a value judgment here; it’s a technical term for an ontology that is either inconsistent or contains unsatisfiable classes, as described in this article by Robert Stevens, Uli Sattler and Phillip Lord.

A reasoner will not find bugs without some help from you, the ontology developer.


You have to impart some of your own knowledge of the domain into the ontology in order for incoherency to be detected. This is usually done by adding axioms that constrain the space of what is possible. The ontogenesis article has a nice example using red blood cells and the ‘only’ construct. I will give another example using the DisjointClasses axiom type; in my experience working on large inter-related ontologies, disjointness axioms are one of the most effective ways of finding bugs (and have the added advantage of being within the profile of OWL understood by Elk).

Let’s take the following example, a slice of an anatomical ontology dealing with cranial nerves. The underlying challenge here is the fact that the second cranial nerve (the optic nerve) is not in fact a nerve, as it is part of the central nervous system (CNS), whereas true nerves are part of the peripheral nervous system (PNS). This seeming inconsistency has plagued different anatomy ontologies.

Ontology: <http://example.org>
Prefix: : <http://example.org/>
ObjectProperty: part_of
Class: CNS
Class: PNS
Class: StructureOfPNS EquivalentTo: part_of some PNS
Class: StructureOfCNS EquivalentTo: part_of some CNS
DisjointClasses: StructureOfPNS, StructureOfCNS
Class: Nerve SubClassOf: part_of some PNS
Class: CranialNerve SubClassOf: Nerve
Class: CranialNerveII SubClassOf: CranialNerve, part_of some CNS

[Figure: graph view of the cranial nerve example, with StructureOfPNS and StructureOfCNS disjoint]

You may have noted this example uses slightly artificial classes of the form “Structure of X”. These are not strictly necessary; we’ll return to this when we discuss Generic Class Inclusion (GCI) axioms in a future part.

If we load this into Protege and switch on the reasoner, we will see that CranialNerveII shows up red, indicating it is unsatisfiable and rendering the ontology incoherent. We can easily find all unsatisfiable classes under the ‘Nothing’ builtin class in the inferred hierarchy view. Clicking on the ‘?’ button will make Protege show an explanation, such as the following:

[Screenshot: Protege explanation for the unsatisfiability of CranialNerveII]

This shows all the axioms that lead to the conclusion that CranialNerveII is unsatisfiable. At least one of these axioms must be wrong (for example, the assumption that all cranial nerves are nerves may be terminologically justified, but could be wrong here; or perhaps it is the assumption that CN II is actually a cranial nerve; or we may simply want to relax the constraint and allow spatial overlap between peripheral and central nervous system parts). The ontology developer can then set about fixing the ontology until it is coherent.

Detecting incoherencies as part of a workflow

Protege provides a nice way of finding ontology incoherencies, and of debugging them by examining explanations. However, it is still possible to accidentally release an incoherent ontology, since the ontology editor is not compelled to check for unsatisfiabilities in Protege prior to saving. It may even be possible for an incoherency to be inadvertently introduced through changes to an upstream dependency, for example, by rebuilding an import module.

Luckily, if you are using ROBOT to manage your release process, then it should be all but impossible for you to accidentally release an incoherent ontology. This is because the robot reason command will throw an error if the ontology is incoherent. If you are using robot as part of a Makefile-based workflow (as configured by the ontology starter kit) then this will block progression to the next step, as ROBOT returns with a non-zero exit code when performing a reasoner operation on an incoherent ontology. Similarly, if you are using Travis-CI to vet pull requests or check the current ontology state, then the travis build will automatically fail if an incoherency is encountered.
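A minimal Travis configuration for such a check might look like this (a sketch; the edit file name is hypothetical, and it assumes ROBOT is already installed on the build machine):

# .travis.yml (sketch)
language: java
script:
  # fails the build, and hence the pull request, if the ontology is incoherent
  - robot reason -r ELK -i my-ont-edit.owl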


ROBOT reason flow diagram. Exiting with system code 0 indicates success, non-zero failure.

Running robot reason on our example ontology yields:

$ robot reason -r ELK -i cranial.omn
ERROR org.obolibrary.robot.ReasonerHelper - There are 1 unsatisfiable classes in the ontology.
ERROR org.obolibrary.robot.ReasonerHelper -     unsatisfiable: http://example.org/CranialNerveII

Generating debug modules – incoherent SLME

Large ontologies can strain the limits of the laptop computers usually used to develop ontologies. It can be useful to make something analogous to a ‘core dump’ in software debugging — a standalone minimal component that can be used to reproduce the bug. This is a module extract (using a standard technique like SLME, syntactic locality module extraction) seeded by all unsatisfiable classes (there may be multiple). This provides sufficient axioms to generate all explanations, plus additional context.

I use the term ‘unsatisfiable module’ for this artefact. This can be done using the robot reason command with the “--debug-unsatisfiable” option.

In our Makefiles we often have a target like this:

debug.owl: my-ont.owl
        robot reason -i  $< -r ELK -D $@

If the ontology is incoherent then “make debug.owl” will make a small-ish standalone file that can be easily shared and quickly loaded in Protege for debugging. The ontology will be self-contained with no imports – however, if the axioms come from different ontologies in an import chain, then each axiom will be annotated with the source ontology, making it easier for you to track down the problematic import. This can be very useful for large ontologies with multiple dependencies, where there may be different versions of the same ontology in different import chains. 

Coming up

The next article will deal with the case of detecting unwanted equivalence axioms in ontologies, and future articles in the series will deal with practical tips on how best to use disjointness axioms and other constraints in your ontologies.

Carry on reading: Part 2, Unintentional Entailed Equivalence


Acknowledgments

Thanks to Nico Matentzoglu for comments on a draft of this post.

Introduction to Protege and OWL for the Planteome project

As a part of the Planteome project, we develop common reference ontologies and applications for plant biology.

Planteome logo

As an initial phase of this work, we are transitioning from editing standalone ontologies in OBO-Edit to integrated ontologies using increased OWL axiomatization and reasoning. In November we held a workshop that brought together plant trait experts from across the world, and developed a plan for integrating multiple species-specific ontologies with a reference trait ontology.

As part of the workshop, we took a tour of some of the fundamentals of OWL, hybrid obo/owl editing using Protege 5, and using reasoners and template-based systems to automate large portions of ontology development.

I based the material on an earlier tutorial prepared for the Gene Ontology editors; it’s available on the Planteome GitHub repository at:

https://github.com/Planteome/protege-tutorial

GO annotation origami: Folding and unfolding class expressions

With the introduction of Gene Association Format (GAF) v2, curators are no longer restricted to pre-composed GO terms – they can use a limited form of anonymous OWL Class Expressions of the form:

GO_Class AND (Rel_1 some V_1) AND (Rel_2 some V_2)

The set of relationships is specified in column 16 of the GAF file.

However, many tools are not capable of using class expressions – they discard the additional information leaving only the pre-composed GO_Class.

Using OWLTools it is possible to translate a GAF-v2 set of associations and an ontology O to an equivalent GAF-v1 set of associations plus an analysis ontology O-ext. The analysis ontology O-ext contains the set of anonymous class expressions folded into named classes, together with equivalence axioms, and pre-reasoned into a hierarchy using Elk.

See http://code.google.com/p/owltools/wiki/AnnotationExtensionFolding

For example, given a GO annotation of a gene ‘geneA’:

gene: geneA
annotation_class:  GO:0006915 ! apoptosis
annotation_extension: occurs_in(CL:0000700) ! dopaminergic neuron

The folding process will generate a class with a non-stable URI, automatic label and equivalence axiom:

Class: GO/TEMP_nnnn
  Annotations: label "apoptosis and occurs_in some dopaminergic neuron"
  EquivalentTo: 'apoptosis' and occurs_in some 'dopaminergic neuron'
  SubClassOf: 'neuron apoptosis'

This class will automatically be placed in the hierarchy using the reasoner (e.g. under ‘neuron apoptosis’). For the reasoning step to achieve optimal results, the go-plus-dev.owl version should be used (see the new GO documentation). A variant of this step is to perform folding to find a more specific subclass than the one used for direct annotation.

The reverse operation – unfolding – is also possible. For optimal results, this relies on Equivalent Classes axioms declared in the ontology, so make sure to use go-plus-dev.owl. Here an annotation to a pre-composed complex term (e.g. neuron apoptosis) is replaced by an annotation to a simpler GO term (e.g. apoptosis) with column 16 filled in (e.g. occurs_in(neuron)).
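Continuing the example above, unfolding would turn an annotation to the pre-composed term back into the extended form (a sketch; GO:nnnnnnn is a placeholder, and CL:0000540 is assumed here to be the CL identifier for neuron):

gene: geneA
annotation_class: GO:nnnnnnn ! neuron apoptosis

becomes:

gene: geneA
annotation_class: GO:0006915 ! apoptosis
annotation_extension: occurs_in(CL:0000540) ! neuron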

The folding operation allows legacy tools to take some advantage of GO annotation extensions by generating an ‘analysis ontology’ (care must be taken in how this is presented to the user, if at all). Ideally more tools will use OWL as the underlying ontology model and be able to handle c16 annotations directly, ultimately requiring less pre-coordination in the GO.

 

Querying for connections between the GO and FMA

Can we query for connections between FMA and GO? This should be possible by using a combination of:

  • GO
  • Uberon
  • FMA
  • Axioms linking GO and Uberon (x-metazoan-anatomy)
  • Axioms linking FMA and Uberon (uberon-to-fma)

This may seem like more components than is necessary. However, remember that GO is a multi-species ontology, and “heart development” in GO covers not only vertebrate hearts, but also (perhaps controversially) drosophila “hearts”. In contrast, the FMA class for “heart” represents a canonical adult human heart. This is why we have to go via Uberon, which covers similar taxonomic territory to GO. The uberon class called “heart” covers all hearts.

GO to metazoan anatomical structures

http://purl.obolibrary.org/obo/go/extensions/x-metazoan-anatomy.owl contains axioms of the form:


'heart morphogenesis' EquivalentTo 'anatomical structure morphogenesis' and 'results in morphogenesis of' some uberon:heart

(note that sub-properties of ‘results in developmental progression of’ are used here)

Generic metazoan anatomy to FMA

http://purl.obolibrary.org/obo/uberon/bridge/uberon-bridge-to-fma.owl contains axioms of the form:


fma:heart EquivalentTo uberon:heart and part_of some 'Homo sapiens'

GO to FMA

Note that there is no existential dependence between go ‘heart development’ and fma:heart. This is as it should be – if there were no human hearts then there would still be heart development processes. This issue is touched on in Chimezie Ogbuji‘s presentation at DILS 2012.

This lack of existential dependence has consequences for querying connections. An OWL query for:

?p SubClassOf ‘results in developmental progression of’ some ?u

Will return GO-Uberon connections only.

We must perform a join in order to get what we want:

?p SubClassOf ‘results in developmental progression of’ some ?u,
?a SubClassOf ?u,
?a part_of some ‘Homo sapiens’

Actually executing this query is not straightforward. Ideally we would have a way of using OWL syntax, such as the above. To get complete results, either EL++ or RL reasoning is required. In the next post I’ll present some possible options for issuing this query.

Elk disjoint hack

Elk is a blindingly fast EL++ reasoner. Unfortunately, it doesn’t yet support the full EL++ profile – in particular it lacks disjointness axioms. This is unfortunate, as these kinds of axioms are incredibly useful for integrity checking. See the methods section of the Uberon paper for some details on how partwise disjointness axioms were created.

However, Elk does support intersection and equivalence. This means we should be able to perform a translation:

DisjointClasses(x1, x2, …, xn) ⇒
EquivalentClasses(owl:Nothing, IntersectionOf(xi, xj)) for all i < j ≤ n

I asked about this on the Elk mailing list – see the thread “Satisfiability checking and DisjointClasses axioms”.

The problem is that whilst Elk supports intersection and equivalence, it doesn’t support Nothing. This means that there may be corner cases in which it doesn’t work.

Proper disjointness support may be coming in the next version of Elk, but it’s been a few months, so I decided to go ahead and implement the above translation in OWLTools (also available in Oort).

If we have an ontology such as foo.owl:

Ontology: <http://example.org/x.owl>

Class: :reasoner
Class: :animal
  DisjointWith: :reasoner

Class: :elk
  SubClassOf: :reasoner, :animal

We can translate it using owltools:

owltools foo.owl --translate-disjoints-to-equivalents -o file://`pwd`/foo-x.owl

Remember, ordering of arguments is significant in owltools – make sure you translate *after* the ontology is loaded.
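The translated output should then contain an axiom along these lines (a sketch in Manchester syntax, following the translation scheme above):

Class: owl:Nothing
  EquivalentTo: :reasoner and :animal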

We can then load this into Protege and reason over it using Elk. As expected, “elk” is unsatisfiable.

You can also do the checking directly in owltools:

owltools foo.owl --translate-disjoints-to-equivalents --run-reasoner -r elk -u

The “-u” option will check for unsatisfiable classes and exit with a nonzero code if any are found, allowing this to be used within a CI system like Jenkins (see this previous post).

You can also use this transform within Oort (command line version only):

ontology-release-runner --translate-disjoints-to-equivalents --reasoner elk foo.owl

Remember, there are corner cases where this translation will not work. Nevertheless, this can be useful as part of an “early warning” system, backed up by slower guaranteed checks running in the background with HermiT or some other reasoner.

Perhaps the ontologies I work with have a simpler structure, but so far I have found this strategy to be successful, identifying subtle part-disjointness problems, and not giving any false positives. There don’t appear to be any scalability problems, with Elk being its usual zippy self even when uberon is loaded with ncbitaxon/taxslim and taxon constraints translated into Nothing-axioms (~3000 disjointness axioms).

 

Taxon constraints in OWL

A number of years ago, the Gene Ontology database included such curiosities as:

  • A slime mold gene that had a function in fin morphogenesis
  • Chicken genes that were involved in lactation

These genes would be pretty fascinating, if they actually existed. Unfortunately, these were all annotation errors, arising from a liberal use of inference by sequence similarity.

We decided to adopt a formalism specified by Wacek Kusnierczyk[1], in which we placed taxon constraints on classes in the ontology, and used these to detect annotation errors[2].

The taxon constraints make use of two relations:

  • only_in_taxon
  • never_in_taxon

You can see examples of usage in GO either in QuickGO (e.g. lactation), or by opening the x-taxon-importer.owl ontology in Protege. This ontology is used in the GO Jenkins environment to detect internal inconsistencies in the ontology.

The same relations are also in use in another multi-species ontology, Uberon[3].

 

In uberon, the constraints are used for ontology consistency checking, and to provide taxon subsets – for example, aves-basic.owl, which excludes classes such as mammary gland, pectoral fin, etc.

Semantics of the shortcut relations

In the Deegan et al paper we described a rule-based procedure for using the taxon constraint relations. This has the advantage of being scalable over large taxon ontologies and large gene association sets. But a better approach is to encode the constraints directly as OWL axioms and use a reasoner. For this we need to choose a particular way of representing a taxonomy.

Both relations make use of a class-based representation of a taxonomy such as ncbitaxon.owl or a subset such as taxslim.owl.

We can treat the taxon constraint relations as convenient shortcut relations which ‘expand’ to OWL axioms that capture the intended semantics in terms of a standard ObjectProperty “in_organism”. For now we leave in_organism undefined, but the basic idea is that for anatomical structures and cell components “in_organism” is the part_of parent that is an organism, whereas for processes it is the organism that encodes the gene products that execute the process.

In fact there are two ways to expand to the “in_organism” class axioms:

The more straightforward way:

?X only_in_taxon ?Y ===> ?X SubClassOf in_organism only ?Y
?X never_in_taxon ?Y ===> ?X SubClassOf in_organism only not ?Y

To achieve the desired entailments, it is necessary for sibling taxa to be declared disjoint (e.g. Eubacteria DisjointWith Eukaryota). Note that these disjointness axioms are not declared in the default NCBITaxon translation.

A different way which has the advantage of staying within the OWL2-EL subset:

?X only_in_taxon ?Y ===> ?X SubClassOf in_organism some ?Y
?X never_in_taxon ?Y ===> ?X DisjointWith in_organism some ?Y

This requires all sibling nodes (A, B) in the NCBI taxonomy to have a General Axiom:

in_organism some ?A DisjointWith in_organism some ?B
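Instantiating this template for the sibling taxa mentioned earlier yields, for example:

in_organism some Eubacteria DisjointWith in_organism some Eukaryota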

These general axioms are automatically generated and available in taxslim-disjoint-over-in-taxon.owl

Taxon groupings

GO also makes use of taxon groupings – these include new classes such as “prokaryotes” which are defined using UnionOf axioms. They are available in go-taxon-groupings.owl.

Taxon modules

One of the uses of taxon constraints is to build taxon-specific subsets of ontologies. This will be covered in a future post.

References

  1. Waclaw Kusnierczyk (2008) Taxonomy-based partitioning of the Gene Ontology, Journal of Biomedical Informatics
  2. Deegan Née Clark, J. I., Dimmer, E. C., and Mungall, C. J. (2010). Formalization of taxon-based constraints to detect inconsistencies in annotation and ontology development. BMC Bioinformatics 11, 530
  3. Mungall, C. J., Torniai, C., Gkoutos, G. V., Lewis, S. E., and Haendel, M. A. (2012) Uberon, an integrative multi-species anatomy ontology Genome Biology 13, R5. http://genomebiology.com/2012/13/1/R5

Ontologies and Continuous Integration

Wikipedia (http://en.wikipedia.org/wiki/Continuous_integration) describes continuous integration as follows:

In software engineering, continuous integration (CI) implements continuous processes of applying quality control — small pieces of effort, applied frequently. Continuous integration aims to improve the quality of software, and to reduce the time taken to deliver it, by replacing the traditional practice of applying quality control after completing all development.

This description could – or should – apply equally well to ontology engineering, especially in contexts such as the OBO Foundry, where ontologies are becoming increasingly interdependent.

Jenkins is a web-based environment for running integration checks. Sebastian Bauer, in Peter Robinson’s group, had the idea of adapting Jenkins for performing ontology builds rather than software builds (in fact he used Hudson, but the differences between Hudson and Jenkins are minimal). He used Oort as the tool to build the ontology — Oort takes in one or more ontologies in obo or owl, runs some non-logical and logical checks (via your choice of reasoner), and then “compiles” downstream ontologies in obo and owl formats. Converting to obo takes care of a number of stylistic checks that are non-logical and wouldn’t be caught by a reasoner (e.g. no class can have more than one text definition).

We took this idea and built our own Jenkins ontology build environment, adding ontologies that were of relevance to the projects we were working on. This turned out to be extraordinarily easy – Jenkins is very easy to install and configure, and help is always just a mouse click away.

Here’s a screenshot of the main Jenkins dashboard. Ontologies have a blue ball if the last build was successful, a red ball otherwise. The weather icon is based on the “outlook” – lots of successful builds in a row gets a sunny icon. Every time an ontology is committed to a repository it triggers a build (we try and track the edit version of the ontology rather than the release version, so that we can provide direct feedback to the ontology developer). Each job can be customized – for example, if ontology A depends on ontology B, you might want to trigger a build of A whenever a new version of B is committed, allowing you to be forewarned if something in B breaks A.

[Screenshot: the main Jenkins dashboard]

Below is a screenshot of the configuration settings for the go-taxon build – this is used to check if there are violations of the GO taxon constraints (dx.doi.org/10.1016/j.jbi.2010.02.002). We also include an external ontology of disjointness axioms (for various reasons it’s hard to include this in the main GO ontology). You can include any shell commands you like – in principle it would be possible to write a Jenkins plugin for building ontologies using Oort, but for now you have to be semi-familiar with the command line and the Oort command line options:

[Screenshot: Jenkins configuration settings for the go-taxon build]

Often when a job fails the Oort output can be a little cryptic – generally the protocol is to do detailed investigation using Protege and a reasoner like HermiT to track down the exact problem.

The basic idea is very simple, but works extremely well in practice. Whilst it’s generally better to have all checks performed directly in the editing environment, this isn’t always possible where multiple interdependent ontologies are concerned. The Jenkins environment we’ve built has proven popular with ontology developers, and we’d be happy to add more ontologies to it. It’s also fairly easy to set up yourself, and I’d recommend doing this for groups developing or using ontologies in a mission-critical way.

UPDATE: 2012-08-07

I uploaded some slides on ontologies and continuous integration to slideshare.

UPDATE: 2012-11-09

The article Continuous Integration of Open Biological Ontology Libraries is available on the Bio-Ontologies SIG KBlog site.

The size of Richard Nixon’s nose, part I

Consider a simple model of Richard Nixon:

Individual: :nixon
Types: :Organism
Facts: :has_part :nixons_nose

Individual: :nixons_nose
Types: :nose
Facts: :has_characteristic :nixons_nose_size

Individual: :nixons_nose_size
Types: :big

[Figure: nixon -has_part-> nixons_nose -has_characteristic-> nixons_nose_size]

Here are the relations in our background ontology:

ObjectProperty: :has_part
Characteristics: Transitive

ObjectProperty: :has_characteristic
InverseOf:
:characteristic_of

ObjectProperty: :characteristic_of
InverseOf:
:has_characteristic

We have 3 entities: Nixon, his nose, and the characteristic or quality that is Richard Nixon’s nose size. We follow BFO here in individuating qualities: thus even if I had a nose of the “same” size as Richard Nixon, we would not share the same nose-size quality instance; we would each have our own unique nose-size quality instance (for a nice treatment, see Neuhaus et al [PDF]).

Now let’s look at a phenotypic label such as “big nose”. Intuitively we can see that this applies to Richard Nixon. But where exactly in this instance graph is the big nose phenotype? Is it the nose, the size, or Richard himself?

Specifically, if we have a phenotype ontology with a term “increased size of nose” or “big nose”, what OWL class expression do we assign as an equivalent? We have to make a decision as to where to root the path through our instance graph. It might be:

  • The nose: ie nose and has_characteristic some big
  • The size:  i.e. big and characteristic_of some nose
  • The organism: i.e. has_part some (nose and has_characteristic some big)
  • some unspecified thing that has a relationship to one of the above

The structure of each OWL class expression can be visualized as a path through the nixon graph.

Our decision affects the classification we get from reasoning. A big nose is part of a funny face, but in contrast a person with a big nose is a subclass of a person with a funny face. If you then put your reasoner results into a phenotype analysis you might get different results.

To an ordinary common sense person whose brain hasn’t been infected by ontologies, the difference between a “a nose that is increased in size” and an “increased size of nose” or a “person with a nose that’s increased in size” is just linguistic fluff, but the distinctions are important from an ontology modeling perspective.

Nevertheless, we may want to formalize the fact that we don’t care about these distinctions – we might want our “big nose” phenotype class to be any of the above.

One way would be to make fugly union classes, but this is tedious.

There is another way. We can introduce a generic “exhibits” relation. We elide a textual definition for now; the idea is that this relation captures the general notion of having a phenotype:

ObjectProperty: :exhibits
SubPropertyChain: :exhibits o :has_part
SubPropertyChain: :exhibits o :has_characteristic
SubPropertyChain: :exhibits o :characteristic_of
Characteristics: Transitive

We make this a super-relation of has_part:

ObjectProperty: :has_part
SubPropertyOf: :exhibits
Characteristics: Transitive

We can see exhibits is very promiscuous – when it connects to other relations, it makes a new exhibits relation.
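Concretely, from the individuals and axioms above, a reasoner should entail (a sketch):

:nixon :exhibits :nixons_nose        # since has_part is a sub-property of exhibits
:nixon :exhibits :nixons_nose_size   # via the chain exhibits o has_characteristic -> exhibits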

Now let’s make some probe classes illustrating the different ways we could define our “don’t care where we root the graph” phenotype:

Class: :test1
EquivalentTo: :exhibits some (:big and :characteristic_of some :nose)

Class: :test2
EquivalentTo: :exhibits some (:has_part some (:nose and :has_characteristic some :big))

Class: :test3
EquivalentTo: :exhibits some (:nose and :has_characteristic some :big)

Class: :test4
EquivalentTo: :has_part some (:nose and :has_characteristic some :big)

After running the reasoner we get the following inferred hierarchy:

-- test1=test3
---- test2
---- test4

So we can see we are collapsing the distinction between “increased size of nose” and “nose that is increased in size” by instead defining a class “exhibiting an increased size of nose”.

If we then try the DL-query tab in Protege, we can see that the individual “nixon” satisfies all of these expressions.

Why is this important? It means we can join and analyze datasets without performing awkward translations. Group 1 can take a quality-centric approach, Group 2 can take an entity-centric approach, the descriptions or data from either of these groups will classify under the common “exhibits phenotype” class.

This works because of the declared inverse between has_characteristic and characteristic_of. Graphically we can think of this as “doubling back”.

Unfortunately, inverses put us outside EL++, so we can’t use the awesome Elk for classification.

Not-caring in ontologies is hard work!

What if we want to care even less, and formally have a “big nose phenotype” class classify either nixon, his nose, or the bigness that inheres in his nose? That’s the subject of the next post, together with some answers to the bigger question of “what is a phenotype”.