Proposed strategy for semantics in RDF* and Property Graphs

Graph databases such as Neo4J are gaining in popularity. These are in many ways comparable to RDF databases (triplestores), but I will highlight three differences:

  1. The underlying data model in most graph databases is a Property Graph (PG). This means that information can be attached directly to edges. In RDF this can only be done indirectly, via reification, reification-like models, or named graphs.
  2. RDF is based on open standards and comes with a standard query language (SPARQL), whereas a unified set of standards has yet to arrive for PGs.
  3. RDF has a formal semantics, and languages such as OWL can be layered on providing more expressive semantics.

RDF* (and its accompanying query language SPARQL*) is an attempt to bring PGs into RDF, thus providing an answer for points 1 and 2. More info can be found in this post by Olaf Hartig.

You can find more info in that post and in related docs, but briefly: RDF* adds syntax for attaching properties directly onto edges, e.g.

<<:bob foaf:friendOf :alice>> ex:certainty 0.9 .

This has a natural visual cognate:

[Figure: the statement above drawn as an edge from :bob to :alice, with ex:certainty 0.9 attached to the edge]

We can easily imagine building this out into a large graph of friend-of connections, or connecting other kinds of nodes, and keeping additional useful information on the edges.

But what about the 3rd item, semantics?

What about semantics?

For many in both linked data/RDF and in graph database/PG camps, this is perceived as a minor concern. In fact you can often find RDF people whinging about OWL being too complex or some such. The “semantic web” has even been rebranded as “linked data”. But in fact, in the life sciences many of us have found OWL to be incredibly useful, and being able to clearly state what your graphs mean has clear advantages.

OK, but then why not just use what we have already? OWL-DL already has a mapping to RDF, and any document in RDF is automatically an RDF* document, so problem solved?

Not quite. There are two issues with continuing the status quo in the world of RDF* and PGs:

  1. The mapping of OWL to RDF can be incredibly verbose and leads to unintuitive graphs that inhibit effective computation.
  2. OWL is not the only fruit. It is great for the use cases it was designed for, but there are other modes of inference and other frameworks beyond first-order logic that people care about.

Issues with existing OWL to RDF mapping

Let’s face it, the existing mapping is pretty ugly. This is especially true for life-science ontologies, which are typically conceived of as relational graphs, where edges are formally SubClassOf-SomeValuesFrom axioms. See the post on obo json for more discussion of this. The basic idea here is that in OWL, object properties connect individuals (e.g. my left thumb is connected to my left hand via part-of). In contrast, classes are not connected directly via object properties; rather, they are related via subClassOf and class expressions. It is not meaningful in OWL to say “finger (class) part_of hand (class)”. Instead we seek to say “all instances of finger are part_of some x, where x is an instance of a hand”. In Manchester syntax this has the compact form

Finger SubClassOf Part_of some Hand

This is translated to RDF as:

:Finger rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty :part_of ;
    owl:someValuesFrom :Hand
] .

As an example, consider 3 classes in an anatomy ontology: finger, hand, and forelimb, all connected via part-of (i.e. every finger is part of some hand, and every hand is part of some forelimb). This looks sensible when we use a native OWL syntax, but when we encode it as RDF we get a monstrosity:

[Figure 2: (A) two axioms written in Manchester syntax describing the anatomical relationships between three structures; (B) the corresponding RDF following the official OWL-to-RDF mapping, with four triples per existential axiom and the introduction of two blank nodes; (C) how the axioms are conceptualized by ontology developers and domain experts, and how most browsers render them. The disconnect between B and C is an enduring source of confusion.]

This ugliness was not the result of some kind of perverse decision by the designers of the OWL specs, it’s a necessary consequence of the existing stack which bottoms out at triples as the atomic semantic unit.

In fact, in practice many people employ some kind of simplification, bypassing the official mapping and storing the edges as simple triples, even though this is semantically invalid. We can see this for example in how Wikidata loads OBOs into its triplestore. This can cause confusion: for example, WD stores reciprocal inverse axioms (e.g. part-of, has-part) even though these are meaningless when collapsed to simple triples.

I would argue that when we say we are using a graph-based formalism, there is an implicit contract: the structures in our model should correspond to the kinds of graphs we draw on whiteboards when representing an ontology or knowledge graph, and to the kinds of graphs that are useful for computation. The current mapping violates that implicit contract, usually causing a lot of confusion.

It has pragmatic implications too. Writing a SPARQL query that traverses a graph like the one in (B), following certain edge types but not others (one of the most common uses of ontologies in bioinformatics), is a horrendous task!
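To make this concrete, here is a sketch of the traversal over each representation, using the classes from the figure (prefix declarations for the example namespace are assumed):

PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Over the simple triples in (C): everything transitively part of a forelimb
SELECT ?x WHERE { ?x :part_of+ :forelimb }

# Over the official mapping in (B): each hop threads through a blank node,
# and a SPARQL 1.1 property path cannot check owl:onProperty along the way,
# so this follows *all* existential edges, not just the part_of ones:
SELECT ?x WHERE { ?x (rdfs:subClassOf/owl:someValuesFrom)+ :forelimb }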

OWL is not the only knowledge representation language

The other reason not to stick with the status quo for semantics for RDF* and PGs is that we may want to go beyond OWL.

OWL is fantastic for the things it was designed for. In the life sciences, it is vital for automatic classification and semantic validation of large ontologies (see half of the posts in this blog site). It is incredibly useful for checking the biological validity of complex instance graphs against our encoded knowledge of the world.

However, not everything we want to say in a Knowledge Graph (KG) can be stated directly in OWL. OWL-DL is based on a fragment of first order logic (FOL); there are certainly things not in that fragment that are useful, but often we have to go outside strict FOL altogether. Much of biological knowledge is contextual and probabilistic. A lot of what we want to say is quantitative rather than qualitative.

For example, when relating a disease to a phenotype (both of which are conventionally modeled as classes, and thus not directly linkable via a property in OWL), it is usually false to say “every person with this disease has this phenotype“. We can invent all kinds of fudges for this – BFO has the concept of a disposition, but this is just a hack for not being able to state probabilistic or quantitative knowledge more directly.

A proposed path forward for semantics in Property Graphs and RDF*

RDF* provides us with an astoundingly obvious way to encode at least some fragment of OWL in a more intuitive way that preserves the graph-like natural encoding of knowledge. Rather than introduce additional blank nodes as in the current OWL to RDF mapping, we simply push the semantics onto the edge label!

Here is an example of how this might look for the axioms in the figure above, in RDF*:

<<:finger :part-of :hand>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
<<:hand :part-of :forelimb>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .

I am assuming the existence of a vocabulary called owlstar here – more on that in a moment.

In any native visualization of RDF* this will end up looking like Fig 2C, with the semantics adorning the edges where they belong. For example:

[Figure: proposed owlstar mapping of an OWL subclass restriction. This is clearly simpler than the corresponding graph fragment in 2B. While the edge properties (in square brackets) may be too abstract to show an end user (or even a bioinformatician performing graph-theoretic operations), the core edge is meaningful and corresponds to how an anatomist or ordinary person might think of the relationship.]

Maybe this is all pretty obvious, and many people loading bio-ontologies into either Neo4j or RDF end up treating edges as edges anyway. You can see the mapping we use in our SciGraph Neo4J OWL Loader, which is used by both Monarch Initiative and NIF Standard projects. The OLS Neo4J representation is similar. Pretty much anyone who has loaded the GO into a graph database has done the same thing, ignoring the OWL to RDF mapping. The same goes for the current wave of Knowledge Graph embedding based machine learning approaches, which typically embed a simpler graphical representation.

So problem solved? Unfortunately, everyone is doing this differently, and is essentially throwing out OWL altogether. We lack a standard way to map OWL into Property Graphs, so everyone invents their own. This is also true for people using RDF stores: people often have their own custom OWL mapping that is less verbose. In some cases this is semantically dubious, as is the case for the Wikidata mapping.

The simple thing is for everyone to rally around a common standard mapping, and RDF* seems a good foundation. Even if you are using plain RDF, you could follow this standard and choose to map edge properties to reified nodes, to named graphs, or to the Wikidata model. And if you are using a graph database like Neo4j, there is a straightforward mapping to edge properties.
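For example, in plain RDF the first owlstar statement above could be expressed via standard RDF reification while keeping the same vocabulary (a sketch):

:ax1 a rdf:Statement ;
    rdf:subject :finger ;
    rdf:predicate :part-of ;
    rdf:object :hand ;
    owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .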

I will call this mapping OWL*, and it may look something like this:

RDF* | OWL interpretation
<<?c ?p ?d>> owlstar:hasInterpretation owlstar:subClassOfSomeValuesFrom | ?c SubClassOf ?p some ?d
<<?c ?p ?d>> owlstar:hasInterpretation owlstar:subClassOfQCR ; owlstar:cardinality ?n | ?c SubClassOf ?p exactly ?n ?d
<<?c ?p ?d>> owlstar:hasInterpretation owlstar:subClassOfSomeValuesFrom ; owlstar:subjectContextProperty ?cp ; owlstar:subjectContextFiller ?cf | (?c and ?cp some ?cf) SubClassOf ?p some ?d

Note that the core of each of these mappings is a single edge/triple between class ?c and class ?d, with edge label ?p. The first row is a standard existential restriction, common to many ontologies. The second row is for statements such as ‘hand has part 5 fingers’, which is still essentially a link between a hand concept and a finger concept. The third is for a GCI (general class inclusion axiom), an advanced OWL construct which turns out to be quite intuitive and useful at the graph level, where we are essentially contextualizing the statement: e.g. in developmentally normal adult humans (context), a hand has-part 5 fingers.
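To make rows 2 and 3 concrete, here is how those two examples might be written in RDF* (a sketch; the class :NormalAdultHuman and the context property choice are purely illustrative):

# row 2: 'hand has part 5 fingers'
<<:hand :has-part :finger>> owlstar:hasInterpretation owlstar:subClassOfQCR ;
    owlstar:cardinality 5 .

# row 3, a GCI: 'in developmentally normal adult humans,
# the hand has a part that is a finger'
<<:hand :has-part :finger>> owlstar:hasInterpretation owlstar:subClassOfSomeValuesFrom ;
    owlstar:subjectContextProperty :part-of ;
    owlstar:subjectContextFiller :NormalAdultHuman .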

When it comes to a complete encoding of all of OWL there may be decisions to be made as to when to introduce blank nodes versus cramming as much as possible into edge properties (e.g. for logical definitions), but even having a standard way of encoding subclassing plus quantified restrictions would be a huge boon.

Bonus: Explicit deferral of semantics where required

Many biological relationships expressed in natural language in forms such as “Lmo-2 binds to Elf-2” or “crocodiles eat wildebeest” can cause formal logical modelers a great deal of trouble. See for example “Lmo-2 interacts with Elf-2” - On the Meaning of Common Statements in Biomedical Literature (also slides), which lays out the different ways these seemingly straightforward statements about classes can be modeled. This is a very impressive and rigorous work (I will have more to say on how this aligns with GO-CAM in a future post), and it ends with an impressive Wall of Logic:

[Figure: dense logical axioms proposed by Schulz & Jansen for representing biological interactions]

This is all well and good, but when it comes to storing the biological knowledge in a database, the majority of developers are going to expect to see this:

[Figure: a protein interaction represented as a single edge connecting two nodes, as represented in every protein interaction database]

And this is not due to some kind of semantic laziness on their part: representing biological interactions using this graphical formalism (whether we are representing molecular interactions or ecological interactions) allows us to take advantage of powerful graph-theoretic algorithms for analyzing data, which are frankly much more useful than what we can do with a dense FOL representation.

I am sure this fact is not lost on the authors of the paper who might even regard this as somewhat trivial, but the point is that right now we don’t have a standard way of serializing more complex semantic expressions into the right graphs. Instead we have two siloed groups, one from a formal perspective producing complex graphs with precise semantics, and the other producing useful graphs with no encoding of semantics.

RDF* gives us the perfect foundation for being able to directly represent the intuitive biological statement in a way that is immediately computationally useful, and to adorn the edges with additional triples that more precisely state the desired semantics, whether it is using the Schulz FOL or something simpler (for example, a simple some-some statement is logically valid, if inferentially weak here).
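For instance, the single interaction edge could carry its intended (weak) semantics directly, using a hypothetical owlstar term, not defined above, for the some-some reading:

<<:Lmo-2 :binds-to :Elf-2>> owlstar:hasInterpretation owlstar:SomeSomeValuesFrom .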

Beyond FOL

There is no reason to have a single standard for specifying semantics for RDF* and PGs. As hinted in the initial example, there could be a vocabulary or series of vocabularies for making probabilistic assertions, either as simple assignments of probabilities or frequencies, e.g.

<<:RhinovirusInfection :has-symptom :RunnyNose>> probstar:hasFrequency 0.75 .

or more complex statements involving conditional probabilities between multiple nodes (e.g. probability of symptom given disease and age of patient), allowing encoding of ontological Bayesian networks and Markov networks.
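A sketch of what the conditional case might look like, with invented probstar terms (purely illustrative):

# P(runny nose | rhinovirus infection, pediatric patient) -- illustrative only
<<:RhinovirusInfection :has-symptom :RunnyNose>>
    probstar:hasConditionalFrequency [
        probstar:given :PediatricPatient ;
        probstar:frequency 0.85
    ] .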

We could also represent contextual knowledge, using a ‘that’ construct borrowed from IKL:

<<:clark_kent owl:sameAs :superman>> a ikl:that ; :believed-by :lois_lane .

which could be visually represented as:

[Figure: Lois Lane believes Clark Kent is Superman. Here an edge has a link to another node rather than simply to literals. Note that while this is possible in RDF*, in some graph databases such as Neo4j edge properties cannot point directly to nodes, only indirectly through key properties. In other hypergraph-based graph DBs a direct link is possible.]

Proposed Approach

What I propose is a series of lightweight vocabularies such as my proposed OWL*, accompanied by mapping tables such as the one above. I am not sure if W3C is the correct approach, or something more bottom-up. These would work directly in concert with RDF*, and extensions could easily be provided to work with the various ways to PG-ify RDF, e.g. reification, the Wikidata model, or named graphs.

The same standard could work for any PG database such as Neo4j. Of course, here we have the challenge of how best to encode IRIs in a framework that does not natively support them, but this is an orthogonal problem.

All of this would be non-invasive and unobtrusive to people already working with these technologies, as the underlying structures used to encode knowledge would likely not change, beyond additional adornment of edges. A perfect stealth standard!

It would help to have some basic tooling around this. I think the following would be straightforward and potentially very useful:

  • Implementation of the OWL* mapping of existing OWL documents to RDF* in tooling – maybe the OWLAPI, although we are increasingly looking to Python for our tooling (stay tuned to hear more on funowl).
  • This could also bypass RDF* and go directly to some PG representation, e.g. networkx in Python, or store directly into Neo4j (see the sketch after this list).
  • Some kind of RDF*-to-Neo4j and SPARQL*-to-OpenCypher mapping [which I assume will happen independently of anything proposed here]
  • An OWL-RL* reasoner that could demonstrate simple yet powerful and useful queries, e.g. property chaining in Wikidata
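As a rough illustration of the second bullet, here is a minimal Python sketch (not the SciGraph or OLS implementation) that uses rdflib and networkx to turn SubClassOf-someValuesFrom axioms into single edges carrying an owlstar-style interpretation property; the input file name is hypothetical:

import networkx as nx
import rdflib
from rdflib.namespace import OWL, RDF, RDFS

g = rdflib.Graph()
g.parse("anatomy.owl")  # hypothetical input ontology in RDF/XML

pg = nx.MultiDiGraph()
for c, _, restr in g.triples((None, RDFS.subClassOf, None)):
    # only convert subClassOf-someValuesFrom axioms into direct edges
    if (restr, RDF.type, OWL.Restriction) in g:
        p = g.value(restr, OWL.onProperty)
        d = g.value(restr, OWL.someValuesFrom)
        if p is not None and d is not None:
            pg.add_edge(str(c), str(d), label=str(p),
                        interpretation="owlstar:SubClassOfSomeValuesFrom")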

A rough sketch of this approach was posted on public-owl-dev to not much fanfare, but, umm, this may not be the right forum for this.

Glossing over the details

For a post about semantics, I am glossing over the semantics a bit, at least from a formal computer science perspective. Yes, of course, there are some difficult details to be worked out regarding the extent to which existing RDF semantics can be layered on, and how to make these proposed layers compatible. I’m omitting details here to try and give as simple an overview as possible. And it also has to be said, one has to be pragmatic here. People are already making property graphs and RDF graphs conforming to the simple structures I’m describing here. Just look at Wikidata and how it handles (or rather, ignores) OWL. I’m just the messenger here, not some semantic anarchist trying to blow things up. Rather than worrying about whether such and such a fragment of FOL is decidable (which, let’s face it, is not that useful a property in practice), let’s instead focus on coming up with pragmatic standards that are compatible with the way people are already using technology!


OntoTip: Write simple, concise, clear, operational textual definitions

This is a post in a series of tips on ontology development, see the parent post for more details.

Ontologies contain both textual definitions (aimed primarily at humans) and logical definitions (aimed primarily at machines). There is broad agreement that textual definitions are highly important (they are an OBO principle), and the utility of logical definitions has been shown for both ontology creation/maintenance (see previous post) as well as for analytic applications. However, there has been insufficient attention paid to the crafting of definitions, and to addressing questions such as how textual and logical definitions inter-relate, leading to a lot of inconsistent practice across OBO ontologies. 

[Figure: text definitions are for consumption by biocurators and domain scientists; logical definitions are for machines. The logical definition here is shown in OWL Manchester syntax, with units written as human-readable labels in quotes. Note the correspondence between the logical and textual definitions.]

Two people who have thought deeply about this are Selja Seppälä and Alan Ruttenberg. They organized the  2016 International Workshop on Definitions in Ontologies (IWOOD 2016), and I will lift a quote directly from the website here:

Definitions of terms in ontologies serve a number of purposes. For example, logical definitions allow reasoners to assist in and verify classification, lessening the development burden and enabling expressive queries. Natural language (text) definitions allow humans to understand the meaning of classes, and can help ameliorate low inter-annotator agreement. Good definitions allow for non-experts and experts in adjacent disciplines to understand unfamiliar terms making it possible to confidently use terms from external ontologies, facilitating data integration. 

Despite the importance of definitions in ontologies, developers often have little if any training in writing definitions and axioms, as shown in Selja Seppälä and Alan Ruttenberg, Survey on defining practices in ontologies: Report, July 2013. This leads to varying definition practices and inconsistent definition quality. Worse, textual and logical definitions are often left out of ontologies altogether. 

I would also state that poorly constructed textual definitions can have severe long term ramifications. They can introduce cryptic ambiguities or misunderstandings that may not be uncovered for years, at which point they necessitate expensive ontology repair and re-curation efforts. My intent in this post is not to try and impose my own stylistic quirks on everyone else, but to improve the quality of engineering in ontologies, and to improve the lives of curators using definitions for their daily work.

There is an excellent follow-up paper Guidelines for writing definitions in ontologies by Seppälä, Smith, and Ruttenberg (henceforth referred to as the SRS paper), which should be required reading for anyone who is involved in building ontologies. The authors provide a series of guidelines based on their combined ontology development expertise and empirical work on surveying usage and attitudes.

While there is potentially an element of personal taste and stylistic preference in crafting text, I think that their guidelines are eminently sensible and deserve further exposure and adoption. I recommend reading the full paper. Here I will look at a subset of these guidelines, and give my own informal take on them. In their paper, SRS use a numbering system for their guidelines. I prefix their numbering with S, and will go through the guidelines in a different order.

I have transcribed the guidelines to a table here (I discuss a subset of them in more detail below):

S1 Conform to conventions
S1.1 Harmonize definitions
S2 Principles of good practice
S3 Use the genus differentia form
S3.1 Include exactly one genus
S3.1.1 Use the genus proximus
S3.1.2 Avoid plurals
S3.1.3 Avoid conjunctions and disjunctions
S3.1.4 Avoid categorizers
S4 Avoid use/mention confusion
S5 Include necessary, and whenever possible, jointly sufficient conditions
S5.1 Avoid encyclopedia information
S5.2 Avoid negative terms
S5.3 Avoid definitions by extension
S6 Adjust the scope
S6.1 Definition should be neither too broad nor too narrow
S6.2 Define only one thing with a single textual definition
S7 Avoid circularity
S8 Include jointly satisfiable features
S9 Use appropriate degree of generality
S9.1 Avoid generalizing expressions
S9.2 Avoid examples and lists
S9.3 Avoid indexical and deictic terms
S9.4 Avoid subjective and evaluative statements
S10 Define abbreviations and acronyms
S11 Match text and logical definitions
S11.1 Proofread definitions

Concisely state necessary and sufficient conditions, cut the chit-chat

[Image: The Clash, “Cut the Crap”. Listen to The Clash: cut the c**p.]

Combining S6.1 “A definition should be neither too broad nor too narrow” with S9.4 “avoid subjective and evaluative statements”, I would choose to emphasize that textual definitions should concisely encapsulate necessary and sufficient conditions, avoiding weasel words, irrelevant verbiage, chit-chat and random blethering. This makes it easier for a reader to home in on the intended meaning of the class. It also encourages a standard style (S1), which can make it easier for others to write definitions when creating new classes. It also makes it easier to be consistent with the logical definition, when provided (S11; see below).

SRS provide this example under S9.4:

cranberry bean: Also called shell bean or shellout, and known as borlotti bean in Italy, the cranberry bean has a large, knobby beige pod splotched with red. The beans inside are cream-colored with red streaks and have a delicious nutlike flavor. Cranberry beans must be shelled before cooking. Heat diminishes their beautiful red color. They’re available fresh in the summer and dried throughout the year (FOODON_03411186)

While this text contains potentially useful information, it is not a good operational definition: it lacks easy-to-apply, objective criteria for determining what is and what is not a member of this class.

If you need to include discursive text, use either the definition gloss or a separate description field. The ‘gloss’ is the part of the text definition that comes after the first period/full-stop. A common practice in the GO is to recapitulate the definition of the differentia in the gloss. For example, the definition for ‘ectoderm development’ is

“The process whose specific outcome is the progression of the ectoderm over time, from its formation to the mature structure. In animal embryos, the ectoderm is the outer germ layer of the embryo, formed during gastrulation.”

(the embedded ‘ectoderm’ definition is the second sentence)

This suffers from some problems: it violates DRY (Don’t Repeat Yourself), since if the wording of the definition of ectoderm changes, then the wording of the definition of ‘ectoderm development’ must change with it. However, it provides utility, as users do not have to traverse the elements of the OWL definition to get the bigger picture. It is marginally easier to semi-automatically update the gloss than to deal with the situation where the redundant information permeates the core text definition.

When the conventions for a particular ontology allow for gloss, it is important to be consistent about how this is used, and to include only necessary and sufficient conditions before the period. Recently in GO we were puzzling over what was included and excluded in the following definition:

An apical plasma membrane part that forms a narrow enfolded luminal membrane channel, lined with numerous microvilli, that appears to extend into the cytoplasm of the cell. A specialized network of intracellular canaliculi is a characteristic feature of parietal cells of the gastric mucosa in vertebrates

It is not clear if parietal cells are included as an exemplar, or if this is intended as a necessary condition. S5.1 “avoid encyclopedic information” is excellent advice. This recommends putting examples of usage in a dedicated field. Unfortunately the practice of including examples in definitions is common because many curation tools limit which fields are shown, and examples can help curators immensely. I would therefore compromise on this advice and say that IF examples are to be included in the definition field, THEN this MUST be included in the gloss (after the necessary and sufficient conditions, separated by a period), AND it should be clearly indicated as an example. GO uses the string “An example of this {process,component,…} is found in …” to indicate an example.

Genus-differentia definitions are your friend (S3)


In the introduction, SRS define a ‘classic definition’ as one following the genus-differentia style, i.e. “a G that D”. The precise lexical structure can be modified for readability, but the important part is to state differentiating characteristics relative to a generic superclass.

The example in the paper is the Uberon definition of skeletal ligament: “Dense regular connective tissue connecting two or more adjacent skeletal elements”. Here the genus is “dense regular connective tissue” (which should be the name of a superclass in the ontology, though not necessarily the direct parent post-reasoning) and the differentiating characteristic is the property of “connecting two or more adjacent skeletal elements” (which is also expressed via relationships in the ontology). As it happens, this definition violates one of the other principles, as we shall see later.
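In Manchester syntax, a corresponding logical definition might look roughly like this (illustrative only; ‘connects’ is an assumed relation name, not necessarily the exact Uberon axiom):

'skeletal ligament' EquivalentTo
    'dense regular connective tissue' and (connects some 'skeletal element')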

I agree enthusiastically with S3 “Use the genus-differentia form”. (Note that this should not be confused with elevating single inheritance to a desired property of released ontologies; see this post.)

The genus-differentia definition should be both necessary (i.e. the genus and the characteristics hold for all instances of the class) and sufficient (i.e. anything that satisfies the genus and characteristics must be an instance of the class).

Genus-differentia definitions encourage modularity and reuse. We can construct an ontology in a modular fashion, reusing simpler concepts to fashion more complex concepts.

Genus-differentia form is an excellent way to ensure definitions are operational. The set of all genus-differentia definitions forms a decision tree: we can work up or down the tree to determine whether an observation falls into an ontology class.

I also agree with S3.1 “include exactly one genus”. SRS give the example in OBI of

recombinant vector: “A recombinant vector is created by a recombinant vector cloning process”

which omits a genus (it could be argued that a more serious issue is the practice of defining an object in terms of its creation process rather than vice versa).

In fact, omission of a genus is often observed in logical definitions too, and is usually the result of an error, and will give unintended results in reasoning. I chose the following example from CLO (reported here):

http://purl.obolibrary.org/obo/CLO_0000266 immortal uterine cervix-derived cell line cell

This is wrong because a reasoner will classify anything that comes from a cervix as being a cell line!
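In simplified Manchester syntax, the reported axiom had roughly this shape (relation and class names approximated). Missing the genus, so anything derived from a cervix satisfies it:

'immortal uterine cervix-derived cell line cell' EquivalentTo
    derives_from some 'uterine cervix'

With the genus included:

'immortal uterine cervix-derived cell line cell' EquivalentTo
    'immortal cell line cell' and (derives_from some 'uterine cervix')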

In a rare disagreement with SRS, I have a slight issue with S3.1.1 “use the genus proximus”, i.e. use the closest parent term, but I cover this in a future post. Using the closest parent can lead to redundancy and violations of DRY. 

Avoid indexicals (S9.3)

Quoting SRS’ wording for S9.3:

Avoid indexical and deictic terms, such as ‘today’, ‘here’, and ‘this’ when they refer to (the context of ) the author of the definition or the resource itself. Such expressions often indicate the presence of a non-defining feature or a case of use/mention confusion. Most of the times, the definition can be edited and rephrased in a more general way

Here is a bad disease definition for a fictional disease (adapted from a real example): “A recently discovered disease that affects the anterior diplodocus organ…”. Don’t write definitions like this. This is obviously bad as it will become outdated and your ontology will look sad. If the date of discovery is important, include an annotation assertion for date of discovery (or better yet, a field for originating publication, which entails a date). But it’s more likely this is unnecessary verbiage that detracts from the business of precisely communicating the meaning of the class (S9.4).

Conform to conventions (S1)

As well as following natural language conventions and conventions of the domain of the ontology, it’s good to follow conventions, if not across ontologies, at least within the same ontology.

Do not replicate the name of the class in the definition

An example is a hypothetical definition for ‘nucleus’

A nucleus is a membrane-bounded organelle that …

This violates DRY and is not robust to changes in the name. Under S1.1 this is stated as “limiting the definition to the definiens”, alternatively stated as “avoid including the definiendum and copula”. If you really must include the name (definiendum), do this consistently throughout the ontology rather than ad hoc. But I strongly recommend not to, and to start the text of the definition with the string “A <genus> that …”.

Here is another bad made-up definition for a fictional disease (based on real examples):

“Spattergroit (also known as contagious purple pustulitis) is a highly contagious disease caused by…”

Including a synonym in the definition violates DRY, and will lead to inconsistency if the synonym becomes a class in its own right. Remember, we are not writing encyclopedic descriptions, but ontology definitions. Information such as synonyms can go in dedicated fields (where they can be used computationally, and presented appropriately to the user).

S11 Match Textual and Logical Definitions

The OWL definition (aka logical definition, aka equivalence axiom), when it exists, should correspond in some broad sense to the text definition. This does not mean that it should be a literal transcription of the OWL. On the contrary, you should always avoid strange computerese conventions in text intended for humans (this includes the use of IDs in text, connecting_words_with_underscoresOrCamelCase, use of odd characters, as well as strange unwieldy grammatical forms; see S1). It does mean that if your OWL definition veers wildly from your text then you have a bad smell you need to get rid of before visitors come around.

If your OWL definition doesn’t match your text definition, it is often a sign you are writing overly clever complex Boolean logic OWL definitions that don’t correspond to how domain scientists think about the class [covered in a future post]. Or maybe you are over-axiomatizing, and you should drop your equivalence axiom since on examination it’s not right (see the over-axiomatizing principle).

SRS provide one positive example, but no negative examples. The positive example is from IDO:

[Figure: positive example from IDO. bacteremia: “An infection that has as part bacteria located in the blood.” This matches the logical definition: infection and (has_part some ('infectious agent' and Bacteria and (located_in some blood)))]

Unfortunately, there are many cases where text and logical definitions deviate. An example reported for OBI is oral administration:

“The administration of a substance into the mouth of an organism”

The text definition above is considerably different from the logical one:

EquivalentTo (realizes some 'material to be added role') and (realizes some ('target of material addition role' and ('role of' some mouth)))

Use of DOSDPs (Dead Simple OWL Design Patterns) can help here, as a standard textual definition form can be generated for classes with OWL definitions. One thing that would be useful would be a tool that could help spot cases where the text definition and logical definition have diverged widely.

Summary

I was able to write this post by cribbing from the SRS paper (Seppälä et al), which I strongly recommend reading. Even if you don’t agree with everything in either the paper or my own take, I think it’s important for the ontology community to discuss some of these guidelines and reach some kind of consensus on which principles to apply when.

Of course, there will always be an element of subjectivity and stylistic preference that will be harder to agree on. When making recommendations here there is the danger of being perceived as the ‘ontology police’. But I think there is a core set of common-sense principles that help with making ontologies more usable, consistent, and maintainable. My own experience strongly suggests that when this advice is not heeded, we end up with costly misannotation due to differing interpretations of terms, and many other issues.

I would like OBO to play more of a role in the process of coming up with these guidelines, and on evaluating their usage in existing ontologies. Stay tuned for more on this, and please provide feedback on what you think!

OntoTip: Learn the Rector Normalization technique

This is a post in a series of tips on ontology development, see the parent post for more details.

(Note there is an excellent introduction to this topic in the ontogenesis blog)

The 2003 paper Modularisation of Domain Ontologies Implemented in Description Logics and related formalisms including OWL by Alan Rector lays out in very clear terms a simple methodology for building and maintaining compositional ontologies in a modular and maintainable fashion. From the introduction, the paper “concentrates specifically on the engineering issues of robust modular implementation in logic based formalisms such as OWL”.

Anyone involved with the authoring of ontologies should read this paper, and should strive to build modular, normalized ontologies from the outset.

The motivation for the paper is the observation that when ontologies grow beyond a certain size, they become increasingly hard to maintain, because polyhierarchies (i.e. hierarchies where classes can have more than one parent) become increasingly “tangled”, leading to errors and high maintenance cost. This observation was based on medical ontologies such as GALEN and SNOMED, but at the time the paper came out it was already true for many OBO ontologies such as the GO, as well as various phenotype and trait ontologies. One property all these ontologies share is their ‘compositional nature’, where more complex concepts are built up from more basic ones.

[Figure: example of a difficult-to-maintain, tangled polyhierarchy, taken from the Drosophila anatomy ontology. Figure from “OBO to OWL” slides by David Osumi-Sutherland.]

The methodology for “untangling” these is to decompose the domain ontology into  simpler (“primitive”) ontologies, which can then be recombined using logical definitions and other axioms, and to infer the polyhierarchy using reasoning. Note that for end-users the is-a structure of the ontology remains the same. However, for the ontology maintainers, maintenance cost is much lower. This is illustrated in the paper which demonstrates the methodology using an example chemical entity hierarchy, see figure below:

[Figure: Rector 2003, Figure 1. Original tangled polyhierarchy on the left (multiple parents indicated with “^”). Normalized “primitive skeleton” trees at top left, logical axioms at bottom right. The three bars mean “equivalent to”; these are logical definitions providing necessary and sufficient conditions. The arrows indicate subClassOf, i.e. necessary conditions. The original hierarchy can be entirely recreated from the skeleton taxonomies and domain axioms through the use of a reasoner.]

 

Rector calls this approach implementation normalization. The concept of database normalization should be familiar to anyone who has had to create or maintain relational database schemas (one example of how patterns from software and database engineering translate to construction of ontologies; see previous post). 

 

From the paper:

The fundamental goal of implementation normalisation is to achieve explicitness and modularity in the domain ontology in order to support re-use, maintainability and evolution. These goals are only possible if:

  • The modules to be re-used can be identified and separated from the whole
  • Maintenance can be split amongst authors who can work independently
  • Modules can evolve independently and new modules be added with minimal side effects
  • The differences between different categories of information are represented explicitly both for human authors’ understanding and for formal machine inference.

Rector describes five features of ontology languages that are needed to support normalized design:

  • Primitive concepts described by necessary conditions
  • Defined concepts defined by necessary & sufficient conditions
  • Properties which relate concepts and can themselves be placed in a subsumption hierarchy.
  • Restrictions constructed as quantified role-concept pairs, e.g. (restriction hasLocation someValuesFrom Leg), meaning “located in some leg”.
  • Axioms which declare concepts either to be disjoint or to imply other concepts.

Some of the terms may seem unfamiliar due to terminological drift in the ontology world. Here ‘concepts’ are classes or terms; ‘necessary and sufficient conditions’ are sometimes called ‘logical definitions’, or equivalence axioms (represented using ‘intersection_of’ in OBO syntax); ‘properties’ are relations (ObjectProperties in OWL); and quantified role-concept pairs are just simple relational class expressions (or simply “relationships” to those coming from OBO-Edit).

The constructs described here are exactly the ones now used in ontologies such as the Gene Ontology and Phenotype Ontologies for construction of logical definitions (see 1, 2, 3). These ontologies have undergone (or are still undergoing) a process of ‘de-tangling’. Many ontologies now employ a prospective normalized development process, where classes are logically defined at the time of creation, and their placement inferred automatically. Examples include uPheno-compliant ontologies such as XPO, ZP, and PLANP. This has the advantage of requiring no retrospective de-tangling, thus saving on wasted effort.

In practice it’s rarely the case that we perfectly adhere to the normalization pattern. In particular, we rarely ‘bottom out’ at pure tree-based ontologies. Usually there is a chain of dependencies from more compositional to less compositional, with the terminal ontologies in the dependency tree being more tree-like. It should also be noted that the practice of normalization and avoidance of asserted multiple is-a parents has sometimes been mistaken for a principle that multiple parents are bad. This misconception is addressed in a separate post.

It is also considerably easier to do this now than when Rector wrote the paper. Protege has seen numerous improvements. One game-changer was the advent of fast reasoners such as ELK, which reasons over the EL subset of OWL (or a close enough approximation), which is sufficient to cover the constructs described in the Rector paper, and thus sufficient for basic normalization.

We also have a number of systems for creating normalized ontology classes using templates, where the template corresponds to a design pattern. These include Dead Simple OWL Design Patterns, ROBOT templates, TermGenie, and Tawny-OWL. These allow ontology developers to author logical definitions for classes without directly writing any OWL axioms. An ontologist defines the pattern in advance and the ontology developer can simply fill in ‘slots’.

[Figure: example template for GO biochemical process classes, from Hill et al (note some relations may have changed). Using a templating system, a curator need only select the template (e.g. biosynthetic process) and values for the template slots (e.g. X=alanine, a class from ChEBI).]
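For instance, filling this template with X=alanine would yield a logical definition along these lines (the relation name is illustrative and may differ in current GO):

'alanine biosynthetic process' EquivalentTo
    'biosynthetic process' and ('has primary output' some alanine)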

Where are we now?

The Rector 2003 paper ends with “we believe that if the potential of OWL and related DL based formalisms is to be realised, then such criteria for normalisation need to become well defined and their use routine”. Fast forward sixteen years to 2019, and the bio-ontology world is still engaged in a process of untangling ontologies. Although we have made good progress, some ontologies are still tangled, and many continue to assert hierarchies that could be inferred. Why? There may be a number of reasons for this, including:

It’s tempting to hack


It can be tempting to “hack” together an asserted hierarchy as opposed to constructing an ontology in a modular fashion. This is especially true for ontology developers who have not been trained in techniques like modular design. We see this trade-off in software development all the time: the quick hack that grows into an unmaintainable beast.

Retrospective untangling gets harder the larger the ontology becomes. This is a good example of the technical debt concept from software engineering, outlined in the previous post. Even for well-normalized ontologies, a tangled remnant remains, leading to ballooning technical debt.

We’re still not there with tools

Even where we have the tools, they are not universally used. ChEBI is one of the most widely used bio-ontologies, but it currently lacks logical definitions. This is in part because it is developed as a traditional database resource rather than an ontology. Curators use a specialized database interface that is optimized for things such as chemical structures, but lacks modern ontology engineering features such as authoring OWL axioms or integrating reasoning with curation.

Untangling is hard!


Untangling can be really hard. Sometimes the untangling involves hierarchies that have been conceptually baked in for centuries, along with apparent contradictions. For example, consider the classification of the second cranial nerve as a nerve, and as part of the central nervous system, alongside the classification of nerves as part of the peripheral nervous system (see the example in this post). Trying to tease the different nuances apart into well-behaved normalized ontologies can be hard.

It should be noted that not everything in biology is amenable to this kind of normalization. A common mistake is over-stating logical definitions (there will be a future post on this). Still, there is no shortage of named concepts in the life sciences, represented in ontologies, that are trivially compositional and amenable to normalization.

Sociotechnological issues confound the problem

In principle the Rector criterion that “maintenance can be split amongst authors who can work independently” is a good one, but it can lead to sociotechnological issues. For example, it is often the case that the larger domain ontologies that are the subject of untangling receive more support and funding than the primitive skeleton ontologies. This is not surprising, as the kinds of concepts required for curators to annotate biological data will often be more compositional, and thus closer to the ‘tip’ of an ontology dependency tree. Domain ontology developers accustomed to moving fast and needing to satisfy term-hungry curators will get frustrated if their requested changes in dependent module ontologies are not acted on quickly, necessitating “patches” in the domain ontology.

Another important point is that the people developing the domain ontology are often different from the people developing the dependent ontology, and may have valid differences in perspective. Even among willing and well-funded groups, this can take considerable effort and expertise to work through.

For me the paradigmatic example of this was the effort required to align perspectives between GO and ChEBI such that ChEBI could be used as a ‘skeleton’ ontology in the Rector approach.  This is described in Dovetailing biology and chemistry: integrating the Gene Ontology with the ChEBI chemical ontology (Hill et al, 2013). 

For example, a nucleotide that contains a nucleobase, a sugar and at least one phosphate group would be described as a carbohydrate by a carbohydrate biochemist, who is primarily concerned with the reactivity of the carbohydrate moiety of the molecule, whereas general organic chemists might classify it as a phosphoric ester.

Both of the above classifications are correct chemically, but they can lead to incorrect inferences when extrapolated to the process hierarchy in the GO. Consider ‘nucleotide metabolic process’ (GO:0009117), ‘carbohydrate metabolic process’ (GO:0005975) and ‘phosphate metabolic process’ (GO:0006796) in the GO. If ‘nucleotide metabolic process’ (GO:0009117) were classified as both is_a ‘carbohydrate metabolic process’ (GO:0005975) and is_a ‘phosphate metabolic process’ (GO:0006796) to parallel the structural hierarchy in ChEBI, then the process that results in the addition of a phosphate group to a nucleotide diphosphate would be misleadingly classified as is_a ‘carbohydrate metabolic process’ (GO:0005975). This is misleading because, since the carbohydrate portion of the nucleotide is not being metabolized, biologists would not typically consider this to be a carbohydrate metabolic process.

This situation was resolved by decreasing is-a overloading in ChEBI through the addition of the functional-parent relationship. But the process of understanding these nuanced differences and resolving them to everyone’s satisfaction can take considerable time and resources. Again, it is harder and more expensive to do this retrospectively; it’s always better to normalize prospectively.

Thankfully this was all resolved, and GO is able to leverage ChEBI as a skeleton ontology in the automatic classification of metabolic processes, as illustrated here:

[Figure: automatic classification of GO metabolic process classes using ChEBI as a skeleton ontology]

This is one of the motivations for the creation and ongoing operation of the Open Biological and Biomedical Ontologies (OBO) Foundry. One of the motivating factors in the creation of OBO was the recognized need for modular ontology construction and the coordination of the different modules. For me, one of the inspirations for OBO was my involvement in BioPerl development (BioPerl was one of a family of bioinformatics libraries, including BioPython and BioJava). At that time, there was a growing recognition that bioinformatics software was facing a crisis due to unmaintainable code (everyone from that time remembers the “quick perl script” that became enshrined as monolithic infrastructure). The ‘Bio*’ projects attempted to follow best software engineering practice and to break complex code into maintainable modules. Often those modules would be developed by distinct people, but the BioPerl maintainers ensured that these worked together as a cohesive whole.

Summary

  • All ontology developers should familiarize themselves with the Rector 2003 paper
  • The approach described is particularly useful for larger ontologies that have a high number of compositional classes to manage. Note the vast majority of ontologies in OBO have branches that are to some degree explicitly or implicitly compositional.
  • The sooner you adopt this the better – retrospective normalization is harder than prospective.
  • Although the approach described is independent of particular technology choice, the adoption of explicit design patterns and a system such as DOSDP-tools, ROBOT templates, or Tawny-OWL is recommended. 
  • Sometimes the normalization approach involves identifying modules within your own ontology. Sometimes it requires use of an external ontology. This can pose challenges, but rather than give up without trying, use the OBO approach. File tickets on the external ontology tracker, communicate your requirements publicly (it is likely others have the same requirements).

OntoTip: Clearly document your design decisions

When building a bio-ontology, we frequently make design decisions regarding how we choose to model a particular aspect of the domain. Developing ontologies is not simply a matter of collecting terms and textual definitions, nor is it a matter of recording observations. It involves modeling decisions that reflect how we want to slice and dice the various different generalizations of biological phenomena. These modeling decisions frequently involve trade-offs between different use cases and other factors such as complexity of the ontology.  Sometimes these modeling decisions are made by individual ontology editors; sometimes they are made by a larger group, such as a committee or a content meeting combining domain experts and ontologists. Making these design decisions transparent is really important for making your ontology more usable, and more sustainable.

[Figure: model of a generalized eukaryotic cell. Bio-ontologists build models of biological entities and phenomena, such as the internal structure and components of a cell. This is guided both by the underlying ground-truth reality and by design decisions about how to carve up different parts, where to draw boundaries, and how best to generalize over variation in nature.]

A note for people reading this from outside the OBO community: bio-ontologies are frequently large, involving thousands or tens of thousands of classes, with varying levels of axiomatization. They are commonly used by biocurators for annotation of biological data and entities such as genes or genomic features. Here ‘annotation’ means creating some kind of association between an entity of interest and an ontology class, typically with provenance and evidence information. These ontologies are usually built by biocurators with biology backgrounds and broad knowledge of a particular domain. Some of the concerns may differ from those of the more ‘data model’ oriented ontologies found outside the life sciences and biomedicine, but some of the concerns may be the same.

Some examples of design decisions:

  • For a biological process or pathway, deciding on the starts and ends of a process (which constrains what the parts of a process can be). For example, does a signaling pathway start with the binding between a receptor activity and a ligand? Does it end with the activity of a transcription factor?
  • For brain regions, how should we draw the boundaries? E.g. does the hippocampus include the dentate gyrus? Do we include different classes to accommodate different groups’ boundary preferences (thus introducing complexity in both nomenclature and ontology structure) or choose to follow a particular standard or preferences of an individual group or researcher (potentially alienating or limiting the applicability of your ontology to these other groups)?
  • How do we represent the relationship between the PNS and the CNS? Do we allow overlap, or do we model as spatially disjoint? There are distinct consequences of each choice that may not be clear from the outset.
  • How should we represent structures such as a vertebra, which can exist in both cartilage form and bony form? (with variation potentially on an ontogenic/developmental axis, and potentially on a phylogenetic axis, e.g. sharks have cartilaginous skeletons). If we bake in assumptions drawn from fully formed humans (i.e. that vertebra is a subClassOf bone), this limits applicability to either developmental biology use cases, or comparative anatomy. In Uberon, we have an endochondral element design pattern, with a triad of structures: the composition-agnostic superclass, and  bony and cartilaginous subclasses. This ensures maximum applicability of the ontology, with annotators choosing the subclass that is appropriate to their organism/time stage. However it comes at some cost of nomenclature complexity, inflation of classes, and potential for annotators to accidentally select the wrong class
  • How should the different subtypes of skeletal tissue be modeled, where divisions can be along a continuum rather than discrete groups? How should the different skeletal elements be related to the tissue that composes them? Should we have distinct classes for ‘bone tissue’ and ‘bone element’?
  • How should environmental processes such as deforestation be linked to environmental physical entities such as forests? What relations should connect these, and what should the logical axioms for both look like?
  • How do we handle chemical entities such as citric acid and citrate which are formally chemically distinct, yet may be interchangeable from a biological perspective? See Hill et al.
  • Which upper ontology classes should be used (if any)? In order to represent the lumen of a subcellular organelle, do we model this as an immaterial entity (thus forcing this class to be in a different subclass hierarchy from the other parts, such as the membrane), or in the same material entity hierarchy? Some ontologies such as OGMS and IDO make use of a lot of different BFO classes, other disease ontologies use fewer (note there will be another post on this topic…)
[Figure: example of the endochondral pattern in Uberon. The vertebra exists in up to three states: pre-cartilage, cartilage, and bone (the latter absent in cartilaginous fish). A generic “vertebral element” class captures the vertebra in a composition-agnostic grouping. Subclasses are defined using OWL equivalence axioms, e.g. ‘vertebra cartilage element’ = ‘vertebral element’ and ‘composed primarily of’ some ‘cartilage tissue’.]
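In Manchester syntax, the triad might be axiomatized roughly as follows (a sketch based on the equivalence axiom in the caption; the bone case is the analogous assumption, not a quoted Uberon axiom):

'vertebra cartilage element' EquivalentTo
    'vertebral element' and ('composed primarily of' some 'cartilage tissue')
'vertebra' EquivalentTo
    'vertebral element' and ('composed primarily of' some 'bone tissue')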

Whatever the ontology, and whatever the design decision you and your fellow editors make, I can guarantee that someone will not like that decision, or, more frequently, will fail to understand it. This often results in confusion and annoyance when trying to use an ontology. Annotators may be presented with two similar-sounding classes and may not know the background and nuanced reasons you modeled things that way. This can result in frustration, and in inconsistency in how the ontology is applied (with some annotators opting for class A, and some for class B). Sometimes this inconsistency is not noticed until years later, after substantial resources have been devoted to annotation. The resulting corpus is far less useful because of this inconsistency in usage. This is something you want to take immediate prospective steps to avoid.

Documenting design decisions in an easy to comprehend way is also vital for maintainability of an ontology. Maybe you are lucky to have mad ontology skillz and have thought deeply and very hard about your domain, and have an elaborate internal model of how everything fits together, and you can easily slot terms into the perfect place in the ontology with ease. If this is the case, pause reading this for now and read up about the Bus Factor. This is a concept originally drawn from software engineering — basically, if you get hit by a bus, then no one will be able to carry on the development of your ontology since all the key knowledge is in your head. I should stress this is a metaphorical bus, there is no actual bus driving around mowing down ontologists (although some may find it tempting).

If you document all design decisions it makes it easier for people to come on board and make edits to the ontology in ways that don’t introduce incoherencies. It makes it easier for annotators to understand your intent, reducing frustration, and making it less likely that the ontology is applied inconsistently.

Note that when I am talking about documentation here, I mean documentation in addition to well-formed textual definitions. The topic of writing good text definitions is deserving of its own post, and indeed a future post in this series will be dedicated entirely to definitions. While including good definitions is a necessary condition of a well-documented ontology, it’s a mistake to assume it’s sufficient. This is particularly true for ontologies that incorporate a lot of nuanced fine-grained distinctions. While these can seem like intricate Swiss watches to the designers, they may resemble Rube-Goldberg contraptions to some users.

gears-1334564_960_720

How an ontology looks to its designer

Rube Goldberg's "Self-Operating Napkin" (cropped)

How an ontology sometimes looks to users

Hopefully this has convinced you (or you were already convinced). So how should you go about documenting these decisions?

There is no one correct way, but I will provide some of my own recommendations here. I should also note that ontologies I work on often fall short of some of these. I will provide examples from various ontologies.

Manage your ontology documentation as embedded or external documents as appropriate

Documentation can either be embedded in the ontology, or external.

drawing for blog post

Example of embedded and external documentation. The box surrounded by dotted lines denotes the OWL file. The OWL file contains (1) annotations on classes with URL values pointing to external docs; (2) annotations on classes with human-readable text as values; (3) a documentation axiom on a class, where the axiom is annotated with a URL; (4) design pattern YAML, which lives outside the OWL file but is managed as a text file in GitHub – the DP framework generates axioms which can be auto-annotated with documentation axioms; (5) external documentation, e.g. on a wiki, containing more detailed narrative formatted text, with images, figures, examples, etc.

Embedded documentation is documentation contained in the ontology itself, usually as annotation assertion axioms, such as textual definitions, rdfs:comments, etc. (Note here I am using “annotation” in the OWL sense, rather than the sense of annotating data and biological entities using ontologies).

Embedded documentation “follows the ontology around”, e.g. it is present when people download the OWL, it should be visible in ontology browsers (although not all browsers show all annotation properties).

Embedded documentation is somewhat analogous to inline documentation in software development, but a key difference is that ontologies are not encapsulated in the same way; inline documentation in software is typically only visible to developers, not users. (The analogy fits better when thinking about inline documentation for APIs that gets exposed to API users). It is possible for an ontology to include embedded documentation that is only visible to ontology developers, by including a step in the ontology release workflow for stripping out internal definitions. See the point below about eliminating jargon.

External documentation is documentation that exists outside the ontology. It may be managed alongside the ontology as text documents inside your GitHub repo, and version controlled (you are using version control for your ontology, aren’t you? If not, stop now and go back to the first post in this series!). Alternatively, it may be managed outside the repo, as a google doc, or in an external wiki. If you are using google docs to manage your documentation, then standard google doc management practice applies: keep everything well-organized in a folder rather than headless; use meaningful document titles (e.g. don’t call a doc “meeting”); make all your documentation world-readable; and allow comments from a broad section of your community. If you are using mediawiki then categories are very useful, especially if the ontology documentation forms part of a larger corpus or project documentation. Another choice is a system like readthedocs or mkdocs. If for some unfathomable reason you want to use Word docs, then obviously you should be storing these in version control or in the cloud somewhere (it’s easier to edit Word docs via google docs now), not on your hard drive or in email attachments. For external documentation I would recommend something that makes it easy to provide a URL that takes you to the right section of the text. A Word doc is less suited to this.

You could also explore various minipublication mechanisms. You could publish design documents using Zenodo, and get a DOI for them. This has some nice features, such as easy tracking of different versions of a document, and more explicit attribution than something like a google doc. Sometimes actual peer-reviewed manuscripts can serve as documentation; for example, the Vertebrate Skeleton Anatomy Ontology paper was written after an ontology content meeting involving experts in comparative skeletal anatomy and expert ontologists. However, peer-reviewed manuscripts are hard to write (and ontology papers often take a long time to get reviewed). Even non-peer-reviewed manuscripts can be more time-intensive to write than less formal documentation. Having a DOI is not essential; it’s more important to focus on the documentation content itself and not get too tied to mechanism.

I personally like using markdown format for narrative text. It is easy to manage under version control, it is easy for people to learn and edit, the native rendering in GitHub is nice, it can easily be converted to formats like HTML using pandoc, and it works in systems like readthedocs, as well as in GitHub tickets. Having a standard format allows for easy portability of documentation. Whatever system you are using, avoid ‘vendor lock-in’: it should be easy to migrate your documentation to a new system. We learned this the hard way when googlecode shut down – the wiki export capabilities turned out not to capture everything, which we only discovered later on.

One advantage of external docs is that they can be more easily authored in a dedicated document-authoring environment. If you are editing embedded documentation as long chunks of text using Protege, you have limited support for formatting the text or embedding images, and there is no guarantee about how formatting will be rendered in different systems.

However, the decoupling of external docs from the ontology itself can lead to things getting out of sync and getting stale. Keeping things in sync can be a maintenance burden. There is no ideal solution to this but it is something to be aware of.

An important class of documentation is structured templated design pattern specification files, such as DOSDPs or ROBOT templates. This will be a topic of a future post. The DOSDP YAML file is an excellent place to include narrative text describing a pattern, and the rationale for that pattern (see for example the carcinoma DP documentation in Mondo). These could be considered embedded, with the design pattern being a “metaclass” in the ontology, but it’s probably easier to consider these as external documentation. (in the future we hope to have better tools for compiling a DP down into human-friendly markdown or HTML, stay tuned).

Another concept from software development is literate programming. Here the idea is that the code is embedded in narrative text/documentation, rather than vice versa. This can be applied to ontology development as this paper from Phil Lord and Jennifer Warrender demonstrates. I think this is an interesting idea, but it still remains hard to implement for ontologies that rely on a graphical ontology editing environment like Protege, rather than coding an ontology using a system like Tawny-OWL.

Provide clear links from sections of the ontology to relevant external documentation

When should documentation be inlined/embedded, and when should it be managed externally? There is no right answer, but as a rule of thumb I would keep embedded docs to a few sentences per unit of documentation, with anything larger being managed externally. With external docs it’s easier to use formatting, embed figures, etc. Wherever you choose to draw the line, it’s important to embed links to external documentation in the ontology. It’s all very well having reams of beautiful documentation, but if it’s hard to find, or it’s hard to navigate from the relevant part of the ontology to the appropriate section of the documentation, then it’s less likely to be read by the people who need to read it. Ideally everyone would RTFM in detail, but in practice you should assume that members of your audience are incredibly busy and thus appreciate being directed to portions that most concern them.

The URLs you choose to serve up external documentation from should ideally be permanent. Anecdotally, many URLs embedded in ontologies are now dead. You can use the OBO PURL system to mint PURLs for your documentation.

Links to external documentation can be embedded in the ontology using annotation properties such as rdfs:seeAlso.

For example, the Uberon class for the appendicular skeleton has a seeAlso link to the Uberon wiki page on the appendages and the appendicular skeleton.
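In Turtle, embedding such a link is a single triple. A minimal sketch of the pattern (the UBERON ID is shown for illustration, and the wiki URL below is a made-up placeholder, not the actual page address):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix UBERON: <http://purl.obolibrary.org/obo/UBERON_> .

# hypothetical documentation URL; substitute the real wiki page
UBERON:0002091 rdfs:seeAlso <https://example.org/uberon-wiki/appendicular-skeleton> .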

Add documentation to individual axioms where appropriate

As well as class-level annotation, individual axioms can be annotated with URLs, giving an additional level of granularity. This can be very useful, for example to show why a particular synonym was chosen, or why a particular part-of link is justified.

A useful pattern is annotating embedded documentation with a link to external documentation that provides more details.
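In the RDF serialization of OWL, an annotated axiom is reified using owl:Axiom. Here is a minimal Turtle sketch of the pattern (the classes and documentation URL are hypothetical):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <https://example.org/ont/> .

ex:A rdfs:subClassOf ex:B .

# annotation on the axiom above, pointing to the design rationale
[] a owl:Axiom ;
  owl:annotatedSource ex:A ;
  owl:annotatedProperty rdfs:subClassOf ;
  owl:annotatedTarget ex:B ;
  rdfs:seeAlso <https://example.org/docs/why-A-is-a-B> .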

Unfortunately not all browsers render annotations on axioms, but this is something that can hopefully be resolved soon.

A “legacy documentation pattern” that you will see in some ontologies like GO is to annotate an annotation assertion with a CURIE-style identifier that denotes a content meeting. For example, the class directional locomotion has its text definition axiom annotated with the dbxref “GOC:mtg_MIT_16mar07”. In browsers like AmiGO, OLS, and OntoBee this shows up as the bare string “GOC:mtg_MIT_16mar07”. Obviously this is pretty impenetrable to the average user, and it should actually link to the relevant wiki page. We are actively working to fix this!
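Under the hood, this legacy pattern is just an annotation on the definition axiom. A Turtle sketch of its shape (the GO ID and definition text are illustrative placeholders, not copied from GO; obo:IAO_0000115 is the ‘definition’ property):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> .

# definition axiom annotated with a dbxref identifying the content meeting
[] a owl:Axiom ;
  owl:annotatedSource obo:GO_0033058 ;   # 'directional locomotion' (ID illustrative)
  owl:annotatedProperty obo:IAO_0000115 ;
  owl:annotatedTarget "(text definition elided)" ;
  oboInOwl:hasDbXref "GOC:mtg_MIT_16mar07" .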

Don’t wait: document prospectively

The best time to document is as you are editing the ontology (or even beforehand). Documenting retrospectively is harder.

And remember, good documentation is often a love letter to your future self.

Run ontology content meetings and clearly document key decisions

The Gene Ontology (GO) has a history of running face-to-face ontology content meetings, usually based around a particular biological topic. During these meetings ontology developers experienced with the GO, annotators, and subject matter experts get together to thrash out new areas of the ontology, or improve existing areas. Many other ontologies do this too — for example, see table 1 from the latest HPO NAR paper.

Organization | Location | Focus
Undiagnosed Diseases Network (UDN); Stanford Center for Inherited Cardiovascular Diseases (SCICD) | Stanford University, CA, USA (March 2017) | Cardiology
European Reference Network for Rare Eye Disease (ERN-EYE) | Mont Sainte-Odile, France (October 2017) | Ophthalmology
National Institute of Allergy and Infectious Disease (NIAID) | National Institutes of Health, Bethesda, MD, USA (May and July 2018) | Allergy and immunology
Neuro-MIG European network for brain malformations (www.neuro-mig.org) | St Julians, Malta; Lisbon, Portugal (February 2018; September 2018) | Malformations of cortical development (MCD)
European Society for Immunodeficiencies (ESID) and the European Reference network on rare primary immunodeficiency, autoinflammatory and autoimmune diseases (ERN-RITA) | Vienna, Austria (September 2018) | Inborn errors of immunity

Community workshops and collaborations aimed at HPO content expansion and refinement (from Köhler et al 2019).

One thing that is lacking is a shared set of guidelines across OBO for running a successful content meeting. At a minimum, it is important to take good notes, make summaries of them, and link these to the relevant areas of the ontology.

A situation you want to avoid is, ten years down the line, needing to refactor some crucial area of the ontology, and having some vague recollection that you modeled it as X rather than Y because that was the preference of the experts, but having no documentation of exactly why they preferred things that way.

Don’t present the user with impenetrable jargon

It is easy for groups of ontology developers to lapse into jargon, whether it is domain-specific jargon, ontology-jargon, or jargon related to their ontology development processes (is there a jargon ontology?).

As an example of ontology jargon, see some classes from BFO such as generically dependent continuant.

b is a generically dependent continuant = Def. b is a continuant that g-depends_on one or more other entities. (axiom label in BFO2 Reference: [074-001]) [http://purl.obolibrary.org/obo/bfo/axiom/074-001 ]

Although BFO is intended to be hidden from average users, it frequently ‘leaks’, for example through ontology imports.

Jargon can be useful as an efficient way for experts to communicate, but as far as possible it should be minimized, with the intended audience clearly labeled and supplemental documentation provided for average users.

Pay particular attention to documenting key abstractions (and simplify where possible)

Sometimes ontology developers like to introduce abstractions that give them the ability to introduce finer-grained distinctions necessary for some use case. An example is the endochondral element example introduced earlier. This can introduce complexity into an ontology, so it’s particularly important that these abstractions are well-documented.

One thing that consistently causes confusion in users who are not steeped in a particular mindset is the proliferation of similar-seeming classes under different BFO categories. For example, having classes for a disease-as-disposition, a disorder-as-material-entity, a disease-course-as-process, a disease-diagnosis-as-data-item, etc. You can’t assume your users have read the BFO manual. It’s really important both to document your rationale for introducing these duplicative classes, and to provide easy-to-consume documentation about how to select the appropriate class for different purposes.

Or perhaps you don’t actually need all of those different upper level categories in your ontology at all? This is the subject of a future post…

Sometimes less is more

More is not necessarily better. If a user has to wade through philosophical musings in order to get to the heart of the matter, they are less likely to actually read the docs.

Additionally, creating too much documentation can create a maintenance burden for yourself. This is especially true if the same information is communicated in multiple different places in the documentation.

Achieving the right balance can be hard. If you are too concise then the danger is the user has insufficient context.

Perfection is the enemy of the good: something is better than nothing

Given enough resources, everything would be perfectly documented, and the documentation would always be in sync. However, this is not always achievable. Rather than holding off on making perfect documentation, it’s better to just put what you have out there.

Perhaps the current state of documentation is a google doc packed with unresolved comments in the margins, or a confusing GitHub ticket with lots of unthreaded comments. It’s important that this can be easily navigated to from relevant sections of the ontology, by at least other ontology developers. I would also advocate for inlining links to this documentation from inside the ontology; this can be clearly labeled as being links to internal documentation so as not to violate the no-jargon principle.

Overall it is both hard and time-consuming to write optimal documentation. When I look back at documentation I have written, I often feel I haven’t done a great job: I use jargon too much, or crucial nuances are not well communicated. But we are still learning as a community what the best practices are here, and most of us are drastically under-resourced for ontology development, so all we can do is our best, and hope to learn and improve.

Never mind the logix: taming the semantic anarchy of mappings in ontologies

Mappings between ontologies, or between an ontology and an ontology-like resource, are a necessary fact of life when working with ontologies. For example, GO provides mappings to external resources such as KEGG, MetaCyc, RHEA, EC, and many others. Uberon (a multi-species anatomy ontology) provides mappings to species-specific anatomy ontologies like ZFA, FMA, and also to more specialized resources such as the Allen Brain Atlases. These mappings can be used for a variety of purposes, such as data integration – data annotated using different ontologies can be ‘cross-walked’ to use a single system.

OxO Mappings: mappings between ontologies and other resources, visualized using OxO, with UBERON mapping sets highlighted.

Ontology mapping is a problem. With N resources, each providing its own mappings to the others, we have the potential for N^2-N sets of mappings. These are expensive to produce and maintain, inherently error-prone, and can be frustrating for users if mappings do not globally agree. With the addition of third-party mapping providers, the number of combinations increases further.

One approach is to make an ‘uber-ontology’ that unifies the field, and do all mappings through this (reducing the number of mappings to N, and inferring pairwise mappings). But sometimes this just ends up producing another resource that needs to be mapped. And so the cycle continues.

managing-mappings-in-robot-e1558925811975.png

N^2 vs Uber. With 4 ontologies, we have 12 sets of mappings (each edge denotes 2 sets of mappings, since reciprocal calls may not agree). With the Uber approach we reduce this to 4, and can infer the pairwise mappings (inferred mapping sets shown as dotted lines). However, the Uber may become just another resource to be mapped, meaning we now have 20 mapping sets.

Ideally we would have less redundancy and more cooperation, reducing the need for mappings. The OBO Foundry is based on the idea of groups coordinating and agreeing on how a domain is to be carved up, reducing redundancy, and leading to logical relationships (not mappings) between classes. For example, CHEBI and metabolic branches of GO cover different aspects of the same domain. Rather than mapping between classes, we have logical relationships, such as GO:serine biosynthesis has-output CHEBI:serine.
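Such a logical relationship is just an ordinary existential restriction in OWL, not a mapping annotation. A Turtle sketch, with readable hypothetical CURIEs standing in for the real numeric GO, RO, and CHEBI IDs:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <https://example.org/> .

# 'serine biosynthesis' (GO) has-output 'serine' (CHEBI)
ex:serine_biosynthesis rdfs:subClassOf [
  a owl:Restriction ;
  owl:onProperty ex:has_output ;
  owl:someValuesFrom ex:serine
] .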

Even within OBO, mappings can be useful. Formally, Uberon is orthogonal to species-specific anatomy ontologies such as ZFA. Classes in Uberon are formally treated as superclasses of ZFA classes, so these links are not really ‘mappings’. But for practical purposes, it can help to treat them the same way we treat mappings between an OBO class and an entry in an outside resource, because people want to operate on them in the same way as they do other mappings.

Ontology mapping is a rich and active field, encompassing a large variety of techniques, leveraging lexical properties or structural properties of the ontology to automate or semi-automate mappings. See the Ontology Alignment Evaluation Initiative for more details.

I do not intend to cover alignment algorithms here, rich and interesting a topic as this is (it may be the subject of a future post). I want to deal with the more prosaic issue of how we provide mappings to users, which is not something we do a great job of in OBO. This is tied to the issue of how ontology developers maintain mappings for their ontology, which is also something we don’t do a great job of. I want to restrict this post to the subject of how we represent mappings in the ontology files we produce for the community; mappings can also be queried via APIs, but that is another topic.

This may not be the most thrilling topic, but I bet many of you have struggled with and cursed at this issue for a while. If so, your comments are most welcome here.

There are three main ways that mappings are handled in the OWL files we produce (including obo format files; obo format is just another serialization of OWL), and this can cause confusion. These are: direct logical axioms, xrefs, and skos. You might ask why we don’t just pick one. The answer is that each serves overlapping but distinct purposes. Also, there are existing infrastructures and toolchains that rely on doing it one way, and we don’t want to break things. But there are probably better ways of doing things; this post is intended to spur discussion on how to do this better.

Expressing Mappings in OWL

Option 1. Direct logical axioms

OWL provides constructs that allow us to unambiguously state the relationship between two things (regardless of whether the things are in the same ontology or two different ones). If we believe that GO:0000010 (trans-hexaprenyltranstransferase activity) and RHEA:20836 are equivalent, we can write this as:

GO:0000010 owl:equivalentClass RHEA:20836

This is a very strong statement to make, so we had better be sure! Fortunately RHEA makes the semantics of each of its entries very precise, with a specific CHEBI ID (each with a defined chemical structure) for each participant:
Screen Shot 2019-05-26 at 7.58.48 PM.png

If instead we believe the GO class to be broader (perhaps if the reactants were broader chemical entities) we could say

RHEA:20836 rdfs:subClassOf GO:0000010

(there is no superClassOf construct in OWL, so we must express this as the semantically equivalent structural form with the narrower class first).

In this particular case, however, the relationship is indeed equivalence. Note that GO and RHEA curators have had many extensive discussions about the semantics of their respective resources, so we can be extra sure.

Sometimes the relationship is more nuanced, but if we understand the OWL interpretation of the respective classes we can usually write the relationship in a precise an unambiguous way. For example, the Uberon class heart is species-agnostic, and encompasses the 4 chambered heart of mammals as well as simpler structures found in other vertebrates (it doesn’t encompass things like the dorsal vessel of Drosophila, but there is a broader class of circulatory organ for such things). In contrast the Zebrafish Anatomy (ZFA) class with the same name ‘heart’ only covers Danio.

If you download the uberon OWL bridging axioms for ZFA, you will see this is precisely expressed as:

ZFA:0000114 EquivalentTo (UBERON:0000948 and part_of some NCBITaxon:7954)

(switching to Manchester syntax here for brevity)

i.e. the ZFA heart class is the same as the Uberon heart class when that heart is part of a Danio. In Uberon we call this axiom pattern a “taxon equivalence” axiom. Note that this axiom entails that the Uberon heart subsumes the ZFA heart.

Venn diagram illustrating intersection of uberon heart and all things zebrafish is the zebrafish heart
There are obvious advantages to expressing things directly as OWL logical axioms. We are being precise, and we can use OWL reasoners both to validate and to infer relationships without programming ad-hoc rules.

For example, imagine we were to make an axiom in Uberon that says every heart has two ventricles and two atria (we would not in fact do this, as Uberon is species-agnostic, and this axiom is too strong if the heart is to cover all vertebrates). ZFA may state that the ZFA class for heart has a single one of each. If we then include the bridging axiom above we will introduce an unsatisfiability. We will break ZFA's heart. We don't want to do this, as Uberon ♥ ZFA.

As another example, if we make a mistake and declare two distinct GO classes to be equivalent to the same RHEA class, then through the properties of transitivity and symmetry of equivalence, we infer the two GO classes to be equivalent.
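To make that failure mode concrete, here is a minimal Turtle sketch (all IDs hypothetical):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex: <https://example.org/> .

# two distinct GO classes both declared equivalent to the same RHEA class
ex:GO_A owl:equivalentClass ex:RHEA_X .
ex:GO_B owl:equivalentClass ex:RHEA_X .
# entailed, by symmetry and transitivity: ex:GO_A owl:equivalentClass ex:GO_B

A reasoner will merge GO_A and GO_B into one equivalence set, which is almost certainly not what the editor intended.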

Things get even more interesting when multiple ontologies are linked. Consider the following, in which the directed black arrows denote subClassOf, and the thick blue lines indicate equivalence axioms. Note that all mappings/equivalences are locally 1:1. Can you tell which entailments follow from this?

3 way equivalence sets

Answer: everything is entailed to be equivalent to everything else! It’s just one giant clique (this follows from the transitivity property of equivalence; as can be seen, anything can be connected by simply hopping along the blue lines). This is not an uncommon structure, as we often see a kind of “semantic slippage” where concepts shift slightly in concept space, leading to global collapse.

mappings between EC, GO, MetaCyc, Rhea, and KEGG

Above is another, more realistic example, in which we treat the mutual mappings between EC, GO, MetaCyc, RHEA, and KEGG as equivalence. Grey lines indicate mappings provided by individual sources. Although a mapping nominally means the two entries are the same, this cannot always be the case: as we follow links we traverse up and down the hierarchy, illustrating how ‘semantic slippage’ between similar resources leads to incoherence.

As we use ROBOT as part of the release process, we automatically detect this using the reason command, and the ontology editor can then fix the mappings.

Because equivalence means that any logical property of one class can be substituted for the other, users can be confident in data integration processes. If we know the RHEA class has a particular CHEBI chemical as a participant, then the equivalent GO class will have the same CHEBI class as a participant. This is very powerful! We intend to use this strategy in the GO. Because RHEA is an expert-curated database of reactions, it doesn’t make sense for GO to replicate that work in the leaf nodes of the GO MF hierarchy. Instead we declare the GO MF and RHEA classes as equivalent, and bring across expert-curated knowledge, such as the CHEBI participants (this workflow is in progress, stay tuned).

Screen Shot 2019-05-26 at 7.58.48 PM

Coming soon to GO: Axiomatization of reactions using CHEBI classes via RHEA

So why don’t we just express all mappings as OWL logical axioms and be done with it? Well, it’s not always possible to be this precise, and there may be additional pragmatic concerns. I propose that the following criteria SHOULD or MUST be satisfied when making an OWL logical axiom involving an external resource:

  1. The external resource MUST provide a URI denoting each entity, and that URI SHOULD be minted by the external resource rather than a 3rd party.
  2. The external resource SHOULD have a canonical OWL serialization, maintained by the resource.
  3. That OWL serialization MUST be coherent and SHOULD accurately reflect the intent of the maintainer of that resource. This includes any upper ontology commitments.

The first criterion is fairly mundane but often a tripping point. You may have noticed in the axioms above I wrote URIs in CURIE form (e.g. GO:0000010). This assumes the existence of prefix declarations in the same document. E.g.

Prefix GO: <http://purl.obolibrary.org/obo/GO_>

Prefix UBERON: <http://purl.obolibrary.org/obo/UBERON_>

Prefix ZFA: <http://purl.obolibrary.org/obo/ZFA_>

Prefix RHEA: <http://rdf.rhea-db.org/>

For any ontology that is part of OBO, or any ontology ‘born natively’ in OWL, the full URI is known. However, if we want to map to a resource like OMIM, do we use the URL that resolves to the website entry? These things often change (at one point they were NCBI URLs). Perhaps we use the identifiers.org URL? Or the n2t.net one? Unless we have consensus on these things, different groups will make different choices, and things won’t link up. It’s an annoying issue, but a very important and expensive one. It is outside the scope of this post, but important to bear in mind. See McMurry et al for more on the perils of identifiers.

The second and third criteria pertain to the semantics of the linked resource. Resources like MESH take great care to state that they are not ontologies, so treating MESH as an ontology of OWL classes connected by subClassOf is not really appropriate (and gets you to some strange places). The same goes for UMLS, which contains cycles in its subClassOf graph. Even in cases where the external resource is an ontology (or believes itself to be), can you be sure they are making the same ontological commitments as you?

This is important: in making an equivalence axiom, you are ‘injecting’ entailments into the external resource, when all resources are combined (i.e. a global view). This could lead to global errors (i.e. errors that are only manifest when all resources are integrated). Or it could be seen as impolite to inject without commitment from the maintainers of the external resource.

Scenario: If I maintain an ontology of neoplasms, and I have axioms stating my neoplasms are BFO material entities, and I make equivalence axioms between my neoplasms and the neoplasm hierarchy in NCIT, I may be ignoring an explicit non-commitment about the nature of the neoplasm hierarchy in NCIT. This could lead to global errors, such as when we see that NCIT classifies Lynch syndrome in the neoplasm hierarchy (see figure). Also, if I were the NCIT maintainers, I might be a bit miffed about other people making ontological commitments on my behalf, especially if I didn’t agree with them.
ncit.png

Example of injecting commitments. White boxes indicate current NCIT classes, arrows are OWL subClassOf edges. The yellow ontology insists the NCIT neoplasm is equivalent to its neoplasm, which is committed to be a material entity. The cyan ontology doesn’t care about neoplasm per se, and wants to make the NCIT class for generic disorder equivalent to its own genetic disease, which is committed to be a BFO disposition (BFO is black boxes), which is disjoint with material entity. As a result, the global ontology that results from merging these axioms is incoherent: HNCC and its subclass Lynch syndrome become unsatisfiable.

Despite these caveats, it can sometimes be really useful to ‘overstate’ and make explicit logical axioms even when the technical or semantic criteria are not met. These logical axioms can be very powerful for validation and data integration. However, I would recommend in general not distributing these overstated axioms with the main ontology. Instead they can be distributed as separate bridging axioms that must be explicitly included, with the bridge axioms and any caveats documented. An example of this is Uberon’s bridge axioms to MOD anatomy ontologies.

To be clear, this caveat does not apply to cases such as axioms that connect GO and CHEBI. First these are not even ‘mappings’ except in the broadest sense. And second, there is clarity and agreement on the semantics of the respective classes so we can hopefully be sure the axioms make sense and don’t inject unwanted inferences.

In summary, OWL logical axioms are very powerful, which can be very useful, but remember, with great power comes great responsibility.

Option 2. Use oboInOwl hasDbXref property

Before there was OWL, there was OBO-Format. And lo, OBO-Format gave us the xref. Well, not really; the xref was just an example of the long-standing tradition of database cross-references in bioinformatics. In bioinformatics we love minting new IDs. For any given gene you may have its ENSEMBL ID, its MOD or HGNC ID, its OMIM ID, its NCBI Gene/Entrez ID, and a host of other IDs in other databases. The other day I caught my cat minting gene IDs. It’s widespread. This necessitates a system of cross-references. These are rarely 1:1, since there are reasons for representations in different systems to diverge. The OBO-Format xref was for exactly the same use case. When GO started, there were already similar overlapping databases and classifications, including longstanding efforts like EC.

In the OWL serialization of OBO-Format (oboInOwl) this becomes an annotation assertion axiom using the oboInOwl:hasDbXref property. Many ontologies such as GO, HPO, MONDO, UBERON, ZFA, DO, MP, CHEBI, etc continue to use the xref as the primary way to express mappings, even though they are no longer tied to obo format for development.

Below is an example of a GO class with two xrefs, in OBO format

[Term]
id: GO:0000010
name: trans-hexaprenyltranstransferase activity
namespace: molecular_function
def: "Catalysis of the reaction: all-trans-hexaprenyl diphosphate + isopentenyl diphosphate = all-trans-heptaprenyl diphosphate + diphosphate." [KEGG:R05612, RHEA:20836]
xref: KEGG:R05612
xref: RHEA:20836
is_a: GO:0016765 ! transferase activity, transferring alkyl or aryl (other than methyl) groups

The same thing in the OWL serialization:

<owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0000010">
  <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/GO_0016765"/>
  <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Catalysis of the reaction: all-trans-hexaprenyl diphosphate + isopentenyl diphosphate = all-trans-heptaprenyl diphosphate + diphosphate.</obo:IAO_0000115>
  <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">KEGG:R05612</oboInOwl:hasDbXref>
  <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">RHEA:20836</oboInOwl:hasDbXref>
</owl:Class>

Note that the value of hasDbXref is always an OWL string literal (e.g. “RHEA:20836”). This SHOULD always be a CURIE-syntax identifier (i.e. prefixed), although note that any expansion to a URI is generally ambiguous. The recommendation is that the prefix should be registered somewhere like the GO db-xref prefixes or prefixcommons, but prefix registries may not agree on a canonical prefix (see McMurry et al), leading to the need to repair prefixes when merging data. E.g. one group may use “MIM”, another “OMIM”.

This all poses the question:

So what does xref actually mean?

The short answer is that it can mean whatever the provider wants it to mean. Often it means something like “these two things are the same”, but there is no guarantee a mapping means equivalence in the OWL sense, or is even 1:1. Sometimes the xref is stretched for other use cases. In GO, we have always xreffed between GO classes and InterPro: this means “any protein with this domain will have this function” (which is incredibly useful for functional annotation). Xrefs between GO and Reactome mean “this Reactome entry is an example of this GO class”. Some ontologies like ORDO and MONDO have axioms on their annotations that attempt to provide additional metadata about the mapping, but this is not standardized. In the past, xrefs were used to connect phenotype classes to anatomy classes (e.g. for “abnormal X” terms); however, this usage has now largely been superseded by more precise logical axioms (see above) through projects like uPheno. In Uberon, an xref can connect equivalent classes, or taxon equivalents. Overall, xref is used very broadly, and can mean many things depending on unwritten rules.

This is SEMANTIC ANARCHY!

never mind the logix: picture of anarchist owl with anarchy symbol. ANARCHY IN THE ONTOLOGY [sex pistols font]

This causes some to throw their hands up in despair. However, many manage to muddle along. Usually xrefs are used consistently within an ontology for any given external resource. Ideally there is clear documentation for each set of mappings, but unfortunately this is not always the case. Many consumers of ontologies may be making errors and propagating information across xrefs that are not one-to-one or equivalent. In many scenarios this could result in erroneous propagation of gene functions, or erroneous propagation of information about a malignant neoplasm to its benign analog, which could have bad consequences.

Increasingly ontologies will publish more precise logical axioms alongside their xrefs (Uberon has always done this), but in practice the xrefs are used more widely, despite their issues.

How widely are they used? There are currently almost 1.5 million distinct hasDbXref values in OBO at the moment. 175 ontologies in OntoBee make use of hasDbXref annotations (may be an overestimate due to imports). The ontologies that have the most xrefs are PR, VTO, TTO, CHEBI, and MONDO (covering distinct proteins, taxa, chemicals – areas we would expect high identifier density). These have myriad uses inside multiple database pipelines and workflows, so even if a better solution to the xref is proposed, we can’t just drop xrefs as this would break all of the things (that would be truly anarchic).

But it must also be acknowledged that xrefs are crusty and have issues, see this comment from Clement Jonquet for one example.

Option 3. Use SKOS vocabulary for mapping properties

In the traditional tale of Goldilocks and the three OWLs, Goldilocks tries three bowls of semantic porridge. The first is too strong, the second too weak, and the third one is just right. If the first bowl is OWL logical axioms, the second bowl is oboInOwl xrefs, the third bowl would be the Simple Knowledge Organization System (SKOS) mapping vocabulary.

This provides a hierarchy of mapping properties:

  • mappingRelation
    • closeMatch
      • exactMatch
    • broadMatch
    • narrowMatch
    • relatedMatch

These can be used to link SKOS concepts across different concept schemes. The exactMatch property is transitive and symmetric, but is still weaker than owl equivalence as it lacks the property of substitutability. The SKOS properties are axiomatized, allowing entailment. Note that broadMatch and narrowMatch are not transitive, but they entail the transitive properties skos:broaderTransitive and skos:narrowerTransitive via their superproperties skos:broader and skos:narrower.

Using skos mapping relations, we can map between an OBO ontology and MESH without worrying about the lack of OWL semantics for MESH. We can use exactMatch for 1:1 mappings, and closeMatch if we are less confident. We don’t have to worry about injecting semantics – it’s just a mapping!
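For example (a Turtle sketch; the MeSH IRI scheme shown here is the id.nlm.nih.gov one, and the particular ID pairings are illustrative rather than vetted mappings):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix mesh: <http://id.nlm.nih.gov/mesh/> .

# a confident 1:1 mapping
obo:DOID_14330 skos:exactMatch mesh:D010300 .
# a looser mapping, where we are less sure of exactness
obo:DOID_14330 skos:closeMatch mesh:D020734 .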

Many people are like Goldilocks and find this to be just the right amount of semantics. But note that we can’t express things like our Uberon-ZFA heart relationship precisely here.

There are some other issues. SKOS doesn’t mix well with OWL, as the SKOS properties need to be object properties for the SKOS entailment rules to work, and this induces punning. See also SKOS and OWL, and the paper SKOS with OWL: Don’t be Full-ish! by Simon Jupp (I strongly approve of puns in paper titles). These outline some of the issues. However, for practical purposes I believe it is OK to mix SKOS and OWL.

It should also be noted that unlike oboInOwl xrefs, SKOS mapping relations should only be used between two URIs. This involves selecting a canonical URI for classes in a resource, which is not always easy (see notes on OWL above).

Where do we go now?

As I have hopefully shown, different representations of mappings serve different purposes. In particular, OWL direct axiomatization provides very precise semantics with powerful entailments, but its use sometimes involves overstepping and imposing ontological commitments, and it lacks a way to indicate fuzziness, e.g. when we want to make a non-1:1 mapping.

OboInOwl xrefs are somewhat surplus to requirements, given that we can express things a little more explicitly using SKOS while remaining just the right side of fuzziness. However, vast swathes of infrastructure ignore SKOS and expect xrefs (usually in OBO format).

I want it all!

So why not include xrefs, skos AND owl direct axioms in the release of an ontology? Well we have started to do this in some cases!

In MONDO, we publish an OWL version that has OWL equivalence axioms connecting to external resources. These are left ‘dangling’. A lot of tools don’t deal with this too well, so we also make an obo version that excludes these logical axioms. However, we use the equivalence axioms in Monarch, for consistency checking and data integration.

In both the obo format and owl editions, we include BOTH skos AND xrefs. Thus clients can choose whichever they like. The xrefs are more popular, and are consumed in many pipelines. They are expressed as CURIE-style IDs rather than URIs, which is annoying for some purposes but preferred for others. The skos mappings provide a bit more precision, allowing us to distinguish between close and exact mappings. They also connect IRIs.

Note the xrefs in MONDO also communicate additional information through axiom annotations. These could potentially be put onto both the skos and the OWL axioms but we haven’t done that yet.
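The shape of an annotated xref in Turtle looks like this (a sketch; the MONDO class and OMIM ID below are placeholders, and the oboInOwl:source value follows the same style as the ENVO example later in this post):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> .

# placeholder IDs
obo:MONDO_0000001 oboInOwl:hasDbXref "OMIM:123456" .

# axiom annotation carrying extra information about the mapping
[] a owl:Axiom ;
  owl:annotatedSource obo:MONDO_0000001 ;
  owl:annotatedProperty oboInOwl:hasDbXref ;
  owl:annotatedTarget "OMIM:123456" ;
  oboInOwl:source "MONDO:equivalentTo" .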

This is potentially confusing, so we do our best to document each product on the OBO page. We want to give a firm “service level agreement” to consumers of the different files.

For Uberon, we have always supported both xrefs and precise logical axioms (the latter downloaded from a separate file). For a while we attempted to communicate the semantics of the xref with a header in the obo file (the ‘treat-xrefs-as-X’ header tags in obo format), but no one much cared about these. Many folks just want xrefs and intuit what to do with them. We will also provide SKOS mappings in Uberon in the future.

So by being pluralistic and providing all 3 we can have our semantic cake and eat it. The downside here is that people may find the plethora of options confusing. There will need to be good documentation on which to use when. We will also need to extend tooling – e.g. add robot commands to generate the different forms, given some source of mappings and rules. This latter step is actually quite difficult due to the variety of ways in which ontology developers manage mappings in their ontologies (some may manage as xrefs; others as external TSVs; others pull them from upstream, e.g. as GO does for interpro2go).

Comments welcome!!! You can also comment on this ticket in the ontology metadata tracker.

Just give me my TSVs already

At the end of the day, a large number of users are confused by all this ontological malarkey and just want a TSV. It’s just 2 columns dude, not rocket science! Why do you people have to make it so complicated?

Unfortunately we don’t do a great job of providing TSVs in a consistent way. GO provides the mappings in a separate TSV-like format whose origins are lost in the mists of time, and that is frankly a bit bonkers. Other ontologies provide various ad-hoc TSVs of mappings, but this is not done consistently across ontologies.

I feel bad about this and would really like to see a standard TSV export to be rolled out more universally. We have an open ticket in ROBOT, comments welcome here: https://github.com/ontodev/robot/issues/312

There are a few things that need to be decided on: do we keep it simple with 2 columns, include labels of concepts, or include additional metadata such as the type of mapping (e.g. the skos predicate)?

TSV? That’s so retro. This OWL is full of angle brackets. Is this 2005? The web is based on JSON

I have a post on that! https://douroucouli.wordpress.com/2016/10/04/a-developer-friendly-json-exchange-format-for-ontologies/

And there is also JSON-LD which will be semantically equivalent to any OWL serialization.

So basically the syntax is not so relevant, the information in the JSON is the same, and we have the same choices of logical axiom, xref, or skos.

Summary

This is more than I intended to write on what seems like a simple matter of standardizing the representation of simple mappings. But like many things, it’s not quite so simple when you scratch beneath the surface. We have differences in how we write ID/URIs, differences in degrees of semantic strength, and a lot of legacy systems that expect things just so which always make things more tedious.

Maybe one day we won’t need mappings as everything will be OBO-ized, there will be no redundancy, and the relationship between any two classes will be explicit in the form of unambiguous axioms. Until that day it looks like we still need mappings, and there will be a need to provide a mix of xrefs, skos, and sometimes overstated OWL logical axioms.

 

Parting thoughts on prefixes

Converting between CURIE strings and full URIs is often necessary for interconversion and integration. Usually this is done by some external piece of code, which can be annoying if you are doing everything in a declarative way in SPARQL. This is because the mapping between a CURIE and a URI is treated as syntactic by RDF tools; the CURIE isn’t a first-class entity (prefix declarations aren’t visible after parsing).

One thing I have started doing is including explicit prefix declarations using the SHACL vocabulary. Here is an example from the ENVO repo, where we are mapping to non-OBO ontologies and classifications like SWEET and LTER:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

<http://purl.obolibrary.org/obo/envo/imports/prefixes.owl>
  a owl:Ontology ;
  rdfs:label "Prefix declarations"@en ;
  rdfs:comment "Prefixes used in xrefs."@en ;
  sh:declare [
    sh:prefix "SWEET" ;
    sh:namespace "http://sweetontology.net/" ;
  ] ;
  sh:declare [
    sh:prefix "LTER" ;
    sh:namespace "http://vocab.lternet.edu?tema=" ;
  ] ;
  sh:declare [
    sh:prefix "MEO" ;
    sh:namespace "http://purl.jp/bio/11/meo/" ;
  ] .

The nice thing about this is that it allows the prefixes to be introspected in SPARQL allowing interconversion between CURIE string literals and URIs. E.g. this SPARQL will generate SKOS triples from xrefs that have been annotated in a particular way:

prefix owl: <http://www.w3.org/2002/07/owl#>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix oio: <http://www.geneontology.org/formats/oboInOwl#>
prefix sh: <http://www.w3.org/ns/shacl#>

CONSTRUCT {
  ?c skos:exactMatch ?xuri
}
WHERE {
  # find xref axiom annotations marked as equivalence mappings
  ?ax owl:annotatedSource ?c ;
      owl:annotatedTarget ?x ;
      owl:annotatedProperty oio:hasDbXref ;
      oio:source "ENVO:equivalentTo" .

  # split the CURIE literal (e.g. "SWEET:foo") into prefix and local part
  bind( strbefore(?x, ":") as ?prefix)

  # look up the namespace for the prefix from the sh:declare blocks
  ?decl sh:prefix ?prefix ;
        sh:namespace ?ns .

  # construct the full URI from namespace plus local part
  bind( strafter(?x, ":") as ?suffix)
  bind( uri(concat(?ns, ?suffix)) AS ?xuri)
}

OntoTip: Single-inheritance principle considered dangerous

This is one post in a series of tips on ontology development, see the parent post for more details.

A Porphyrian tree. With apologies to The Shamen, Ebeneezer Goode

The idea of classification using tree structures can be traced to the 3rd century CE and the Greek philosopher Porphyry's trees depicting Aristotle's categories. The tree has enjoyed a special status ever since, despite the realization that nature can be classified along multiple axes, leading to polyhierarchies or Directed Acyclic Graphs (DAGs).

It is unfortunately still lore in some parts of the ontology community that Multiple Inheritance (MI) is bad and that Single Inheritance (SI) ontologies are somehow purer or better. This is dangerous advice, although there is a kernel of truth to it. Unfortunately this kernel of truth has been misunderstood and miscommunicated, usually with bad results.

In fact, it is good ontology engineering practice to never assert MI, only infer it (see the forthcoming post on ‘Rector Normalization’). Following the Rector normalization methodology, the “primitive skeleton” should ideally form a tree, but the domain ontologies defined using these skeletons will be inferred polyhierarchies, with MI up the wazoo. This has no effect on the end-user, who still consumes the ontology as a polyhierarchy, but it has a huge benefit for the ontology maintainer. It should also be noted that we are only talking about SubClassOf (aka is-a, aka subsumption) relationships here (see further on for notes on part-of).

Mungalls-Ontology-Design-Guidelines (1)

Figure: Simplified example of Rector Normalization: two primitive ontologies combined into compositional classes yielding a polyhierarchy.

Additionally, it is true that some ontologies do engage in is-a overloading, which can lead to problems. The problem is that this kernel of truth has been miscommunicated, and some still cling to a purist notion of SI that is harmful. This miscommunication has resulted in ontologies that deliberately omit important links. Users are often unaware of this fact, and unaware that they are getting incomplete results.

Examples of problematic promulgations of SI

You may be reading this and are wondering why SI in ontologies would even be a thing, given that ever since the 2000 Gene Ontology paper (and probably before this) the notion of an ontology classification as a MI/DAG has been de rigueur. You may be thinking “why is he even wasting his time writing this post, it’s completely obvious”. If so, congratulations! Maybe you don’t need this article, but it may still be important for you to know this as a user, since some of the ontologies you use may have been infected with this advice. If you are part of the ontology community you have probably heard conflicting or confusing advice about SI. I hope to dispel that bad advice here. I wish I didn’t have to write this post but I have seen so much unnecessary confusion caused by this whole issue, I really want to put it to bed forever.

 

Here are some examples of what I consider confusing or conflicting advice:

 

1) This seminal paper from Schulze-Kremer and Smith has some excellent advice, but also includes the potentially dangerous:

Multiple inheritance should be carefully applied to make sure that the resulting subclasses really exist. Single inheritance is generally safer and easier to understand.

This is confusing advice. Of course, any axiom added to an ontology has to be applied carefully. And it's not clear to me that SI is easier to understand. Maintaining an asserted polyhierarchy is certainly hard, but the advice here should be to infer the polyhierarchy.

 

2) The book Building Ontologies with BFO has what I consider conflicted and confusing advice. It starts well, with the recommended advice not to assert MI, but instead to infer it. It then talks about the principle of single inheritance. I hold that elevating SI to a “principle” is potentially dangerous due to the likelihood of miscommunication. It lists 5 purported reasons for adhering to SI:

  1. computational benefits [extremely dubious, computers can obviously handle graphs fine]
  2. Genus-differentia definitions [I actually agree, see future post]
  3. enforces discipline on ontology maintainers to select the “correct” parent [dubious, and talk of “enforcing discipline” is a red flag]
  4. ease of combining ontologies [very dubious]
  5. users find it easier to find the terms they require using an “official” SI monohierarchical version of the ontology [dubious/wrong. this confuses a UI issue with an ontology principle, and conflicts with existing practice].

3) The Foundational Model of Anatomy (FMA) is a venerable ontology in the life sciences; its FAQ contains a very direct statement advocating SI:

2) Why do the FMA authors use single inheritance?

The authors believe that single inheritance assures the true essence of a class on a given context.

I don’t understand what this means, and as I show in the example below, this adherence to the SI is to the detriment of users of the FMA.

4) The disease ontology HumanDO.obo file is single-inheritance, as is the official DO browser.

doid-non-classified.obo: DO’s single asserted is_a hierarchy (HumanDO.obo); this file does not contain logical definitions

Screen Shot 2019-05-10 at 11.07.22 AM

Figure: official browser for DO is SI: lung carcinoma is classified as lung cancer, but no parentage to carcinoma. Users querying the SI version for carcinoma would not get genes associated with lung carcinoma. Most other browsers such as OLS and the Alliance disease pages show the MI version of DO, where this class correctly has two is-a parents.

The SI principle leads to massively incomplete query results; an example from the FMA

I will choose the FMA as an example of why the SI principle in its strict form is dangerous. The following figure shows the classification of proximal phalanx of middle finger (PPoMF) in FMA. The FMA’s provided is-a links are shown as black arrows. I have indicated the missing is-a relationship with a dashed red line.
Mungalls-Ontology-Design-Guidelines (2)
The FMA is missing a relationship here. Of course, many ontologies have missing relationships, but in this case the missing relationship is by design: if the FMA were to add this relationship it would be in violation of one of its stated core principles. In elevating this purist SI principle, the FMA is less useful to users. For example, if I want to classify skeletal phenotypes and I have a phenotype involving the proximal phalanx of the middle finger (PPoMF), a user querying for proximal phalanx of [any] finger (PPoF) will not get the expected results. Unfortunately, there are many cases of this in the FMA, since so many of its 70k+ classes are compositional in nature, and many FMA users may be unaware of these missing relationships. One of the fundamental use cases of ontologies is to be able to summarize data at different levels, and to discard this in the name of purity seems to me to be fundamentally wrongheaded. It is a simple mathematical fact that when you have compositional classes in an ontology, you logically have multiple inheritance.

This exemplifies a situation where SI is the most dangerous, when the following factors occur together: (1) the ontology is SI (2) the ontology has many compositional classes (there are over 70k classes in the FMA) (3) there are no logical definitions / Rector normalization, hence no hope of inferring the complete classification (4) The ontology is widely used.

The FMA is not the only ontology to do this. Others do this, or have confusing or contradictory principles around SI. I have chosen to highlight the FMA as it is a long established ontology, it is influential and often held up as an exemplar ontology by formal ontologists, and it is unambiguous in its promulgation of SI as a principle. It should be stressed that FMA is excellent in other regards, but is let down by allowing dubious philosophy to trump utility.

 

The advice we should be giving

The single most important advice we should be giving is that ontologies must be useful for users, and that they must be complete and correct. Any other philosophical or engineering principle or guideline that interferes with this must be thrown out.

We should also be giving advice on how to build ontologies that are easy to maintain, with good practices to ensure completeness and correctness. One aspect of this is that asserting multiple is-a parents is to be avoided; multiple parents should be inferred instead. In fact this advice is largely subsumed within any kind of tutorial on building OWL ontologies using reasoning. Given this, ontologists should stop making pronouncements on SI or MI at all, as these are prone to misinterpretation. Instead the emphasis for ontology developers should be on good engineering practice.

TL;DR

Multiple inheritance is fine. If you’re building ontologies, add logical definitions to compositional classes, and use reasoning to infer superclasses.
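To illustrate the TL;DR with the lung carcinoma example from the DO discussion above, here is a Turtle sketch (readable hypothetical names stand in for real numeric IDs and a vetted relation):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex: <https://example.org/> .

# assert only a logical definition: a lung carcinoma is a carcinoma
# located in a lung...
ex:lung_carcinoma owl:equivalentClass [
  a owl:Class ;
  owl:intersectionOf ( ex:carcinoma
                       [ a owl:Restriction ;
                         owl:onProperty ex:located_in ;
                         owl:someValuesFrom ex:lung ] )
] .

# ...and let the reasoner do the rest: given an analogous definition for
# ex:lung_cancer, both ex:lung_carcinoma rdfs:subClassOf ex:carcinoma and
# ex:lung_carcinoma rdfs:subClassOf ex:lung_cancer are inferred, never asserted.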

obographviz 0.2.2 released – now with visualization of equivalence cliques

Version 0.2.2 of our obographviz javascript library is up on npm:

https://www.npmjs.com/package/obographviz

obographviz converts ontologies (in OBO Graph JSON) to dot/graphviz with powerful JSON stylesheet configuration. The JSON stylesheets allow for configurable color and sizing of nodes and edges. While graphviz can look a bit static and clunky compared to these newfangled javascript UI libraries, I still find it the most useful way to visualize complex dense ontology graphs.
One of the most challenging things to visualize is the parallel structure of multiple ontologies covering a similar area (e.g. multiple anatomy ontologies, or phenotype ontologies, each covering a different species).
This new release allows for the nesting of equivalence cliques from either xrefs or equivalence axioms. This can be used to visualize the output of algorithms such as our kBOOM (Bayesian OWL Ontology Merging) tool. Starting with a defined set of predicates/properties (for example, the obo “xref” property, or an owl equivalentClasses declaration between two classes), it will transitively expand these until a maximal clique is found for each node. All such mutually connected nodes will then be visualized inside one of these cliques.
An example is shown below (Uberon (yellow) aligned with ZFA (black) and two Allen Brain Atlas ontologies (grey and pink); each ontology has its own color; any set of species-equivalent nodes are clustered together with a box drawn around them. isa=black, part_of=blue). Classes that are unique to a single ontology are shown without a bounding box.
uberon-zfa-xref-example