Never mind the logix: taming the semantic anarchy of mappings in ontologies

Mappings between ontologies, or between an ontology and an ontology-like resource, are a necessary fact of life when working with ontologies. For example, GO provides mappings to external resources such as KEGG, MetaCyc, RHEA, EC, and many others. Uberon (a multi-species anatomy ontology) provides mappings to species-specific anatomy ontologies like ZFA, FMA, and also to more specialized resources such as the Allen Brain Atlases. These mappings can be used for a variety of purposes, such as data integration – data annotated using different ontologies can be ‘cross-walked’ to use a single system.

OxO mappings: mappings between ontologies and other resources, visualized using OxO, with UBERON mapping sets highlighted.

Ontology mapping is a scaling problem: with N resources, each providing its own mappings to the others, we have the potential for N^2-N mapping sets. These are expensive to produce and maintain, inherently error-prone, and frustrating for users when mappings do not globally agree. With the addition of third-party mapping providers, the number of combinations increases further.

One approach is to make an ‘uber-ontology’ that unifies the field, and do all mappings through this (reducing the number of mappings to N, and inferring pairwise mappings). But sometimes this just ends up producing another resource that needs to be mapped. And so the cycle continues.


N^2 vs Uber. With 4 ontologies, we have 12 sets of mappings (each edge denotes 2 sets of mappings, since the two reciprocal mapping sets may not agree). With the Uber approach we reduce this to 4, and can infer the pairwise mappings (inferred mapping sets shown as dotted lines). However, the Uber may become just another resource to be mapped, meaning we now have 20 mapping sets.

Ideally we would have less redundancy and more cooperation, reducing the need for mappings. The OBO Foundry is based on the idea of groups coordinating and agreeing on how a domain is to be carved up, reducing redundancy, and leading to logical relationships (not mappings) between classes. For example, CHEBI and metabolic branches of GO cover different aspects of the same domain. Rather than mapping between classes, we have logical relationships, such as GO:serine biosynthesis has-output CHEBI:serine.

Even within OBO, mappings can be useful. Formally, Uberon is orthogonal to species-specific anatomy ontologies such as ZFA. Classes in Uberon are formally treated as superclasses of ZFA classes, so the link is not really a ‘mapping’. But for practical purposes, it can help to treat these links the same way we treat mappings between an OBO class and an entry in an outside resource, because people want to operate on them in the same way as they do other mappings.

Ontology mapping is a rich and active field, encompassing a large variety of techniques, leveraging lexical properties or structural properties of the ontology to automate or semi-automate mappings. See the Ontology Alignment Evaluation Initiative for more details.

I do not intend to cover alignment algorithms here, rich and interesting a topic as this is (it may be the subject of a future post). I want to deal with the more prosaic issue of how we provide mappings to users, which is not something we do a great job of in OBO. This is tied to the issue of how ontology developers maintain mappings for their ontology, which is also something we don’t do a great job of. I want to restrict this post to how we represent mappings in the ontology files we produce for the community; mappings can also be queried via APIs, but that is another topic.

This may not be the most thrilling topic, but I bet many of you have struggled with and cursed at this issue for a while. If so, your comments are most welcome here.

There are three main ways that mappings are handled in the OWL files we produce (including obo format files; obo format is just another serialization of OWL), which can cause confusion. These are: direct logical axioms, xrefs, and skos. You might ask why we don’t just pick one. The answer is that each serves overlapping but distinct purposes. Also, there are existing infrastructures and toolchains that rely on doing it one way, and we don’t want to break things. But there are probably better ways of doing things; this post is intended to spur discussion on how to do this better.

Expressing Mappings in OWL

Option 1. Direct logical axioms

OWL provides constructs that allow us to unambiguously state the relationship between two things (regardless of whether the things are in the same ontology or two different ones). If we believe that GO:0000010 (trans-hexaprenyltranstransferase activity) and RHEA:20836 are equivalent, we can write this as:

GO:0000010 owl:equivalentClass RHEA:20836

This is a very strong statement to make, so we had better be sure! Fortunately RHEA makes the semantics of each of their entries very precise, with a specific CHEBI ID (and hence a specific chemical structure) for each participant:
Screenshot of the RHEA entry for reaction 20836, with a CHEBI ID for each participant.

If instead we believe the GO class to be broader (perhaps if the reactants were broader chemical entities) we could say

RHEA:20836 rdfs:subClassOf GO:0000010

(there is no superClassOf construct in OWL, so we must express this as the semantically equivalent structural form with the narrower class first).

In the actual case above, though, the relationship is equivalence. Note that GO and RHEA curators have had many extensive discussions about the semantics of their respective resources, so we can be extra sure.

Sometimes the relationship is more nuanced, but if we understand the OWL interpretation of the respective classes we can usually write the relationship in a precise and unambiguous way. For example, the Uberon class heart is species-agnostic, and encompasses the four-chambered heart of mammals as well as simpler structures found in other vertebrates (it doesn’t encompass things like the dorsal vessel of Drosophila, but there is a broader class of circulatory organ for such things). In contrast, the Zebrafish Anatomy (ZFA) class with the same name ‘heart’ only covers Danio.

If you download the Uberon OWL bridging axioms for ZFA, you will see this is precisely expressed as:

ZFA:0000114 EquivalentTo (UBERON:0000948 and part_of some NCBITaxon:7954)

(switching to Manchester syntax here for brevity)

i.e. the ZFA heart class is the same as the Uberon heart class when that heart is part of a Danio. In Uberon we call this axiom pattern a “taxon equivalence” axiom. Note that this axiom entails that the Uberon heart subsumes the ZFA heart.

Venn diagram illustrating that the intersection of the Uberon heart and all things zebrafish is the zebrafish heart.
There are obvious advantages to expressing things directly as OWL logical axioms. We are being precise, and we can use OWL reasoners to both validate and to infer relationships without programming ad-hoc rules.

For example, imagine we were to make an axiom in Uberon that says every heart has two ventricles and two atria (we would not in fact do this, as Uberon is species-agnostic, and this axiom is too strong if the heart is to cover all vertebrates). ZFA may state that the ZFA class for heart has a single one of each. If we then include the bridging axiom above we will introduce an unsatisfiability. We will break ZFA’s heart. We don’t want to do this, as Uberon ♥ ZFA.

As another example, if we make a mistake and declare two distinct GO classes to be equivalent to the same RHEA class, then through the transitivity and symmetry of equivalence, we infer the two GO classes to be equivalent to each other.
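Schematically, with placeholder IDs rather than real ones:

GO:X owl:equivalentClass RHEA:R

GO:Y owl:equivalentClass RHEA:R

By symmetry and transitivity of equivalence, these entail:

GO:X owl:equivalentClass GO:Y

which is presumably not what either set of curators intended.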

Things get even more interesting when multiple ontologies are linked. Consider the following, in which the directed black arrows denote subClassOf, and the thick blue lines indicate equivalence axioms. Note that all mappings/equivalences are locally 1:1. Can you tell which entailments follow from this?

3 way equivalence sets

Answer: everything is entailed to be equivalent to everything else! It’s just one giant clique (this follows from the transitivity property of equivalence; as can be seen, anything can be connected by simply hopping along the blue lines). This is not an uncommon structure, as we often see a kind of “semantic slippage” where concepts shift slightly in concept space, leading to global collapse.

mappings between EC, GO, MetaCyc, Rhea, and KEGG

Above is another, more realistic example, in which we treat the mutual mappings between EC, GO, MetaCyc, RHEA, and KEGG as equivalences. Grey lines indicate mappings provided by the individual sources. Although a mapping nominally means the two entries are the same, this cannot always be the case: as we follow links we traverse up and down the hierarchy, illustrating how ‘semantic slippage’ between similar resources leads to incoherence.

Because we use ROBOT as part of the release process, we automatically detect this using the reason command, and the ontology editor can then fix the mappings.
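For example, a release pipeline might include a step along the following lines (the file names here are illustrative). By default, the reason command will fail and report the offending classes if the merged ontology contains unsatisfiable classes:

robot merge --input go-edit.owl --input rhea-bridge.owl \
  reason --reasoner ELK --output go-reasoned.owl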

Because equivalence means that the logical properties of one class can be substituted for those of the other, users can be confident in data integration processes. If we know the RHEA class has a particular CHEBI chemical as a participant, then the equivalent GO class will have the same CHEBI class as a participant. This is very powerful! We intend to use this strategy in GO. Because RHEA is an expert-curated database of reactions, it doesn’t make sense for GO to replicate that work in the leaf nodes of the GO MF hierarchy. Instead we declare the GO MF and RHEA classes as equivalent, and bring across expert-curated knowledge, such as the CHEBI participants (this workflow is in progress, stay tuned).


Coming soon to GO: Axiomatization of reactions using CHEBI classes via RHEA

So why don’t we just express all mappings as OWL logical axioms and be done with it? Well, it’s not always possible to be this precise, and there may be additional pragmatic concerns. I propose that the following criteria SHOULD or MUST be satisfied when making an OWL logical axiom involving an external resource:

  1. The external resource MUST provide a URI denoting each entity, and that URI SHOULD be minted by the external resource rather than a 3rd party.
  2. The external resource SHOULD have a canonical OWL serialization, maintained by the resource.
  3. That OWL serialization MUST be coherent and SHOULD accurately reflect the intent of the maintainer of that resource. This includes any upper ontology commitments.

The first criterion is fairly mundane but often a tripping point. You may have noticed that in the axioms above I wrote URIs in CURIE form (e.g. GO:0000010). This assumes the existence of prefix declarations in the same document. E.g.

Prefix GO: <http://purl.obolibrary.org/obo/GO_>

Prefix UBERON: <http://purl.obolibrary.org/obo/UBERON_>

Prefix ZFA: <http://purl.obolibrary.org/obo/ZFA_>

Prefix RHEA: <http://rdf.rhea-db.org/>

For any ontology that is part of OBO, or any ontology ‘born natively’ in OWL, the full URI is known. However, if we want to map to a resource like OMIM, do we use the URL that resolves to the website entry? These things often change (at one point they were NCBI URLs). Perhaps we use the identifiers.org URL? Or the n2t.net one? Unless we have consensus on these things, different groups will make different choices, and things won’t link up. It’s an annoying issue, but a very important and expensive one. It is outside the scope of this post, but important to bear in mind. See McMurry et al for more on the perils of identifiers.

The second and third criteria pertain to the semantics of the linked resource. Resources like MESH take great care to state that they are not an ontology, so treating MESH as an ontology of OWL classes connected by subClassOf is not really appropriate (and gets you to some strange places). The same goes for UMLS, which contains cycles in its subClassOf graph. Even in cases where the external resource is an ontology (or believes itself to be), can you be sure they are making the same ontological commitments as you?

This is important: in making an equivalence axiom, you are ‘injecting’ entailments into the external resource, when all resources are combined (i.e. in a global view). This could lead to global errors (i.e. errors that only manifest when all resources are integrated). Or it could be seen as impolite to inject without commitment from the maintainers of the external resource.

Scenario: if I maintain an ontology of neoplasms, and I have axioms stating my neoplasms are BFO material entities, and I make equivalence axioms between my neoplasms and the neoplasm hierarchy in NCIT, I may be ignoring an explicit non-commitment about the nature of the neoplasm hierarchy in NCIT. This could lead to global errors, such as when we see that NCIT classifies Lynch syndrome in the neoplasm hierarchy (see figure). Also, if I were the NCIT maintainers, I might be a bit miffed about other people making ontological commitments on my behalf, especially if I don’t agree with them.

Example of injecting commitments. White boxes indicate current NCIT classes, arrows are OWL subClassOf edges. The yellow ontology insists the NCIT neoplasm is equivalent to its neoplasm, which is committed to be a material entity. The cyan ontology doesn’t care about neoplasms per se, and wants to make the NCIT class for genetic disorder equivalent to its own genetic disease, which is committed to be a BFO disposition (BFO classes are black boxes); dispositions are disjoint with material entities. As a result, the global ontology that results from merging these axioms is incoherent: HNPCC and its subclass Lynch syndrome become unsatisfiable.
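Schematically, the merged axioms look something like this (a reconstruction of the figure with abbreviated names, not the actual published axioms):

# yellow ontology
yellow:Neoplasm SubClassOf: bfo:MaterialEntity
yellow:Neoplasm EquivalentTo: ncit:Neoplasm

# cyan ontology
cyan:GeneticDisease SubClassOf: bfo:Disposition
cyan:GeneticDisease EquivalentTo: ncit:GeneticDisorder

# BFO
bfo:MaterialEntity DisjointWith: bfo:Disposition

# NCIT, as published
ncit:HNPCC SubClassOf: ncit:Neoplasm
ncit:HNPCC SubClassOf: ncit:GeneticDisorder
ncit:LynchSyndrome SubClassOf: ncit:HNPCC

In the merged ontology, ncit:HNPCC is entailed to be a subclass of two disjoint classes, so it, and its subclass ncit:LynchSyndrome, is unsatisfiable.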

Despite these caveats, it can sometimes be really useful to ‘overstate’ and make explicit logical axioms even when the technical or semantic criteria above are not met. These logical axioms can be very powerful for validation and data integration. However, I would recommend not distributing these overstated axioms with the main ontology. Instead, they can be distributed as separate sets of bridging axioms that must be explicitly included, with the bridge axioms and any caveats documented. An example of this is the bridge axioms from Uberon to the MOD anatomy ontologies.

To be clear, this caveat does not apply to cases such as the axioms that connect GO and CHEBI. First, these are not even ‘mappings’ except in the broadest sense. And second, there is clarity and agreement on the semantics of the respective classes, so we can hopefully be sure the axioms make sense and don’t inject unwanted inferences.

In summary, OWL logical axioms are very powerful, which can be very useful, but remember, with great power comes great responsibility.

Option 2. Use oboInOwl hasDbXref property

Before there was OWL, there was OBO-Format. And lo, OBO-Format gave us the xref. Well, not really: the xref was just an example of the long-standing tradition of database cross-references in bioinformatics. In bioinformatics we love minting new IDs. For any given gene you may have its ENSEMBL ID, its MOD or HGNC ID, its OMIM ID, its NCBI Gene/Entrez ID, and a host of other IDs in other databases. The other day I caught my cat minting gene IDs. It’s widespread. This necessitates a system of cross-references. These are rarely 1:1, since there are reasons for representations in different systems to diverge. The OBO-Format xref was for exactly the same use case. When GO started, there were already similar overlapping databases and classifications, including longstanding efforts like EC.

 

In the OWL serialization of OBO-Format (oboInOwl), the xref becomes an annotation assertion axiom using the oboInOwl:hasDbXref property. Many ontologies such as GO, HPO, MONDO, UBERON, ZFA, DO, MP, CHEBI, etc. continue to use the xref as the primary way to express mappings, even though they are no longer tied to obo format for development.

Below is an example of a GO class with two xrefs, in OBO format

[Term]
id: GO:0000010
name: trans-hexaprenyltranstransferase activity
namespace: molecular_function
def: "Catalysis of the reaction: all-trans-hexaprenyl diphosphate + isopentenyl diphosphate = all-trans-heptaprenyl diphosphate + diphosphate." [KEGG:R05612, RHEA:20836]
xref: KEGG:R05612
xref: RHEA:20836
is_a: GO:0016765 ! transferase activity, transferring alkyl or aryl (other than methyl) groups

 

The same thing in the OWL serialization:

<owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0000010">
  <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/GO_0016765"/>
  <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Catalysis of the reaction: all-trans-hexaprenyl diphosphate + isopentenyl diphosphate = all-trans-heptaprenyl diphosphate + diphosphate.</obo:IAO_0000115>
  <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">KEGG:R05612</oboInOwl:hasDbXref>
  <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">RHEA:20836</oboInOwl:hasDbXref>
</owl:Class>

 

Note that the value of hasDbXref is always an OWL string literal (e.g. “RHEA:20836”). This SHOULD always be a CURIE-syntax identifier (i.e. prefixed), although note that expansion to a URI is generally ambiguous. The recommendation is that the prefix be registered somewhere like the GO db-xref prefixes or prefixcommons, but prefix registries may not agree on a canonical prefix (see McMurry et al), leading to the need to repair prefixes when merging data. E.g. one group may use “MIM”, another “OMIM”.
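As a minimal sketch of what such prefix repair looks like in practice (the canonical-prefix choices below are illustrative assumptions, not a community standard):

import sys

# Map non-canonical prefixes to a chosen canonical form.
# These choices are assumptions for illustration, not a standard.
PREFIX_SYNONYMS = {
    "MIM": "OMIM",  # some groups write MIM, others OMIM
}

def normalize_xref(xref: str) -> str:
    """Rewrite the prefix of a CURIE-style xref string to a canonical form."""
    prefix, sep, local_id = xref.partition(":")
    if not sep:
        return xref  # not in prefix:id form, leave untouched
    return PREFIX_SYNONYMS.get(prefix, prefix) + ":" + local_id

assert normalize_xref("MIM:143100") == "OMIM:143100"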

This all poses the question:

So what does xref actually mean?

The short answer is that it can mean whatever the provider wants it to mean. Often it means something like “these two things are the same”, but there is no guarantee a mapping means equivalence in the OWL sense, or is even 1:1. In fact, the xref is often stretched for other use cases. In GO, we have always xreffed between GO classes and InterPro: here the xref means “any protein with this domain will have this function” (which is incredibly useful for functional annotation). Xrefs between GO and Reactome mean “this Reactome entry is an example of this GO class”. Some ontologies like ORDO and MONDO place axiom annotations on their xrefs that attempt to provide additional metadata about the mapping, but this is not standardized. In the past, xrefs were used to connect phenotype classes to anatomy classes (e.g. for “abnormal X” terms); however, this usage has now largely been superseded by more precise logical axioms (see above) through projects like uPheno. In Uberon, an xref can connect equivalent classes, or taxon equivalents. Overall, xref is used very broadly, and can mean many things depending on unwritten rules.

This is SEMANTIC ANARCHY!

never mind the logix: picture of anarchist owl with anarchy symbol. ANARCHY IN THE ONTOLOGY [sex pistols font]

This causes some to throw their hands up in despair. However, many manage to muddle along. Usually xrefs are used consistently within an ontology for any given external resource. Ideally there is clear documentation for each set of mappings, but unfortunately this is not always the case. Many consumers of ontologies may be making errors by propagating information across xrefs that are not one-to-one or not equivalences. In many scenarios this could result in erroneous propagation of gene functions, or of information about a malignant neoplasm to its benign analog, which could have bad consequences.

Increasingly, ontologies publish more precise logical axioms alongside their xrefs (Uberon has always done this), but in practice the xrefs are the more widely used, despite their issues.

How widely are they used? There are currently almost 1.5 million distinct hasDbXref values in OBO. 175 ontologies in OntoBee make use of hasDbXref annotations (this may be an overestimate due to imports). The ontologies with the most xrefs are PR, VTO, TTO, CHEBI, and MONDO (covering distinct proteins, taxa, and chemicals – areas where we would expect high identifier density). These have myriad uses inside multiple database pipelines and workflows, so even if a better solution than the xref is proposed, we can’t just drop xrefs, as this would break all of the things (that would be truly anarchic).

But it must also be acknowledged that xrefs are crusty and have issues, see this comment from Clement Jonquet for one example.

Option 3. Use SKOS vocabulary for mapping properties

In the traditional tale of Goldilocks and the three OWLs, Goldilocks tries three bowls of semantic porridge. The first is too strong, the second too weak, and the third one is just right. If the first bowl is OWL logical axioms, the second bowl is oboInOwl xrefs, the third bowl would be the Simple Knowledge Organization System (SKOS) mapping vocabulary.

SKOS provides a hierarchy of mapping properties:

  • mappingRelation
    • closeMatch
      • exactMatch
    • relatedMatch
    • broadMatch
    • narrowMatch

These can be used to link SKOS concepts across different concept schemes. The exactMatch property is transitive and symmetric, but is still weaker than OWL equivalence as it lacks the property of substitutability. The SKOS properties are axiomatized, allowing entailment. Note that broadMatch and narrowMatch are not transitive, but each entails the corresponding transitive property (broaderTransitive and narrowerTransitive, via broader and narrower).

Using skos mapping relations, we can map between an OBO ontology and MESH without worrying about the lack of OWL semantics for MESH. We can use exactMatch for confident 1:1 mappings, and closeMatch if we are less confident. We don’t have to worry about injecting semantics – it’s just a mapping!
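For example, such a mapping set could be published as Turtle along these lines (the MESH URI scheme and the particular IDs here are illustrative, not a statement of the canonical form):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix UBERON: <http://purl.obolibrary.org/obo/UBERON_> .
@prefix mesh: <http://id.nlm.nih.gov/mesh/> .

UBERON:0000948 skos:exactMatch mesh:D006321 .  # heart
UBERON:0002107 skos:closeMatch mesh:D008099 .  # liver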

Many people are like Goldilocks and find this to be just the right amount of semantics. But note that we can’t express things like our Uberon-ZFA heart relationship precisely here.

There are some other issues. SKOS doesn’t mix well with OWL, as the SKOS properties need to be object properties for the SKOS entailment rules to work, and this induces punning. See also SKOS and OWL, and the paper SKOS with OWL: Don’t be Full-ish! by Simon Jupp (I strongly approve of puns in paper titles). These outline some of the issues. However, for practical purposes I believe it is OK to mix SKOS and OWL.

It should also be noted that unlike oboInOwl xrefs, SKOS mapping relations should only be used between two URIs. This involves selecting a canonical URI for classes in a resource, which is not always easy (see notes on OWL above).

Where do we go now?

As I have hopefully shown, different representations of mappings serve different purposes. In particular, direct OWL axiomatization provides very precise semantics with powerful entailments, but its use sometimes involves overstepping and imposing ontological commitments. And it lacks a way to indicate fuzziness: e.g. we may want to state a mapping that is not 1:1.

OboInOwl xrefs are somewhat surplus to requirements, given that we can express things a little more explicitly using SKOS while remaining just the right side of fuzziness. However, vast swathes of infrastructure ignore SKOS and expect xrefs (usually in OBO format).

I want it all!

So why not include xrefs, skos AND owl direct axioms in the release of an ontology? Well we have started to do this in some cases!

In MONDO, we publish an OWL version that has OWL equivalence axioms connecting to external resources. These are left ‘dangling’. A lot of tools don’t deal with this too well, so we also make an obo version that excludes these logical axioms. However, we use the equivalence axioms in Monarch for consistency checking and data integration.

In both the obo format and owl editions, we include BOTH skos AND xrefs. Thus clients can choose whichever they like. The xrefs are more popular, and are consumed in many pipelines. They are expressed as CURIE-style IDs rather than URIs, which is annoying for some purposes but preferred for others. The skos mappings provide a bit more precision, allowing us to distinguish between close and exact mappings. They also connect IRIs.

Note the xrefs in MONDO also communicate additional information through axiom annotations. These could potentially be put onto both the skos and the OWL axioms but we haven’t done that yet.

This is potentially confusing, so we do our best to document each product on the OBO page. We want to give a firm “service level agreement” to consumers of the different files.

For Uberon, we have always supported both xrefs and precise logical axioms (the latter downloadable as a separate file). For a while we attempted to communicate the semantics of the xrefs with a header in the obo file (the ‘treat-xrefs-as-X’ header tags in obo format), but no one much cared about these. Many folks just want xrefs and intuit what to do with them. We will also provide SKOS mappings in Uberon in the future.

So by being pluralistic and providing all 3 we can have our semantic cake and eat it. The downside here is that people may find the plethora of options confusing. There will need to be good documentation on which to use when. We will also need to extend tooling – e.g. add robot commands to generate the different forms, given some source of mappings and rules. This latter step is actually quite difficult due to the variety of ways in which ontology developers manage mappings in their ontologies (some may manage as xrefs; others as external TSVs; others pull them from upstream, e.g. as GO does for interpro2go).

Comments welcome!!! You can also comment on this ticket in the ontology metadata tracker.

Just give me my TSVs already

At the end of the day, a large number of users are confused by all this ontological malarkey and just want a TSV. It’s just 2 columns dude, not rocket science! Why do you people have to make it so complicated?

Unfortunately we don’t do a great job of providing TSVs in a consistent way. GO provides its mappings in a separate TSV-like format whose origins are lost in the mists of time, and which is frankly a bit bonkers. Other ontologies provide various ad-hoc TSVs of mappings, but this is not done consistently across ontologies.

I feel bad about this and would really like to see a standard TSV export rolled out more universally. We have an open ticket in ROBOT, comments welcome here: https://github.com/ontodev/robot/issues/312

There are a few things to decide on: do we keep it simple with 2 columns, include labels for the concepts, or include additional metadata such as the type of mapping (e.g. the skos predicate)?
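As a strawman, a richer layout might look something like the following (the column names are hypothetical, not an agreed standard; the row reuses the Uberon/ZFA heart example from above):

subject_id      subject_label   predicate        object_id    object_label
UBERON:0000948  heart           skos:closeMatch  ZFA:0000114  heart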

 

TSV? That’s so retro. This OWL is full of angle brackets. Is this 2005? The web is based on JSON.

I have a post on that! https://douroucouli.wordpress.com/2016/10/04/a-developer-friendly-json-exchange-format-for-ontologies/

And there is also JSON-LD, which is semantically equivalent to the RDF serializations of OWL.

So basically the syntax is not so relevant; the information in the JSON is the same, and we have the same choices of logical axiom, xref, or skos.

Summary

This is more than I intended to write on what seems like a simple matter of standardizing the representation of simple mappings. But like many things, it’s not quite so simple when you scratch beneath the surface. We have differences in how we write IDs/URIs, differences in degrees of semantic strength, and a lot of legacy systems that expect things just so, all of which make things more tedious.

Maybe one day we won’t need mappings as everything will be OBO-ized, there will be no redundancy, and the relationship between any two classes will be explicit in the form of unambiguous axioms. Until that day it looks like we still need mappings, and there will be a need to provide a mix of xrefs, skos, and sometimes overstated OWL logical axioms.

 

Parting thoughts on prefixes

Converting between CURIE strings and full URIs is often necessary for interconversion and integration. Usually this is done by some external piece of code, which can be annoying if you are doing everything in a declarative way in SPARQL. This is because the mapping between a CURIE and a URI is treated as purely syntactic by RDF tools: the CURIE isn’t a first-class entity, and prefix declarations aren’t visible after parsing.

One thing I have started doing is including explicit prefix declarations using the SHACL vocabulary. Here is an example from the ENVO repo where we are mapping to non-OBO ontologies and classifications like SWEET, LTER:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

<http://purl.obolibrary.org/obo/envo/imports/prefixes.owl>
  a owl:Ontology ;
  rdfs:label "Prefix declarations"@en ;
  rdfs:comment "Prefixes used in xrefs."@en ;
  sh:declare [
    sh:prefix "SWEET" ;
    sh:namespace "http://sweetontology.net/" ;
  ] ;
  sh:declare [
    sh:prefix "LTER" ;
    sh:namespace "http://vocab.lternet.edu?tema=" ;
  ] ;
  sh:declare [
    sh:prefix "MEO" ;
    sh:namespace "http://purl.jp/bio/11/meo/" ;
  ] .

   

The nice thing about this is that it allows the prefixes to be introspected in SPARQL, allowing interconversion between CURIE string literals and URIs. E.g. this SPARQL will generate SKOS triples from xrefs that have been annotated in a particular way:

prefix owl: <http://www.w3.org/2002/07/owl#>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix oio: <http://www.geneontology.org/formats/oboInOwl#>
prefix sh: <http://www.w3.org/ns/shacl#>

CONSTRUCT {
  ?c skos:exactMatch ?xuri
}
WHERE {
  ?ax owl:annotatedSource ?c ;
      owl:annotatedTarget ?x ;
      owl:annotatedProperty oio:hasDbXref ;
      oio:source "ENVO:equivalentTo" .

  bind( strbefore(?x, ":") as ?prefix)

  ?decl sh:prefix ?prefix ;
        sh:namespace ?ns .

  bind( strafter(?x, ":") as ?suffix)
  bind( uri(concat(?ns, ?suffix)) AS ?xuri)
}
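A query like this can be run as part of a release pipeline, e.g. using ROBOT (the file names here are illustrative):

robot query --input envo-edit.owl --query xref2skos.sparql envo-skos.ttl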

 

A developer-friendly JSON exchange format for ontologies

OWL2 ontologies can be rendered using a number of alternate concrete forms / syntaxes:

  • Manchester Syntax
  • Functional Syntax
  • OWL-XML
  • RDF/XML
  • RDF/Turtle

All of the above are official W3C recommendations. If you aren’t that familiar with these formats and the differences between them, the W3C OWL Primer is an excellent starting point. While all of the above are semantically equivalent ways to serialize OWL (with the exception of Manchester, which cannot represent some axiom types), there are big pragmatic differences in the choice of serialization. For most developers, the most important differentiating factor is support for their language of choice.

Currently, the only language I am aware of with complete support for all serializations is Java, in the form of the OWLAPI. This means that most heavy-duty ontology applications use Java or a JVM language (see previous posts for some examples of JVM frameworks that wrap the OWLAPI).

Almost all programming languages have support for RDF parsers, which is one reason why the default serialization for OWL is usually an RDF one. In theory this makes it more accessible. However, RDF can be a very low-level way to process ontologies. For certain kinds of operations, such as traversing a simple subClassOf hierarchy, it can be perfectly fine. However, even commonly encountered constructs such as “X SubClassOf part-of some Y” are very awkward to handle, involving blank nodes (see the translation here). When it comes to something like axiom annotations (common in OBO ontologies), things quickly get cumbersome. It must be said, though, that using an RDF parser is always better than processing an RDF/XML file using an XML parser. That is two levels of abstraction too low – never do this! You will go to OWL hell. At least you will not be in the lowest circle – that is reserved for people who parse RDF/XML using an ad-hoc Perl regexp parser.
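To make the low-levelness concrete, here is (approximately) how the standard OWL-to-RDF mapping renders “X SubClassOf part-of some Y” in Turtle, with a blank node for the existential restriction (the class and property names are placeholders):

:X rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty :part-of ;
    owl:someValuesFrom :Y
] .

Traversing this in a generic RDF library means matching the blank node and its three triples, rather than simply following a single edge.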

Even in JVM languages, an OWL-level abstraction can be less than ideal for some of the operations people want to do on a biological ontology. These operations include:

  • construct and traverse a graph constructed from SubClassOf axioms between either pairs of named classes, or named-class to existential restriction pairs
  • create an index of classes based on a subset of lexical properties, such as labels and synonyms
  • generate a simple term info page for showing in a web application, with common fields like definition prominently shown, with full attribution for all axioms
  • extract some subset of the ontology

It can be quite involved doing even these simple operations using the OWLAPI. This is not to criticize the OWLAPI – it is an API for OWL, and OWL is in large part a syntax for writing set-theoretic expressions constraining a world of models. This is a bit of a cognitive mismatch for a hierarchy of lexical objects, or a graph-based organization of concepts, which is the standard abstraction for ontologies in bioinformatics.

There are some libraries that provide useful convenience abstractions – this was one of the goals of OWLTools, as well as The Brain. I usually recommend a library such as one of these for bioinformaticians wishing to process OWL files, but it’s not ideal for everyone. It introduces yet another layer, and still leaves out non-JVM users.

For cases where we want to query over ontologies already loaded in a database or registry, there are some good abstraction layers – SciGraph provides a bioinformatician-friendly graph level / Neo4J view over OWL ontologies. However, sometimes it’s still necessary to have a library to parse an ontology fresh off the filesystem with no need to start up a webservice and load in an ontology.

What about OBO format?

Of course, many bioinformaticians are blissfully unaware of OWL and just go straight to OBO format, a format devised by and originally for the Gene Ontology. And many of these bioinformaticians seem reasonably content to continue using this – or at least lack the activation energy to switch to OWL (despite plenty of encouragement).

One of the original criticisms of Obof was its lack of formalism, but now Obof has a defined mapping to OWL, and that mapping is implemented in the OWLAPI. Protege can load and save Obof just as if it were any other OWL serialization, which it effectively is (without the W3C blessing). It can only represent a subset of OWL, but that subset is actually a superset of what most consumers need. So what’s the problem in just having Obof as the bioinformaticians’ format, with ontologists using OWL for heavy-duty ontology lifting?

There are a few:

  • It’s ridiculously easy to create a hacky parser for some subset of Obof, but it’s surprisingly hard to get it right. Many of the parsers I have seen are implemented based on the usual bioinformatics paradigm of ignoring the specs and inferring a format from a few examples. These have a tendency to proliferate, as it’s easier to write your own than to figure out whether someone else’s fits your needs. Even with the better ones, there are always edge cases that don’t conform to expectations. We often end up having to normalize Obof output in certain ways to avoid breaking crappy parsers.
  • The requirement to support Obof leads to cases of tails wagging the dog, whereby ontology producers will make some compromise to avoid alienating a certain subset of users
  • Obof will only ever support a fixed subset of OWL. This is probably more than what most people need, but there are frequently situations where it would be useful to have support for one extra feature – perhaps blank nodes to support one level of nesting in an expression.
  • The spec is surprisingly complicated for what was intended to be a simple format. This can lead to traps.
  • The mapping between CURIE-like IDs and semantic web URIs is awkwardly specified and leads to no end of confusion when the semantic web world wants to talk to the bio-database world. Really we should have reused something like JSON-LD contexts up front. We live and learn.
  • Really, there should be no need to write a syntax-level parser. Developers expect something layered on XML or JSON these days (more so the latter).

What about JSON-LD?

A few years ago I asked on the public-owl-dev list if there were a standard JSON serialization for OWL. This generated some interesting discussion, including a suggestion to use JSON-LD.

I still think that this is the wrong level of abstraction for many OWL ontologies. JSON-LD is great, and we use it for many instance-level representations, but it suffers from the same issue that all RDF layerings of OWL face: they are too low-level for certain kinds of OWL axioms. Also, JSON-LD is a bit too open-ended for some developers, as graph labels are mapped directly to JSON keys, making it hard to map to a fixed, predictable schema.

Another suggestion on the list was to use a relatively straightforward mapping of something like functional/abstract syntax to JSON. This is a great idea and works well if you want to implement something akin to the OWL API for non-JVM languages. I still think that such a format is important for increasing uptake of OWL, and hope to see this standardized.

However, we’re still back at the basic bioinformatics use case, where an OWL-level abstraction doesn’t make so much sense. Even if we get an OWL-JSON, I think there is still a need for an “OBO-JSON”, a JSON that can represent OWL constructs, but with a mapping to structures that correspond more closely to the kinds of operations like traversing a TBox-graph that are common in life sciences applications.

A JSON graph-oriented model for ontologies

After kicking this back and forth for a while we have a proposal for a graph-oriented JSON model for OWL, tentatively called obographs. It’s available at https://github.com/geneontology/obographs

The repository contains the start of documentation on the structural model (which can be serialized as JSON or YAML), plus java code to translate an OWL ontology to obograph JSON or YAML.

Comments are more than welcome, here or in the tracker. But first some words concerning the motivation here.

The overall goal was to make it easy to do the 99% of things that bioinformatics developers usually do, but without throwing the 1% under the bus. Although it is not yet a complete representation of OWL, the initial design allows extension in this direction.

One consequence of this is that the central object is an existential graph (I’ll get to that term in a second). We call this subset Basic OBO Graphs, or BOGs, roughly corresponding to the OBO-Basic subset of OBO Format. The edge model is pretty much identical to every directed graph model out there: a set of nodes and a set of directed labeled edges (more on what can be attached to the edges later). Here is an example of a subset of two connected classes from Uberon:

"nodes" : [
    {
      "id" : "UBERON:0002102",
      "lbl" : "forelimb"
    }, {
      "id" : "UBERON:0002101",
      "lbl" : "limb"
    }
  ],
  "edges" : [
    {
      "subj" : "UBERON:0002102",
      "pred" : "is_a",
      "obj" : "UBERON:0002101"
    }
  ]

So what do I mean by existential graph? This is the graph formed by SubClassOf axioms that connect named classes to either named classes or simple existential restrictions. Here is the mapping (shown using the YAML serialization – if we exclude certain fields like dates, then JSON is a straightforward subset, so we can use YAML for illustrative purposes):

Class: C
  SubClassOf: D

==>

edges:
 - subj: C
   pred: is_a
   obj: D

Class: C
  SubClassOf: P some D

==>

edges:
 - subj: C
   pred: P
   obj: D

These two constructs correspond to is_a and relationship tags in Obof. This is generally sufficient as far as logical axioms go for many applications. The assumption here is that these axioms are complete to form a non-redundant existential graph.

What about the other logical axiom and construct types in OWL? Crucially, rather than following the path of a direct RDF mapping and trying to cram all axiom types into a very abstract graph, we introduce new objects for increasingly exotic axiom types – supporting the 1% without making life difficult for the 99%. For example, AllValuesFrom expressions are allowed, but these don’t get placed in the main graph, as typically these do not get operated on in the same way in most applications.

What about non-logical axioms? We use an object called Meta to represent any set of OWL annotations associated with an edge, node or graph. Here is an example (again in YAML):

  - id: "http://purl.obolibrary.org/obo/GO_0044464"
    meta:
      definition:
        val: "Any constituent part of a cell, the basic structural and functional\
          \ unit of all organisms."
        xrefs:
        - "GOC:jl"
      subsets:
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goantislim_grouping"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gosubset_prok"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goslim_pir"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gocheck_do_not_annotate"
      xrefs:
      - val: "NIF_Subcellular:sao628508602"
      synonyms:
      - pred: "hasExactSynonym"
        val: "cellular subcomponent"
        xrefs:
        - "NIF_Subcellular:sao628508602"
      - pred: "hasRelatedSynonym"
        val: "protoplast"
        xrefs:
        - "GOC:mah"
    type: "CLASS"
    lbl: "cell part"

 

Meta objects can also be attached to edges (corresponding to OWL axiom annotations), or at the level of a graph (corresponding to ontology annotations). Oh, but we avoid the term annotation, as that always trips up people not coming from a deep semweb/OWL background.

As can be seen, commonly used OBO annotation properties get their own top-level tag within a meta object, while other annotations go into a generic object.

BOGs and ExOGs

What about the 1%? Additional fields can be used, turning the BOG into an ExOG (Expressive OBO graph).

Here is an example of a construct that is commonly used in OBOs, primarily used for the purposes of maintaining an ontology, but increasingly used for doing more advanced discovery-based inference:

Class: C
EquivalentTo: G1 and ... and Gn and (P1 some D1) and ... and (Pm some Dm)

Where all variables refer to named entities (C, Gi and Di are classes, Pi are Object Properties)

We translate to:

 nodes: ...
 edges: ...
 logicalDefinitionAxioms:
  - definedClassId: C
    genusIds: [G1, ..., Gn]
    restrictions:
    - propertyId: P1 
      fillerId: D1
    - ...
    - propertyId: Pm 
      fillerId: Dm

Note that the above transform is not expressive enough to capture all equivalence axioms. Again the idea is to have a simple construct for the common case, and fall-through to more generic constructs.

Identifiers and URIs

Currently all the examples in the repo use complete URIs, but this is in progress. The idea is that the IDs commonly used in bioinformatics databases (e.g. GO:0008150) can be supported, with the mapping to URIs made formal and unambiguous through the use of an explicit JSON-LD context, and one or more default contexts. See the prefixcommons project for more on this. See also the prefixes section of the ROBOT docs.
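A minimal sketch of such a context (the prefix expansions are the standard OBO PURL patterns; the particular set of prefixes shown is just illustrative):

{
  "@context": {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_"
  }
}

With this context in scope, “GO:0008150” expands unambiguously to http://purl.obolibrary.org/obo/GO_0008150.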

Documentation and formal specification

There is as yet no formal specification. We are still exploring possible shapes for the serialization. However, the documentation and examples provided in the repository should be sufficient for developers to grok things fairly quickly, and for OWL folks to get a sense of where we are going.

Tools

The GitHub repo also houses a reference implementation in Java, plus an OWL to JSON converter script (reverse is not yet implemented). The java implementation can be used as an object model in its own right, but the main goal here is to make a serialization that is easy to use from any language.

Even without a dedicated API, operations are easy in most languages. For example, in Python, to create a mapping of ids to labels:

import json

# load an obograph document and build a mapping of node ids to labels
with open('foo.json') as f:
    gdoc = json.load(f)

lmap = {}
for g in gdoc['graphs']:
    for n in g['nodes']:
        lmap[n['id']] = n.get('lbl')

Admittedly this particular operation is relatively easy with rdflib, but other operations become more awkward (not to mention the disappointingly slow performance of rdflib).

There are a number of applications that already accept obographs. The central graph representation (the BOG) corresponds to a bbop-graph. This is the existential graph representation we have been using internally in GO and Monarch. The SciGraph API sends back bbop-graph objects as default.

Some additional new pieces of software supporting obographs:

  • noctua-reasoner – a javascript reasoner supporting a subset of OWL-RL, intended for client-side reasoning in browsers
  • obographviz – generation of dot files (and pngs etc) from obographs, allowing many of the same customizations as blipkit

Status

At this stage I am interested in comments from a wider community, both in the bioinformatics world, and in the semweb world.

Hopefully the former will find it useful, and will help wean people off of oboformat (to help this, ontology release tools like ROBOT and OWLTools already or will soon support obograph output, and we can include a json file for every OBO Library ontology as part of the central OBO build).

And hopefully the latter will not be offended too much by the need to add yet another format into the mix. It may even be useful to some parts of the OWL/semweb community outside bioinformatics.

 

Introduction to Protege and OWL for the Planteome project

As a part of the Planteome project, we develop common reference ontologies and applications for plant biology.

Planteome logo

As an initial phase of this work, we are transitioning from editing standalone ontologies in OBO-Edit to integrated ontologies using increased OWL axiomatization and reasoning. In November we held a workshop that brought together plant trait experts from across the world, and developed a plan for integrating multiple species-specific ontologies with a reference trait ontology.

As part of the workshop, we took a tour of some of the fundamentals of OWL, hybrid obo/owl editing using Protege 5, and using reasoners and template-based systems to automate large portions of ontology development.

I based the material on an earlier tutorial prepared for the Gene Ontology editors; it’s available on the Planteome GitHub repository at:

https://github.com/Planteome/protege-tutorial

The perils of managing OWL in a version control system

Background

Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like github provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.

VCS systems also integrate well with Continuous Integration systems.
For example, a CI system can be configured to run a series of checks and even publish, triggered by a git commit/push.

OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ascii-level
diffs were as human readable as possible.

OBO Format is a deprecated format – I recommend groups switch to using
one of the W3C concrete forms of OWL. However, this comes with one
caveat – if the source (editors) version of an ontology is switched
from obo to any other OWL serialization, then human-readable diffs are
lost. Additionally, the non-deterministic serialization of the
ontology results in spurious diffs that not only hamper
human-readability, but also cause bottlenecks in VCS. As an example,
releasing a version of the Uberon ontology can consume over an hour
simply performing SVN operations.

The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.

Methods

I measured the “diffability” of different OWL formats by taking a
mid-size ontology incorporating a wide range of OWL constructs
(Uberon) and measuring the size of diffs between two ontology versions
in relation to the change in the number of axioms.

Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output using
the unix diff command (“wc -l”).
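Schematically, each measurement was simply of the following form (file names illustrative):

diff uberon-prev.ofn uberon-curr.ofn | wc -l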

This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, ... up to
2^16. This was repeated ten times.

The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.

Results

The results can be downloaded HERE, and are plotted in the following
figure.

 

Plot showing size of diffs in relation to number of axioms added/removed

As can be seen there is a marked difference between the two RDF
formats (RDF/XML and Turtle) and the dedicated OWL serializations
(Manchester and Functional), which have roughly similar diffability to
OBO format.

In fact the diff size for RDF formats is both large and roughly
constant, regardless of the number of axioms removed. This appears to
be due to non-determinism when serializing axiom annotations.

This analysis only considers a single ontology, and a single version of the OWL API.

Discussion and Conclusions

Based on these results, it would appear to be a huge mistake to ever
manage an RDF serialization of OWL in a VCS. Using Manchester or
Functional gives superior diffability, with the size of the diff
proportional to the number of axioms changed. OBO format offers human
readability of diffs as well, but this format is limited in
expressivity.

These recommendations are consistent with the size of the file in each format.

The following numbers are for Uberon:

  • obo 11M
  • omn 28M
  • ofn 37M
  • owl 53M
  • ttl 58M

However, one issue here is that RDF-level tools may not accept a
dedicated OWL serialization such as ofn or omn. Most RDF libraries
will however, accept RDF/XML or Turtle.

The ontology manager is then faced with a quandary – cut themselves
off from a segment of the semantic web and have diffs that are
manageable (if not readable) or live with enormous spurious diffs for
the benefits of SW integration.

The best solution would appear to be to manage source versions in a
diffable format, and release in a more voluminous RDF/semweb
format. This is not so different from software management – the users
consume a compiled version of the software (jars, object files, etc.)
and the software is maintained as diffable source. It’s generally
considered bad practice to check in derived products into a VCS.

However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.

The approach I would recommend is therefore to adapt the RDF/XML
generator of the OWL API such that it is deterministic, or to write an
RDF roundtripper that always produces a deterministic
serialization. This should be coupled with ongoing efforts to add
human-readable class labels as comments to enhance readability of diffs.
Ideally the recommended deterministic serialization order would be formally
specified, such that different software (and different versions of the same
software) could adhere to it.

At the same time, we need to be working on analogs of maven and
package management systems in the ontology world.

 

Footnote:

There are some ongoing efforts to mavenize ontologies.
