A developer-friendly JSON exchange format for ontologies

OWL2 ontologies can be rendered using a number of alternate concrete forms / syntaxes:

  • Manchester Syntax
  • Functional Syntax
  • OWL-XML
  • RDF/XML
  • RDF/Turtle

All of the above are official W3C recommendations. If you aren’t that familiar with these formats and the differences between them, the W3C OWL Primer is an excellent starting point. While all of the above are semantically equivalent ways to serialize OWL (with the exception of Manchester, which cannot represent some axiom types), there are big pragmatic differences in the choice of serialization. For most developers, the most important differentiating factor is support for their language of choice.

Currently, the only language I am aware of with complete support for all serializations is Java, in the form of the OWLAPI. This means that most heavy-duty ontology applications use Java or a JVM language (see previous posts for some examples of JVM frameworks that wrap the OWLAPI).

Almost all programming languages have support for RDF parsers, which is one reason why the default serialization for OWL is usually an RDF one. In theory it makes it more accessible. However, RDF can be a very low level way to process ontologies. For certain kinds of operations, such as traversing a simple subClassOf hierarchy, it can be perfectly fine. However, even commonly encountered constructs such as “X SubClassOf part-of some Y” are very awkward to handle, involving blank nodes (see the translation here). When it comes to something like axiom annotations (common in OBO ontologies), things quickly get cumbersome. It must be said though that using an RDF parser is always better than processing an RDF/XML file using an XML parser. This is two levels of abstraction too low. Never do this! You will go to OWL hell. At least you will not be in the lowest circle – this is reserved for people who parse RDF/XML using an ad-hoc Perl regexp parser.
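
To make the blank-node awkwardness concrete, here is a minimal sketch in plain Python of what a consumer has to do after an RDF parser has handed back triples for “X SubClassOf part_of some Y”. The tuples and the `_:b1` blank node label are stand-ins for illustration (a real parser would return URI and blank-node objects), and `existential_edges` is a hypothetical helper name.

```python
# Triples an RDF parser would hand back for "X SubClassOf part_of some Y".
# "_:b1" stands in for the blank node; the short names are shorthand for full URIs.
TRIPLES = [
    ("X", "rdfs:subClassOf", "_:b1"),
    ("_:b1", "rdf:type", "owl:Restriction"),
    ("_:b1", "owl:onProperty", "part_of"),
    ("_:b1", "owl:someValuesFrom", "Y"),
]


def existential_edges(triples):
    """Recover (subject, property, filler) edges hidden behind blank nodes."""
    # Index the triples by subject so each blank node can be looked up.
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o

    edges = []
    for s, p, o in triples:
        # Only subClassOf triples that point at a blank node are of interest.
        if p != "rdfs:subClassOf" or not o.startswith("_:"):
            continue
        restriction = by_subject.get(o, {})
        if restriction.get("rdf:type") != "owl:Restriction":
            continue
        prop = restriction.get("owl:onProperty")
        filler = restriction.get("owl:someValuesFrom")
        if prop is not None and filler is not None:
            edges.append((s, prop, filler))
    return edges


print(existential_edges(TRIPLES))
```

Four triples and a join through a blank node are needed to recover what is conceptually a single labeled edge, and this sketch does not yet handle axiom annotations or nested expressions.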

Even in JVM languages, an OWL-level abstraction can be less than ideal for some of the operations people want to do on a biological ontology. These operations include:

  • construct and traverse a graph constructed from SubClassOf axioms between either pairs of named classes, or named-class to existential restriction pairs
  • create an index of classes based on a subset of lexical properties, such as labels and synonyms
  • Generate a simple term info page for showing in a web application, with common fields like definition prominently shown, with full attribution for all axioms
  • Extract some subset of the ontology

It can be quite involved doing even these simple operations using the OWLAPI. This is not to criticize the OWLAPI – it is an API for OWL, and OWL is in large part a syntax for writing set-theoretic expressions constraining a world of models. This is a bit of a cognitive mismatch for a hierarchy of lexical objects, or a graph-based organization of concepts, which is the standard abstraction for ontologies in bioinformatics.

There are some libraries that provide useful convenience abstractions – this was one of the goals of OWLTools, as well as The Brain. I usually recommend a library such as one of these for bioinformaticians wishing to process OWL files, but it’s not ideal for everyone. It introduces yet another layer, and still leaves out non-JVM users.

For cases where we want to query over ontologies already loaded in a database or registry, there are some good abstraction layers – SciGraph provides a bioinformatician-friendly graph level / Neo4J view over OWL ontologies. However, sometimes it’s still necessary to have a library to parse an ontology fresh off the filesystem with no need to start up a webservice and load in an ontology.

What about OBO format?

Of course, many bioinformaticians are blissfully unaware of OWL and just go straight to OBO format, a format devised by and originally for the Gene Ontology. And many of these bioinformaticians seem reasonably content to continue using this – or at least lack the activation energy to switch to OWL (despite plenty of encouragement).

One of the original criticisms of Obof was its lack of formalism, but now Obof has a defined mapping to OWL, and that mapping is implemented in the OWLAPI. Protege can load and save Obof just as if it were any other OWL serialization, which it effectively is (without the W3C blessing). It can only represent a subset of OWL, but that subset is actually a superset of what most consumers need. So what’s the problem in just having Obof as the bioinformaticians’ format, and ontologists using OWL for heavy duty ontology lifting?

There are a few:

  • It’s ridiculously easy to create a hacky parser for some subset of Obof, but it’s surprisingly hard to get it right. Many of the parsers I have seen are implemented based on the usual bioinformatics paradigm of ignoring the specs and inferring a format based on a few examples. These have a tendency to proliferate, as it’s easier to write your own than to figure out whether someone else’s fits your needs. Even with the better ones, there are always edge cases that don’t conform to expectations. We often end up having to normalize Obof output in certain ways to avoid breaking crappy parsers.
  • The requirement to support Obof leads to cases of tails wagging the dog, whereby ontology producers will make some compromise to avoid alienating a certain subset of users
  • Obof will always support the same subset of OWL. This is probably more than what most people need, but there are frequently situations where it would be useful to have support for one extra feature – perhaps blank nodes to support one level of nesting in an expression.
  • The spec is surprisingly complicated for what was intended to be a simple format. This can lead to traps.
  • The mapping between CURIE-like IDs and semantic web URIs is awkwardly specified and leads to no end of confusion when the semantic web world wants to talk to the bio-database world. Really we should have reused something like JSON-LD contexts up front. We live and learn.
  • Really, there should be no need to write a syntax-level parser. Developers expect something layered on XML or JSON these days (more so the latter).

What about JSON-LD?

A few years ago I asked on the public-owl-dev list if there were a standard JSON serialization for OWL. This generated some interesting discussion, including a suggestion to use JSON-LD.

I still think that this is the wrong level of abstraction for many OWL ontologies. JSON-LD is great and we use it for many instance-level representations, but it suffers from the same issue that all RDF layerings of OWL face: they are too low level for certain kinds of OWL axioms. Also, JSON-LD is a bit too open-ended for some developers, as graph labels are mapped directly to JSON keys, making it hard to map.

Another suggestion on the list was to use a relatively straightforward mapping of something like functional/abstract syntax to JSON. This is a great idea and works well if you want to implement something akin to the OWL API for non-JVM languages. I still think that such a format is important for increasing uptake of OWL, and hope to see this standardized.

However, we’re still back at the basic bioinformatics use case, where an OWL-level abstraction doesn’t make so much sense. Even if we get an OWL-JSON, I think there is still a need for an “OBO-JSON”, a JSON that can represent OWL constructs, but with a mapping to structures that correspond more closely to the kinds of operations like traversing a TBox-graph that are common in life sciences applications.

A JSON graph-oriented model for ontologies

After kicking this back and forth for a while we have a proposal for a graph-oriented JSON model for OWL, tentatively called obographs. It’s available at https://github.com/geneontology/obographs

The repository contains the start of documentation on the structural model (which can be serialized as JSON or YAML), plus Java code to translate an OWL ontology to obograph JSON or YAML.

Comments are more than welcome, here or in the tracker. But first some words concerning the motivation here.

The overall goal was to make it easy to do the 99% of things that bioinformatics developers usually do, but without throwing the 1% under the bus. Although it is not yet a complete representation of OWL, the initial design leaves room to extend things in this direction.

One consequence of this is that the central object is an existential graph (I’ll get to that term in a second). We call this subset Basic OBO Graphs, or BOGs, roughly corresponding to the OBO-Basic subset of OBO Format. The edge model is pretty much identical to every directed graph model out there: a set of nodes and a set of directed labeled edges (more on what can be attached to the edges later). Here is an example of a subset of two connected classes from Uberon:

"nodes" : [
    {
      "id" : "UBERON:0002102",
      "lbl" : "forelimb"
    }, {
      "id" : "UBERON:0002101",
      "lbl" : "limb"
    }
  ],
  "edges" : [
    {
      "subj" : "UBERON:0002102",
      "pred" : "is_a",
      "obj" : "UBERON:0002101"
    }
  ]
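
As a rough illustration of how little code a consumer needs, the following sketch loads the fragment above (wrapped in braces so that it is a complete JSON object) and walks up the is_a edges. The `ancestors` helper is hypothetical and not part of any obographs library.

```python
import json

# The two-node Uberon fragment from above, wrapped in braces to form valid JSON.
GRAPH_JSON = """
{
  "nodes": [
    {"id": "UBERON:0002102", "lbl": "forelimb"},
    {"id": "UBERON:0002101", "lbl": "limb"}
  ],
  "edges": [
    {"subj": "UBERON:0002102", "pred": "is_a", "obj": "UBERON:0002101"}
  ]
}
"""


def ancestors(graph, start, preds=("is_a",)):
    """Return every node id reachable from start over edges with the given predicates."""
    # Build a child -> parents lookup restricted to the requested predicates.
    parents = {}
    for e in graph["edges"]:
        if e["pred"] in preds:
            parents.setdefault(e["subj"], set()).add(e["obj"])

    # Depth-first walk up the graph, guarding against revisiting nodes.
    seen, stack = set(), [start]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph = json.loads(GRAPH_JSON)
labels = {n["id"]: n["lbl"] for n in graph["nodes"]}
print([labels[a] for a in ancestors(graph, "UBERON:0002102")])
```

Passing a different `preds` tuple lets the same walk follow other relationship types alongside is_a.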

So what do I mean by existential graph? This is the graph formed by SubClassOf axioms that connect named classes to either named classes or simple existential restrictions. Here is the mapping (shown using the YAML serialization – if we exclude certain fields like dates then JSON is a straightforward subset, so we can use YAML for illustrative purposes):

Class: C
  SubClassOf: D

==>

edges:
 - subj: C
   pred: is_a
   obj: D

Class: C
  SubClassOf: P some D

==>

edges:
 - subj: C
   pred: P
   obj: D

These two constructs correspond to is_a and relationship tags in Obof. This is generally sufficient as far as logical axioms go for many applications. The assumption here is that these axioms are complete to form a non-redundant existential graph.
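
The two mappings can be sketched as a small function. The tuple encoding of class expressions used here, `("some", P, D)`, is a stand-in invented for this illustration and is not an obographs or OWLAPI structure.

```python
def axiom_to_edge(sub, sup):
    """Map one SubClassOf axiom to an edge dict, or None if it is not part of the existential graph.

    sup is either a class name (str) or a tuple ("some", property, filler).
    """
    # Class: C SubClassOf: D  ==>  is_a edge
    if isinstance(sup, str):
        return {"subj": sub, "pred": "is_a", "obj": sup}
    # Class: C SubClassOf: P some D  ==>  edge labeled with P
    if (isinstance(sup, tuple) and len(sup) == 3
            and sup[0] == "some" and isinstance(sup[2], str)):
        return {"subj": sub, "pred": sup[1], "obj": sup[2]}
    # Anything else (e.g. "only" restrictions) stays out of the main graph.
    return None


axioms = [
    ("C", "D"),
    ("C", ("some", "P", "D")),
    ("C", ("only", "P", "D")),
]
edges = [e for e in (axiom_to_edge(s, o) for s, o in axioms) if e is not None]
print(edges)
```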

What about the other logical axiom and construct types in OWL? Crucially, rather than following the path of a direct RDF mapping and trying to cram all axiom types into a very abstract graph, we introduce new objects for increasingly exotic axiom types – supporting the 1% without making life difficult for the 99%. For example, AllValuesFrom expressions are allowed, but these don’t get placed in the main graph, as typically these do not get operated on in the same way in most applications.

What about non-logical axioms? We use an object called Meta to represent any set of OWL annotations associated with an edge, node or graph. Here is an example (again in YAML):

  - id: "http://purl.obolibrary.org/obo/GO_0044464"
    meta:
      definition:
        val: "Any constituent part of a cell, the basic structural and functional\
          \ unit of all organisms."
        xrefs:
        - "GOC:jl"
      subsets:
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goantislim_grouping"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gosubset_prok"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goslim_pir"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gocheck_do_not_annotate"
      xrefs:
      - val: "NIF_Subcellular:sao628508602"
      synonyms:
      - pred: "hasExactSynonym"
        val: "cellular subcomponent"
        xrefs:
        - "NIF_Subcellular:sao628508602"
      - pred: "hasRelatedSynonym"
        val: "protoplast"
        xrefs:
        - "GOC:mah"
    type: "CLASS"
    lbl: "cell part"

 

Meta objects can also be attached to edges (corresponding to OWL axiom annotations), or at the level of a graph (corresponding to ontology annotations). Oh, but we avoid the term annotation, as that always trips up people not coming from a deep semweb/OWL background.

As can be seen, commonly used OBO annotation properties get their own top level tag within a meta object, but other annotations go into a generic object.
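
One of the operations listed earlier was building an index of classes from labels and synonyms. A minimal sketch of that, using a trimmed copy of the cell part node above as a Python dict; `lexical_index` is a hypothetical helper, and lower-casing the terms is a design choice made for this example only.

```python
# A trimmed copy of the "cell part" node, as it would look after loading the JSON.
node = {
    "id": "http://purl.obolibrary.org/obo/GO_0044464",
    "lbl": "cell part",
    "meta": {
        "synonyms": [
            {"pred": "hasExactSynonym", "val": "cellular subcomponent"},
            {"pred": "hasRelatedSynonym", "val": "protoplast"},
        ]
    },
}


def lexical_index(nodes, synonym_preds=("hasExactSynonym",)):
    """Map lower-cased labels and selected synonyms to the set of node ids that carry them."""
    index = {}
    for n in nodes:
        terms = []
        if n.get("lbl"):
            terms.append(n["lbl"])
        # Nodes without a meta object or without synonyms are simply skipped.
        for syn in n.get("meta", {}).get("synonyms", []):
            if syn["pred"] in synonym_preds:
                terms.append(syn["val"])
        for term in terms:
            index.setdefault(term.lower(), set()).add(n["id"])
    return index


index = lexical_index([node])
print(sorted(index))
```

Because the synonym scope is just a field on each synonym object, choosing which scopes to index is a matter of changing `synonym_preds`.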

BOGs and ExOGs

What about the 1%? Additional fields can be used, turning the BOG into an ExOG (Expressive OBO graph).

Here is an example of a construct that is commonly used in OBOs, primarily used for the purposes of maintaining an ontology, but increasingly used for doing more advanced discovery-based inference:

Class: C
EquivalentTo: G1 and ... and Gn and (P1 some D1) and ... and (Pm some Dm)

Where all variables refer to named entities (C, Gi and Di are classes, Pi are Object Properties)

We translate to:

 nodes: ...
 edges: ...
 logicalDefinitionAxioms:
  - definedClassId: C
    genusIds: [G1, ..., Gn]
    restrictions:
    - propertyId: P1 
      fillerId: D1
    - ...
    - propertyId: Pm 
      fillerId: Dm

Note that the above transform is not expressive enough to capture all equivalence axioms. Again the idea is to have a simple construct for the common case, and fall-through to more generic constructs.
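
To show that nothing is lost for the common case, here is a sketch that turns one logicalDefinitionAxioms entry back into a Manchester-style string. The function name `to_manchester` is an assumption for illustration, and the output is meant for reading rather than as a guaranteed-valid Manchester Syntax document.

```python
def to_manchester(ldef):
    """Render a logical definition as a Manchester-style EquivalentTo expression."""
    # Genus classes come first, followed by one "(P some D)" per restriction.
    parts = list(ldef.get("genusIds", []))
    for r in ldef.get("restrictions", []):
        parts.append("({} some {})".format(r["propertyId"], r["fillerId"]))
    return "Class: {}\nEquivalentTo: {}".format(
        ldef["definedClassId"], " and ".join(parts)
    )


ldef = {
    "definedClassId": "C",
    "genusIds": ["G1"],
    "restrictions": [{"propertyId": "P1", "fillerId": "D1"}],
}
print(to_manchester(ldef))
```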

Identifiers and URIs

Currently all the examples in the repo use complete URIs, but this is in progress. The idea is that the IDs commonly used in bioinformatics databases (e.g. GO:0008150) can be supported, but the mapping to URIs can be made formal and unambiguous through the use of an explicit JSON-LD context, and one or more default contexts. See the prefixcommons project for more on this. See also the prefixes section of the ROBOT docs.
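
The effect of such a context can be sketched with a plain prefix map. The single entry below follows the OBO PURL pattern seen in the GO example earlier; the `expand` and `contract` helpers are hypothetical and do not implement JSON-LD processing in full.

```python
# A tiny prefix map in the spirit of a JSON-LD context (one assumed entry).
PREFIXES = {"GO": "http://purl.obolibrary.org/obo/GO_"}


def expand(curie, prefixes=PREFIXES):
    """Expand a CURIE-like ID to a URI; return the input unchanged if the prefix is unknown."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie


def contract(uri, prefixes=PREFIXES):
    """Contract a URI to a CURIE-like ID; return the input unchanged if no prefix matches."""
    for prefix, base in prefixes.items():
        if uri.startswith(base):
            return prefix + ":" + uri[len(base):]
    return uri


print(expand("GO:0008150"))
print(contract("http://purl.obolibrary.org/obo/GO_0044464"))
```

Keeping the map explicit means that the round trip between database-style IDs and URIs is a lookup rather than a guess.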

Documentation and formal specification

There is as yet no formal specification. We are still exploring possible shapes for the serialization. However, the documentation and examples provided should be sufficient for developers to grok things fairly quickly, and for OWL folks to get a sense of where we are going. Here are some things potentially useful for now:

Tools

The GitHub repo also houses a reference implementation in Java, plus an OWL to JSON converter script (reverse is not yet implemented). The Java implementation can be used as an object model in its own right, but the main goal here is to make a serialization that is easy to use from any language.

Even without a dedicated API, operations are easy with most languages. For example, in Python, to create a mapping of ids to labels:

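import json

# Load the obographs JSON document
with open('foo.json', 'r') as f:
    gdoc = json.load(f)

# Map every node id to its label, across all graphs in the document
lmap = {}
for g in gdoc['graphs']:
    for n in g['nodes']:
        lmap[n['id']] = n.get('lbl')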

Admittedly this particular operation is relatively easy with rdflib, but other operations become more awkward (not to mention the disappointingly slow performance of rdflib).

There are a number of applications that already accept obographs. The central graph representation (the BOG) corresponds to a bbop-graph. This is the existential graph representation we have been using internally in GO and Monarch. The SciGraph API sends back bbop-graph objects as default.

Some additional new pieces of software supporting obographs:

  • noctua-reasoner – a JavaScript reasoner supporting a subset of OWL-RL, intended for client-side reasoning in browsers
  • obographviz – generation of dot files (and pngs etc) from obographs, allowing many of the same customizations as blipkit

Status

At this stage I am interested in comments from a wider community, both in the bioinformatics world, and in the semweb world.

Hopefully the former will find it useful, and it will help wean people off of oboformat (to help this, ontology release tools like ROBOT and OWLTools already support, or will soon support, obograph output, and we can include a json file for every OBO Library ontology as part of the central OBO build).

And hopefully the latter will not be offended too much by the need to add yet another format into the mix. It may even be useful to some parts of the OWL/semweb community outside bioinformatics.

 

Creating an ontology project, an update

In a previous post, I recommended some standard ways of managing the various portions of an ontology project using a version control system like GitHub.

Since writing that post, I’ve written a new utility that makes this task even easier. With the ontology-starter-kit you can generate all your project files and get set up for creating your first release in minutes. This script takes into account some changes since the original post two years ago:

  • Travis-CI has become the de-facto standard continuous integration system for performing unit tests on any project managed in GitHub (for more on CI see this post). The starter-kit will give you a default travis setup.
  • Managing your metadata and PURLs on the OBO Library has changed to a GitHub-based system:
  • ROBOT has emerged as a simpler way of managing many aspects of a release process, particularly managing your external imports

Getting started

To get started, clone or download cmungall/ontology-starter-kit

Currently, you will need:

  • perl
  • make
  • git (command line client)

For best results, you should also download owltools, oort and robot (in the future we’ll have a more unified system)

You can obtain all these by running the install script:

./INSTALL.sh

This should be run from within the ontology-starter-kit directory

Then, from within that directory, you can seed your ontology:

./seed-my-ontology-repo.pl  -d ro -d uberon -u obophenotype -t cnidaria-ontology cnido

 

This assumes that you are building some kind of extension to uberon, using the relation ontology (OBO Library ontology IDs must be used here), that you will be placing this in the https://github.com/obophenotype/ organization, that the repo name is obophenotype/cnidaria-ontology, and that IDs will be of the form CNIDA:nnnnnnn

After running, the repository will be created in the target/cnidaria-ontology folder, relative to where you are. You can move this out to somewhere more convenient.

The script is chatty, and it informs you of how it is copying the template files from the template directory into the target directory. It will create your initial source setup, including a makefile, and then it will use that makefile to create an initial release, going so far as to init the git repo, add and commit files (unless overridden). It will not go as far as to create a repo for you on GitHub, but it provides explicit instructions on what you should do next:


EXECUTING: git status
# On branch master
nothing to commit, working directory clean
NEXT STEPS:
0. Examine target/cnidaria-ontology and check it meets your expectations. If not blow it away and start again
1. Go to: https://github.com/new
2. The owner MUST be obophenotype. The Repository name MUST be cnidaria-ontology
3. Do not initialize with a README (you already have one)
4. Click Create
5. See the section under '…or push an existing repository from the command line'
E.g.:
cd target/cnidaria-ontology
git remote add origin git@github.com:obophenotype/cnido.git
git push -u origin master

Note that it also generates a metadata directory for you, with .md and .yml files you can use for your project on obolibrary (of course, you need to request your ontology ID space first, but you can go ahead and make a pull request with these files).

Future development

The overall system may no longer be necessary in the future, if we get a complete turnkey ontology release system with capabilities similar to analogous tools in software development such as maven.

For now, the Makefile approach is most flexible, and is widely understood by many software developers, but a long standing obstacle has been the difficulty in setting up the Makefile for a new project. The starter kit provides a band-aid here.

If required, it should be possible to set up alternate templates for different styles of project layouts. Pull requests on the starter-kit repository are welcome!

 

 

A lightweight ontology registry system

For a number of years, I have been one of the maintainers of the registry that underpins the list of ontologies at the Open Biological Ontologies Foundry/Library (http://obofoundry.org). I also built some of the infrastructure that creates nightly builds of each ontology, verifying it and providing versions in both obo format and owl.

The original system grew organically and was driven by an ultra-simple file called “ontologies.txt“, stored on Google Code. This grew to be supplemented by a collection of other files for maintaining the list of issue trackers, together with additional metadata to maintain the central OBO builds. The imminent demise of Google Code and the general creakiness and inflexibility of the old system have prompted the search for a new solution. I wanted something that would make it much easier for ontology providers to update their information, but at the same time allow the central OBO group the ability to vet and correct entries. We needed something more sophisticated than a flat key-value list, yet not overly complex. We also wanted something compatible with semantic web standards (i.e. to have an RDF file with a description of every ontology in it, using standard vocabularies and ontologies for the properties and classes). We also wanted it to look a bit nicer than the old site, which was looking decidedly 2000-and-late.

The legacy OBOFoundry site, looking dated and missing key information

What are some of the options here?

  • A centralized wiki, with a page for each ontology, and groups updating their entry on the wiki
  • Each group embeds the metadata about the ontology in a website they maintain. This is then periodically harvested by the central registry. Options for embedding the metadata include microdata and RDFa
  • Each group maintains their own metadata in or alongside their ontology in rdf/owl, and this is periodically harvested
  • Piggy back off of an existing registry, e.g. BioPortal
  • A bespoke registry system, designed from the ground up, with its own relational database underpinning it, etc

These are all good solutions in the appropriate context, but none fitted our requirements precisely. Wikis are best for unstructured or loosely structured narrative text, but attempts to embed structured information inside wikis have been less than satisfactory. The microdata/RDFa approach is interesting, but not practical for us. Microdata is inherently limited in terms of extensibility, and RDFa is complex for many users. Additionally it requires both that groups produce their own web sites (many rely on the OBO Foundry to do this for them), and that we both harvest the metadata and relinquish control. As mentioned previously, it is useful for the OBO repository administrators to have certain fields be filled in centrally (sometimes for policy reasons, sometimes technical). The same concerns underpin the fully decentralized approach, in which every group maintains the metadata directly as part of the ontology, and we harvest this.

Existing registries are built for their own requirements. A bespoke registry system is attractive in many ways, as this can be highly customized, but this can be expensive and we lacked the resources for this.

Solution: GitHub pages and “YAML-LD”

I initially prototyped a solution making use of the GitHub pages framework, driven by YAML files. This can be considered a kind of bespoke system, contradicting what I said above. But rather than roll the entire framework, the system is really just some templates glueing together some existing systems. GitHub support for social coding and YAML helped a lot. The system was very quick to develop and it soon morphed into the actual system to replace the old OBO site.

YAML

YAML is a markup language that superficially resembles the tag-value stanza format we were previously using, but crucially allows for nesting. Here is an example of a snippet of YAML for a cephalopod ontology:

id: ceph
title: Cephalopod Ontology
contact:
  email: cjmungall@lbl.gov
  label: Chris Mungall
description: An anatomical and developmental ontology for cephalopods
taxon:
  id: NCBITaxon:6605
  label: Cephalopod

Note that certain tags have ‘objects’ as their fields, e.g. contact and taxon.

We stick to the subset of YAML that can be represented in JSON, and we can thus define a JSON-LD context, allowing for a direct translation to RDF, which is nice. This part is still being finalized, but the basic idea is that keys like ‘title’ will be mapped to dc:title, and the taxon CURIE will be expanded to the full PURL for cephalopoda.
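
Since the context is still being finalized, the following is only a sketch of the idea: keys are renamed through a context dict and the taxon CURIE is expanded to an OBO PURL. Only the title to dc:title mapping comes from the text; the rest of `CONTEXT`, the `expand_curie` and `apply_context` helpers, and the use of a plain dict in place of a real JSON-LD processor are assumptions.

```python
OBO = "http://purl.obolibrary.org/obo/"

CONTEXT = {
    "title": "dc:title",              # mapping mentioned in the text
    "description": "dc:description",  # assumed for illustration
}

# The cephalopod record from above, as it would look after YAML parsing.
record = {
    "id": "ceph",
    "title": "Cephalopod Ontology",
    "taxon": {"id": "NCBITaxon:6605", "label": "Cephalopod"},
}


def expand_curie(curie):
    """Expand PREFIX:LOCAL to an OBO PURL of the form .../obo/PREFIX_LOCAL."""
    prefix, _, local = curie.partition(":")
    return OBO + prefix + "_" + local


def apply_context(record, context=CONTEXT):
    """Rename keys found in the context and expand the taxon CURIE; leave other keys alone."""
    out = {}
    for key, value in record.items():
        if key == "taxon":
            # Copy the nested object so the input record is not modified.
            value = dict(value, id=expand_curie(value["id"]))
        out[context.get(key, key)] = value
    return out


print(apply_context(record))
```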

The basic idea is to manage each ontology’s metadata as a separate YAML file in a GitHub repository. GitHub features nice builtin YAML rendering, and files can be edited via the GitHub web-interface, which is YAML-aware.

The list of metadata files is here. Note that these are markdown files (the .md stands for markdown, not metadata). YAML can actually be embedded in Markdown, so each file is a mini-webpage for the ontology with the metadata embedded right in there. This is in some ways similar to the microdata/RDFa approach but IMHO much more elegant.
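
Separating the embedded YAML from the rest of the page takes only a few lines. This sketch assumes the files use Jekyll-style front matter, i.e. a YAML block delimited by lines containing only `---` at the top of the file; `split_front_matter` is a hypothetical helper, and the YAML text it returns would still need to be handed to a YAML parser.

```python
def split_front_matter(text):
    """Split a page into (yaml_block, body). Returns ("", text) if there is no front matter."""
    lines = text.splitlines()
    # Front matter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return "", text
    # Look for the closing delimiter and split around it.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text


page = """---
id: ceph
title: Cephalopod Ontology
---
Free-form *markdown* describing the ontology.
"""

meta, body = split_front_matter(page)
print(meta)
print(body)
```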

GitHub Pages

Each markdown file is rendered attractively through the GitHub interface – for example, here is the md file for the environment ontology, rendered using the builtin GitHub md renderer. Note the yaml block contains structured data and the rest of the file can contain any mixture of markdown and HTML which is rendered on the page. We can do better than this using GitHub pages. Using a simple static site generator and templating system (Jekyll/liquid) we can render each page using our own CSS with our own format. For example here is ENVO again, but rendered using Jekyll. Note that we aren’t even running our own webserver here, this is all a service provided for us, in keeping with our desire to keep things lightweight and resource-light.

The entire system consists of a few HTML templates plus a single python script that derives an uber-metadata file that powers the central table (visible on the front page).

Distributed editing

Where the system really shines is the distributed and social editing model. All of this comes for free when hosted on GitHub (in theory GitLab or some other sites should work). Anyone can come along and fork the OBOFoundry.github.io github repository into their own userspace and make edits – they can even do this without leaving their web browser (see the Edit button on the bottom left of every OBO ontology page).

What’s to stop some vandal trashing the registry? Crucially, any edits made by a non-owner remain in their own fork until they issue a Pull Request. After that, someone from OBO will come along and either merge in the pull request, or close it (giving a reason why they did not merge of course). The version control system maintains a full audit trail of this, premature merges can be rolled back, etc.

The task of the OBO team is made easier thanks to Travis-CI, a Continuous Integration system integrated into GitHub. I configured the OBOFoundry github site with a Travis configuration file that instructs Travis to check every pushed commit using an automated test suite – this ensures that people editing their yaml files don’t make syntax errors, or omit crucial metadata.
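
The kind of check such a test suite might run can be sketched as follows. The list of required fields and the `check_record` helper are hypothetical; the actual test suite defines its own rules, and the record is shown as an already-parsed dict.

```python
# Hypothetical required fields; the real test suite defines its own rules.
REQUIRED = ("id", "title", "contact")


def check_record(record):
    """Return a list of human-readable problems found in one ontology metadata record."""
    errors = []
    for key in REQUIRED:
        if key not in record:
            errors.append("missing required field: " + key)
    # If a contact object is present, it should at least carry an email.
    contact = record.get("contact")
    if isinstance(contact, dict) and "email" not in contact:
        errors.append("contact has no email")
    return errors


good = {
    "id": "ceph",
    "title": "Cephalopod Ontology",
    "contact": {"email": "cjmungall@lbl.gov", "label": "Chris Mungall"},
}
print(check_record(good))
print(check_record({"id": "ceph"}))
```

A CI job would run a check like this over every metadata file and fail the build if any list of problems is non-empty.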

Screenshot of GitHub pull request, showing a passed Travis check

I have previously written about the use of Continuous Integration in ontology development – although CI was developed primarily for software engineering products, it works surprisingly well for ontologies and for metadata. This is perhaps not surprising if we consider these engineered artefacts in the way software is.

The whole end-to-end process is documented in this FAQ entry on the site.

The system has been working extremely well and is popular among the ontology groups that contribute their expertise to OBO – before official launch of the new site, we had 31 closed pull requests. Whereas previously a member of the OBO team would have to coordinate with the ontology provider to enter the metadata (a time consuming process prone to errors and backlogs), now the provider has the ability to enter information themselves, with the benefit of validation from Travis and the OBO team.

Other features

The new site has many other improvements over the last one. It’s now possible to distinguish between the ontology sensu the umbrella entity vs individual ontology products or editions. For example, the various editions of Uberon (basic, core, composite metazoan) can each be individually registered and validated. There are also a growing number of properties that can be associated with the ontology, from a twitter handle to logos to custom browsers. Hopefully some of these features will be useful to the OBO community. Of course, the overall look could still be massively improved easily by someone with some web design chops (it’s very bland generic bootstrap at the moment). But this isn’t really the point of this post, which is more about the application of a certain set of technologies to allow a balance between centralization and distributed editing that suits the needs of the OBO Foundry. Leveraging existing services like GitHub pages, Travis and the GitHub fork-and-pull-request model allows us to get more mileage for less effort.

The future of metadata

The new OBO site was inspired in many ways by the system developed by my colleague Jorrit Poelen for the Global Biotic Interactions database (GloBI), in which simple JSON metadata files describing each interaction dataset are provided in individual GitHub repositories. A central system periodically harvests these into a large searchable index, where different datasets are integrated. This is not so different from common practice among software developers, who provide metadata for their project in the form of pom.xml files and package.json files (not out of their love of metadata, but more because this provides a useful service or is necessary for working in a wider ecosystem, and integrating with other software components). As James Malone points out, it makes far more sense to simply pull this existing metadata rather than force developers to register in a monolithic rigid centralized registry. If there are incentives  for providers of any kind of information artefacts (software, ontologies, datasets) to provide richer metadata at source in large already-existing open repositories such as GitHub then it does away with the need to build separately funded large monolithic registries. The new OBO system and the GloBI approach are demonstrating some of these incentives for ontologies and datasets. The current OBO system still has a large centralized aspect, due in part to the nature of the OBO Foundry, but in future may become more distributed.


Chris Mungall

A response to “Unintended consequences of existential quantifications in biomedical ontologies”

In Unintended consequences of existential quantifications in biomedical ontologies, Boeker et al attempt to

…scrutinize the OWL-DL releases of OBO ontologies to assess whether their logical axioms correspond to the meaning intended by their authors

The authors examine existential restriction axioms in a number of ontologies (whose source is in obo-format) and rate them according to the correspondence between the semantics and the presumed author intent. They claim:

  • usability issues with OBO ontologies
  • lack of ontological commitment for several common terms
  • proliferation of domain-specific relations
  • numerous assertions which do not properly describe the underlying biological reality, or are ambiguous and difficult to interpret.

The proposed solution:

The solution is a better anchoring in upper ontologies and a restriction to relatively few, well defined relation types with given domain and range constraints

I think this is an interesting paper, and have great respect for all the authors involved. However, I find some of the claims to be suspect and in need of countering. I do think the paper shows that we need much better ontology and ontology-technology documentation from the OBO Foundry effort (which I am a part of); however, I think the authors have read far too much into the lack of documentation and consequently muddy the issues on a number of matters.

The initial misunderstanding is presented at the start of the paper:

This extract asserts the relationship part_of between the terms ankle and hindlimb in OBO format.

[Term]
id: MA:0000043
name: ankle
relationship: part_of MA:0000026 ! hindlimb

This assertion does not commit to a semantics in terms of the real world entities which are denoted by the terms. It does not allow us to infer that, e.g., all hindlimbs have ankles, or all ankles are part of a hindlimb. Descriptions at this level require some kind of ontological interpretation for the OBO syntax in terms of OWL axioms, as OWL axioms are explicitly quantified

In fact this is incorrect. There is an ontological interpretation for the OBO syntax in terms of OWL axioms (which the authors provide, falsely stating that it is “one such interpretation”):

Ankle subClassOf part_of some Hindlimb

The authors provide links to official documentation confirming that this is the correct interpretation. They then go on to say:

Our mouse limb example could therefore be alternatively translated into at least the following three OWL expressions:

(i) Ankle subClassOf part_of some Hindlimb

(ii) Ankle subClassOf part_of exactly 1 Hindlimb

(iii) Ankle subClassOf part_of only Hindlimb

In fact there is some legitimate confusion over the interpretation of relations due to the impedance mismatch between the treatment of time in the 2005 Relations Ontology paper and what is possible in OWL. But positing additional unwarranted interpretations just muddies the waters. Regardless of the time issue, the RO 2005 paper is quite clear that the relations used should be read in an all-some fashion (i.e. interpretation (i)). This is consistent with the Golbreich/Horrocks translation and its current successor, the obof1.4 specification, both of which are cited by the authors.
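
Spelled out in first-order terms, the three candidate readings make quite different claims. The following is a standard rendering of the OWL semantics for each constructor, with predicate names taken from the ankle example:

```latex
\begin{align*}
\text{(i)}\quad & \forall x\,\bigl(\mathit{Ankle}(x) \rightarrow \exists y\,(\mathit{part\_of}(x,y) \land \mathit{Hindlimb}(y))\bigr) \\
\text{(ii)}\quad & \forall x\,\bigl(\mathit{Ankle}(x) \rightarrow \exists!\, y\,(\mathit{part\_of}(x,y) \land \mathit{Hindlimb}(y))\bigr) \\
\text{(iii)}\quad & \forall x\,\bigl(\mathit{Ankle}(x) \rightarrow \forall y\,(\mathit{part\_of}(x,y) \rightarrow \mathit{Hindlimb}(y))\bigr)
\end{align*}
```

Only (i) is the all-some reading. Note that (iii) is satisfied by an ankle that is part of nothing at all.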

This claimed lack of a standard interpretation informs the main thesis advanced by the authors: the translation of obo-format relationships to existential restrictions is not always what ontology authors intend. In fact they are testing for something stronger, specifically the claim that every such translated existential restriction implies existential dependence, where this is defined:

x dependsG for its existence upon Fs = df

Necessarily, x exists only if some F exists

It is worth noting that the dependence claim they are testing is a strong one: it is stronger than anything in the OWL semantics, and it would be violated by a number of other ontologies, many of them in OWL, such as the NCIt, due to the prevalence of “may_do_X” type relations. This is a subtle point that may escape the casual reader of the paper.

The authors examined axioms in a number of ontologies and evaluated them to see whether there were uses of existential restrictions where this strong dependence claim is not justified. Their test set included ontologies from the OBO library as well as a number of external support ontologies (aka “cross product” ontologies). Most of these ontologies currently use obo-format as their source. They did not invite external domain experts, and they did not check their results with the authors of the ontologies.

The authors provide examples where they believe there are unintended consequences of existential restrictions, based on this strong interpretation. Many of the examples they provide are problematic, as I will illustrate.

They provide this example from the GO:

“Interkinetic nuclear migration SubClassOf

part_of some Cell proliferation in forebrain

The ontological dependence expressed by this assertion is that there are no interkinetic nuclear migration processes without a corresponding cell proliferation in forebrain process. This is obviously false, since interkinetic nuclear migration is a very fundamental cell process, which is not limited to forebrains. An easy fix to this error is the inversion of the expression by using the inverse relationship:

Cell proliferation in forebrain subclassOf

has_part some Interkinetic nuclear migration”

In fact, the GO editors are well aware of the all-some interpretation, and they did intend to say that all instances of IKNM are in a forebrain. This is clear from the textual definition (I have highlighted the relevant part):

[Term]
id: GO:0022027
name: interkinetic nuclear migration
def: “The movement of the nucleus of the ventricular zone cell between the apical and the basal zone surfaces. Mitosis occurs when the nucleus is near the apical surface, that is, the lumen of the ventricle.” [GO_REF:0000021, GOC:cls, GOC:dgh, GOC:dph, GOC:jid, GOC:mtg_15jun06]
is_a: GO:0051647 ! nucleus localization
relationship: part_of GO:0021846 ! cell proliferation in forebrain

The mistake GO has made is giving the class a misleadingly generic label. This kind of thing is not unheard of in the GO – a class is given a label that has a specific meaning to one community when in fact the label is used more generally by a wider community. This is not to understate this kind of mistake – it’s actually quite serious (annotators are meant to always read the definition but unfortunately this rule isn’t always followed). However, the problem is entirely terminological and not in any way related to interpretations of the relationship tag or existential quantification. The creators of this class really did intend to restrict the location to the forebrain (this was confirmed by one of the GO editors listed as provenance for the definition).

The authors are on safer ground with their analysis of structural relations such as has_parent_hydride in CHEBI. I don’t have such a problem here, but it would have been useful to see the claims tested. Can we use a reasoner to detect an inconsistency in the ontology (supplemented with additional axioms)? It seems that the problem is less in the computational properties of the existential restriction, and more in the existential dependence claim (which, remember, is stronger than what is claimed by the OWL semantics).

They also cover what they perceive to be a problem with the use of existential restrictions in conjunction with what BFO calls “realizables”:

A statement such as

Anisotropine methylbromide subclassOf has_role some Anti-ulcer drug

in ChEBI asserts that each and every anisotropine methylbromide molecule has the role of an anti-ulcer drug. However, this role may never be realized for a particular molecule instance, since that molecule may play a different role in the treatment of a different disease, or play no role at all. It is thus problematic to assert an existential dependence between the molecule and the realization of the role (in the treatment of an ulcer)

This is a reasonable philosophical analysis. But are there actually any negative consequences for a user of the ontology or for reasoning? Does it lead to any incorrect inferences? I’m not convinced that an existential restriction is so wrong here. The problems uncovered with this example are really to do with some obscure conditions on BFO roles (i.e. all roles are realized – if roles were like dispositions this would not be a problem), and to be fair to the CHEBI people, they might not have been aware of that when they made the axiom (BFO needs better, more user-friendly documentation).

The same “problem” is uncovered with some of the GO MF cross products, but this time the mistake lies with the authors. They say:

This is particularly apparent in the Gene Ontology molecular function ontology. For example, the statement

tRNA sulfurtransferase subClassOf

has_input some Transfer RNA

asserts a dependency of every instance of tRNA sulfurtransferase on some instance of Transfer RNA. Functions include the possibility that the bearer of a function is never involved in any process that realizes the function, thus may never have input molecules. This kind of error predominates in the Cross Product sample, especially in the cross product ‘GO Molecular Function X ChEBI’. Interrater agreement was low here because of two conflicting positions: (1) the assertion is false, because functions can remain unrealized, or (2) the assertion is true, but the categorization as a function is false, as implied by the suffix “activity”.

In fact this latter interpretation (2) is the correct one. The term in the GO is “tRNA sulfurtransferase activity”. Now, the authors do have a good point here – the ontological commitment of GO towards BFO was unclear here (this is now made more explicit with an ontology of bridging axioms that make “catalytic activity” a subclass of bfo:process – but note this is still controversial with the BFO people). With the correct interpretation, the authors’ statement that “This kind of error predominates in the Cross Product sample” is not supported. The authors have simply jumped to the conclusion that everything in GO MF must be a BFO function based on the name of the ontology (which was named by biologists, not philosophers) and extrapolated from this that the ontology is full of errors, in particular unintended consequences of existential restrictions.

Interestingly, whilst focusing on the inconsequential problem of violation of existential dependence, they missed the real unintended consequences. In fact there is a lurking closed world assumption with all of these GO MF logical definitions, and OWL is open world! Each of the reactions that’s defined in terms of inputs and outputs should explicitly state the cardinality of all participants, and in addition there needs to be a closure axiom to say there are no additional inputs or outputs! Unlike the philosophical problem of positing existential dependence between a function and a (potential) continuant (which is not even a problem here, as the reactions are intended to be interpreted as bfo:processes), this gives results that are empirically wrong! So there was a real, serious example of unintended consequences that was missed.
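
To make the cardinality-plus-closure point concrete, here is a minimal Python sketch that builds a Manchester-style class expression for a reaction. It is an illustration only: the function name, the placeholder class names and the has_output relation name are assumptions made for the example, and this is not how the actual GO MF cross-product definitions are generated.

```python
def closed_reaction_definition(genus, inputs, outputs):
    """Build a Manchester-style expression with exact cardinalities and closure.

    `inputs` and `outputs` map a participant class name to how many
    instances of it take part in the reaction.
    """
    def side(prop, participants):
        # One exact-cardinality conjunct per participant class.
        parts = [f"({prop} exactly {n} {cls})"
                 for cls, n in sorted(participants.items())]
        # Closure axiom: nothing else may fill this property.
        union = " or ".join(sorted(participants)) if participants else "owl:Nothing"
        parts.append(f"({prop} only ({union}))")
        return parts

    conjuncts = [genus] + side("has_input", inputs) + side("has_output", outputs)
    return " and ".join(conjuncts)


# Hypothetical reaction: 1 A + 2 B -> 1 C
print(closed_reaction_definition("reaction", {"A": 1, "B": 2}, {"C": 1}))
```

Without the "only" conjuncts, the open world assumption means a reasoner will happily treat a reaction with additional, unmentioned inputs as an instance of the same class.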

It’s important to bear in mind that some of these cross-product files are separate from the main ontology, not fully official, and not of as high quality as the main ontology. The GO BP chebi ones are maintained by the GO editors and are high quality, but the definitions for GO MF reactions were created by me, semi-automatically, and are not as high quality. This draws attention to the need for better documentation here – if the paper had simply criticized the lack of documentation and clear commitment to BFO they would have been spot on, but instead they use this to make spurious claims.

The authors are on shaky ground with anatomy ontologies again:

Time dependencies

These are commonly expressed in ontologies encoding development or other time-dependent processes. Kinds of participation in such time dependent processes can be difficult to pin down as can the exact ontological dependence between the process and the material entities. The start and end relations are intending to express just such time dependencies to do with the development of anatomical structures.

Pharyngeal endoderm subClassOf

end some Pharyngula:Prim-15

Roof plate rhombomere 5 subClassOf

start some Segmentation:10-13 somites

However, the stages of development mentioned may not be complete before the material entity comes fully into existence. They also may not be complete when the material entity stops existing. It is difficult to claim a processual entity (which extends over time) is ontologically necessary for a material entity to exist (the claim of existential dependence) unless the material entity was a clear output of this process. The solution here is, again, to substitute existential restriction by value restriction, such as

Pharyngeal endoderm subClassOf

end only Pharyngula:Prim-15

It’s difficult to see a real problem here. So what if the stages are not completed? The authors’ problem is not with the generated OWL but with the strong existential dependence claim: “It is difficult to claim a processual entity (which extends over time) is ontologically necessary for a material entity to exist (the claim of existential dependence) unless the material entity was a clear output of this process”. To which I would reply: so what? The authors are testing a claim that is too strong. The OWL is correct, and the OWL does not make any claims about existential dependence; that claim is in the authors’ minds. It’s difficult to see any practical problems with the OWL representation of the ZFA relations here. If the problem is purely philosophical, this should be published in a philosophical journal.

Furthermore, the supposed correct solution to this non-problem is terrible: using a universal restriction means fixing at a single artificial level of granularity for the stages (i.e. we can’t have a property chain end o part_of -> end, which would lead to incorrect inferences).
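
A toy Python sketch shows how the chain and the universal restriction interact. The individual names are hypothetical, and it assumes (as the stage label suggests) that the Prim-15 stage is part of a coarser Pharyngula period. It is not a reasoner; it just applies the chain to a handful of facts and lists the stages that an "end only Prim-15" axiom would then force to be classified as Prim-15.

```python
def close_under_chain(end_facts, part_of_facts):
    """Apply the chain  end o part_of -> end  until nothing new is inferred."""
    closed = set(end_facts)
    changed = True
    while changed:
        changed = False
        for entity, stage in list(closed):
            for part, whole in part_of_facts:
                if part == stage and (entity, whole) not in closed:
                    closed.add((entity, whole))
                    changed = True
    return closed


def forced_fillers(end_facts, entity, asserted_types, filler_class):
    """Stages that 'end only filler_class' would force into filler_class
    even though they were never asserted to belong to it."""
    return sorted(stage for subject, stage in end_facts
                  if subject == entity
                  and filler_class not in asserted_types.get(stage, set()))


# Hypothetical instance data
end_facts = {("endoderm_1", "prim15_1")}
part_of_facts = {("prim15_1", "pharyngula_1")}
asserted_types = {"prim15_1": {"Prim-15"}, "pharyngula_1": {"Pharyngula"}}

with_chain = close_under_chain(end_facts, part_of_facts)
print(forced_fillers(end_facts, "endoderm_1", asserted_types, "Prim-15"))
print(forced_fillers(with_chain, "endoderm_1", asserted_types, "Prim-15"))
```

With the existential form, the extra inferred end fact is harmless; with the universal form, the whole period would have to be a Prim-15 stage.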

Again, there are some real problems with some ontologies hidden here:

  • the start and end relations are undefined. There are a number of people working on this, but admittedly it’s lame there is no standard solution in place yet. We should at least have some better documentation for start/end.
  • there are a lot of hidden assumptions in OBO ontologies regarding how applicable each ontology is for the full range of variation found in nature. The FMA has a whole story about “canonical anatomy” here. For many biological ontologies there’s a shared understanding between authors and users that the ontology represents a simplified reference picture, and in fact there may be the occasional zebrafish pharyngeal endoderm that ends a bit after or a bit before the prim-15 stage. We the OBO Foundry could be doing more to ensure this is explicit

If the authors had highlighted this I would have agreed wholeheartedly and apologized for the current state of affairs. However, this doesn’t really have anything to do with “unintended consequences of existential quantifications”. It’s just a plain case of lack of documentation (not that this is excusable, but the point is that the paper is not titled “lack of documentation in biomedical ontologies”)

Finally, the authors also include a discussion of the use of existential restrictions in conjunction with relations such as lacks_part. This part is fairly reasonable but most of it has been said in other publications. There are actually some subtleties here when it comes to phenotype ontologies, but this is best addressed in a separate post. There is a solution partly in place now involving shortcut relations, but this wasn’t mature when the authors wrote the paper, so fair enough.

Overall I wasn’t convinced by these results. The results were not externally validated (this would have  been easy to do – for example, by contacting the ontology authors or pointing out the error on a tracker) and relied on subjective opinions of the authors (even then they largely did not agree). In addition, the relationships were being tested for existential dependence, and it’s no surprise that continuant-stage relationships don’t conform to this, nor is this a problem.

Based on these results, the authors go on to conclude:

Our scrutiny of the OBO Foundry candidate ontologies and cross products yielded a relatively high proportion of inappropriate usages of simple logical constructors. Only focusing on the proper use of existential restriction in class definitions, we found up to 23% of unintended consequences in these constructions. Many Foundry ontologies are widely used throughout the biomedical domain, and therefore such a high error rate seems surprising.

[my emphasis]. To a casual reader the “23%” sounds terrible. But remember:

  • the authors made mistakes in their evaluation – e.g. with GO
  • the authors over-interpreted in many cases, leading to inflation of numbers
  • in the case of roles, the problem is really in adhering to the BFO definition
  • the unintended consequences are largely philosophical consequences regarding existential dependence rather than consequences that would manifest computationally in reasoning.

The last sentence is telling:

Many Foundry ontologies are widely used throughout the biomedical domain, and therefore such a high error rate seems surprising

Indeed, it would be surprising if this were the case. The ZFA is used every day to power gene expression queries on the ZFIN website. Why haven’t any of their users cottoned on to these errors? It’s a mystery. Or perhaps not. Perhaps the authors are seeing errors when there are in fact none.

Most of the existential restrictions are in fact, contrary to what the authors claim, intended to be existential restrictions. In some cases, such as the “smoking may cause cancer” type examples, the problems only exist on a philosophical level, and even then if you make certain philosophical assumptions. Saying smoking “SubClassOf may_cause some cancer” would be an example of an unintended consequence according to the authors, because it implies that every instance of smoking is existentially dependent on some instance of cancer, which is philosophically problematic (to some). Nevertheless, it’s well known many people working with OWL ontologies use this idiom because there’s no modal operator in OWL, it’s practical and gives the desired inferences. See What causes pneumonia for more discussion.

The authors go on:

We hypothesize that the main and only reason why this has little affected the usefulness of these ontologies up to now is due to their predominant use as controlled vocabularies rather than as computable ontologies. Misinterpretations of this sort can cause unforeseeable side effects once these ontologies are used for machine reasoning, and the use of logic-based reasoning based on biomedical ontologies is increasing with the advent of intelligent tools surrounding the adoption of the OWL language.

In fact it would be easy to test this hypothesis. If it were true, then it should be possible to add biologically justifiable disjointness axioms to the ontology and then use a reasoner to find the unsatisfiable classes that arise from the purported incorrect use of existential restrictions. It is a shame the authors did not take this empirical approach and instead opted for a more subjective ranking approach.
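
As a rough illustration of what such a test looks for, here is a self-contained Python sketch. It is emphatically not a DL reasoner: it only follows named subclass links, domain and range constraints, and pairwise disjointness, and all the class names, the domain and range of has_input, and the disjointness axiom are assumptions made up for the example. A real experiment would load the ontology into an OWL reasoner instead.

```python
def ancestors(cls, parents):
    """Return cls plus every named superclass reachable through `parents`."""
    seen = {cls}
    stack = [cls]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def clashes(types, disjoint_pairs):
    """True if the set of types contains both members of a disjoint pair."""
    return any(a in types and b in types for a, b in disjoint_pairs)


def unsatisfiable_classes(subclass_of, restrictions, domains, ranges, disjoint_pairs):
    parents = {c: set(ps) for c, ps in subclass_of.items()}
    # C SubClassOf (p some F) together with Domain(p) = D entails C SubClassOf D.
    for cls, prop, _filler in restrictions:
        if prop in domains:
            parents.setdefault(cls, set()).add(domains[prop])
    classes = set(parents) | {p for ps in parents.values() for p in ps}
    classes |= {c for c, _, _ in restrictions} | {f for _, _, f in restrictions}

    unsat = {c for c in classes if clashes(ancestors(c, parents), disjoint_pairs)}
    changed = True
    while changed:
        changed = False
        for cls, prop, filler in restrictions:
            # The filler must also satisfy Range(p), if one is declared.
            filler_types = ancestors(filler, parents)
            if prop in ranges:
                filler_types |= ancestors(ranges[prop], parents)
            if filler in unsat or clashes(filler_types, disjoint_pairs):
                # cls and everything below it inherit an impossible restriction.
                for c in classes:
                    if c not in unsat and cls in ancestors(c, parents):
                        unsat.add(c)
                        changed = True
    return sorted(unsat)


# Hypothetical axioms: what happens if the activity is classified as a function?
subclass_of = {"sulfurtransferase_activity": ["function"], "tRNA": ["continuant"],
               "function": [], "process": [], "continuant": []}
restrictions = [("sulfurtransferase_activity", "has_input", "tRNA")]
domains = {"has_input": "process"}
ranges = {"has_input": "continuant"}
disjoint_pairs = [("function", "process")]

print(unsatisfiable_classes(subclass_of, restrictions, domains, ranges, disjoint_pairs))
```

If the existential restrictions really were systematically wrong, this is the kind of clash one would expect a reasoner to report once disjointness axioms are in place.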

In fact the transition from weakly axiomatized ontologies to strongly axiomatized ones is happening, and this is uncovering a lot of problems through reasoning. But the problems being uncovered are generally not due to unintended consequences of existential quantifications. The authors widely miss the mark on their evaluation of the problem.

But the authors do end with an excellent point:

Another problem that hindered our experiments is the unclear ontological commitment of many classes and relations in OBO ontologies, which makes it nearly impossible to reach consensus about the truth-value of many of their axioms. This involves not only ambiguities in ontological interpretation of the classes included in the ontologies but also the proliferation of relations which were poorly defined. To address this shortcoming, ontologies can rely on more expressive languages and axiom systems in which the intended semantics of the relations used are constrained, as is done for the OBO relation ontology

The only objection I have there is to point out that most OBO ontologies don’t use a proliferation of relations – the authors are referring to some of the cross-product extensions here. But point taken – some relations need better definitions,  the cross product files are of variable quality and known issues should be documented.

If this were the thesis of the paper I would have less of a problem. However, the paper makes a stronger claim, namely that 23% of the existential restrictions are wrong and should be changed to some other logical constructor (with the implication that this is due to ambiguities in obo-format). Two (hopefully unintended) consequences of this paper are muddying the waters on the semantics of obo-format and spreading misinformation about the quality of relational statements in the OBO library.

This needs to be countered:

  • The official semantics of OBO-Format are such that every relationship tag is by default interpreted as a subclass of an existential restriction. This can be overridden in some circumstances, but in practice is rarely done, see http://oboformat.org/
  • If something is not clear, ask on the obo-format list rather than basing a paper on a misunderstanding
  • Most obo-format ontology authors have sufficient understanding of the all-some semantics such that you should trust the OWL that comes out the other end.
  • If you don’t trust it, then report the problem to the ontology via their tracker.
  • If you think you’ve uncovered systematic errors in either the underlying ontology or in the translation to OWL, verify there really is an error using the appropriate mechanism (e.g. trackers) before jumping to conclusions and writing a paper falsely claiming a 23% error rate. There are in fact many problems with many ontologies, but unintended consequences of existential quantification are not among them, except in a small number of cases (e.g. CHEBI), which have yet to be shown to cause any harm, but nevertheless need better documentation.
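
The default rule in the first point is simple enough to sketch in a few lines of Python. This is a toy illustration, not the official translation: the function name is made up for the example, it handles only the id and relationship tags of a single stanza, and it ignores everything else in the obo-format specification (including the cases where the default is overridden).

```python
import re

# Matches a tag such as:  relationship: part_of MA:0000026 ! hindlimb
REL_TAG = re.compile(r"^relationship:\s+(\S+)\s+(\S+)(?:\s*!\s*(.*))?$")


def stanza_to_axioms(stanza: str) -> list[str]:
    """Render each relationship tag of one [Term] stanza as an all-some axiom."""
    term_id = None
    pairs = []
    for line in stanza.strip().splitlines():
        line = line.strip()
        if line.startswith("id:"):
            term_id = line[len("id:"):].strip()
            continue
        match = REL_TAG.match(line)
        if match:
            relation, target, _label = match.groups()
            pairs.append((relation, target))
    if term_id is None:
        raise ValueError("stanza has no id tag")
    # Every relationship becomes: SubClassOf <relation> some <target>
    return [f"{term_id} SubClassOf {rel} some {tgt}" for rel, tgt in pairs]


ankle = """
[Term]
id: MA:0000043
name: ankle
relationship: part_of MA:0000026 ! hindlimb
"""
print(stanza_to_axioms(ankle))
```

For the ankle stanza this yields a single axiom of the form "MA:0000043 SubClassOf part_of some MA:0000026", i.e. the all-some reading.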

 

The size of Richard Nixon’s nose, part II

In part 1 we saw how to encode a “big nose phenotype” in such a way that it was neutral with respect to the path the class expression takes through the object graph, subsuming all of:

  • any entity with a nose that has the characteristic of being big
  • anything that exhibits a bigness that is a characteristic of a nose

Thus masking over the distinctions inherent in a formal ontological representation.

We can take this one step further and make our big nose phenotype encompass the nose itself, and its own bigness characteristic. The simplest way to do this would be to make the relation exhibits reflexive – either with a direct reflexivity characteristic, or a local reflexivity general axiom:

Thing SubClassOf exhibits some Self

Unfortunately this runs afoul of DL expressivity constraints. Fortunately, there is a trick at hand. A really gnarly one, but it works.

First of all we have to declare a “fake” relation – let’s append SELF onto the end:

ObjectProperty: :exhibitsSELF
SubPropertyOf: :exhibits

Now we make this reflexive:

Class: owl:Thing
SubClassOf:
:exhibitsSELF some Self

This is legal, as exhibitsSELF is a “simple” object property. Finally, we add the following:

ObjectProperty: :exhibitsSELF
SubPropertyOf: :exhibits

We have sneaked our reflexivity constraint in via a fake relation. It’s a shame that all this obfuscating machinery is required to do this, it would be nice if there were some OWL syntactic sugar.

We can do the same thing for has_part, which is traditionally reflexive:

ObjectProperty: :has_partSELF
SubPropertyOf: :has_part
ObjectProperty: :has_part
SubPropertyOf: :exhibits
Characteristics: Transitive

With that in place we can revisit our test probe classes from last time:

Class: :test1
EquivalentTo: :exhibits some (:big and :characteristic_of some :nose)

Class: :test2
EquivalentTo: :exhibits some (:has_part some (:nose and :has_characteristic some :big))

Class: :test3
EquivalentTo: :exhibits some (:nose and :has_characteristic some :big)

Class: :test4
EquivalentTo: :has_part some (:nose and :has_characteristic some :big)

Now the inferred hierarchy looks like this:

test1=test2=test3
--test4

And if we examine our 3 individuals, we see they classify as follows:

  • nixon : test1, test2, test3
  • nixons_nose : test1, test2, test3
  • nixons_nose_size: test1, test2, test3, test4

So using the exhibits relation we can encode a very general notion of phenotype, that of exhibiting some characteristic, which classifies either the organism, the affected part, or the characteristic itself.

The machinery is rather arcane though, and does require stepping outside the EL subset of OWL. In general, it is of course better to decide on a single form. Unfortunately, no one form satisfies all purposes.

An organism-centric representation is intuitive and simple. If the instances you’re classifying are organisms (e.g. humans with disorders, mutant fruitflies, rare butterfly specimens) then this works very well. It also makes it easy to represent “composite” phenotypes such as “organism with big nose and sweaty palms”. However, if we take this to the step of equating the phenotype with this representation, then we have the curious situation where the organism is the phenotype rather than the organism has a phenotype or phenotypes. If we conceive of phenotype as entirely a class level thing, then we have one organism instantiating multiple phenotypes,  but we should be clear that in this model the relationship between the phenotype instances and organism instance is identity.

An organism-part centric view is also intuitive and simple. For example “nose and has_characteristic some big”. But note the entailments we get from this – an abnormally big nose is part of an abnormal head, but it’s not a subclass of an abnormal head. This is in contrast to the relation we expect to have between the corresponding phenotypes, which is a subclass relationship (on the evidence of all pre-coordinated phenotype ontologies). So this representation is absolutely fine, but we should be clear that we are representing anatomy (perhaps variant anatomy in particular) rather than phenotypes – the relationship between the two may be trivial, and glossed over using the exhibits pattern above. But for modeling phenotypes ontologically we have to be clear about the distinction.

A characteristic-centric view is perhaps the most unintuitive – it asks us to believe in characteristics/qualities as individuals in the world, which is perfectly fine in the BFO ontology, but people may still have a hard time conceiving of this, in contrast to the more “physical” class expressions above. However, it offers distinct advantages. It allows us to talk directly about the characteristic itself – e.g. the dysplastic characteristic of John’s heart was due to the presence of a particular sequence in his Shh genes. If we try and switch this around we get into trouble; e.g. if we equate the “dysplastic heart” phenotype with a class expression “heart and has characteristic dysplastic”, and then say that this phenotype arises from a Shh mutation, we lose the fact that the “dysplasticity” is the characteristic we care about, rather than any of the other characteristics of John’s heart.

One other advantage of the characteristic-centric view is that it corresponds to a more traditional view of phenotypes as the characteristics of an organism.

We adopted the quality/character-centric view for defining the phenotypes in the MP ontology (see our Genome Biology paper) – this worked fairly well when we tested it by recapitulating asserted subclasses via reasoning. However, it worked less well when we used it for HP, which includes many composite phenotypes – e.g. “large flat nose” – these cannot be equated to any single characteristic, it is in fact two characteristics. We can get around this by equating a phenotype with either an individual characteristic, or a collection of (presumably related) characteristics. More of this on the next post on this matter….