The perils of managing OWL in a version control system

Background

Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like GitHub provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.

VCSs also integrate well with Continuous Integration (CI) systems.
For example, a CI system can be configured to run a series of checks and even publish, triggered by a git commit/push.

OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ASCII-level
diffs were as human readable as possible.

OBO Format is a deprecated format – I recommend groups switch to using
one of the W3C concrete forms of OWL. However, this comes with one
caveat – if the source (editors) version of an ontology is switched
from obo to any other OWL serialization, then human-readable diffs are
lost. Additionally, the non-deterministic serialization of the
ontology results in spurious diffs that not only hamper
human-readability, but also cause bottlenecks in VCS. As an example,
releasing a version of the Uberon ontology can consume over an hour
simply performing SVN operations.

The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.

Methods

I measured the “diffability” of different OWL formats by taking a
mid-size ontology incorporating a wide range of OWL constructs
(Uberon) and measuring the size of diffs between two ontology versions
in relation to the change in the number of axioms.

Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output by the
unix diff command, counted with “wc -l”.

This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, ... up to
2^16. This was repeated ten times.
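
The measurement itself can be sketched in a few lines of Python. This is an illustration only, not the script used for the analysis: it uses Python's difflib rather than the unix diff command, so absolute counts will not match “diff | wc -l” exactly, and the function names are hypothetical.

```python
import difflib
from itertools import islice


def removal_counts(max_power=16):
    """Numbers of axioms removed at each step: 1, 2, 4, ..., 2**max_power."""
    return [2 ** k for k in range(max_power + 1)]


def diff_size(old_text, new_text):
    """Count added or removed lines between two serializations.

    A unified diff with zero context lines is used. The two file header
    lines are skipped and '@@' hunk headers are not counted.
    """
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), n=0, lineterm=""
    )
    # The first two lines of a non-empty unified diff are the '---'/'+++' headers.
    body = islice(diff, 2, None)
    return sum(1 for line in body if line.startswith(("+", "-")))
```

In the real experiment the two texts would be the saved ontology files before and after axiom removal, one pair per format.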

The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.

Results

The results can be downloaded HERE, and are plotted in the following
figure.

 

Plot showing size of diffs in relation to number of axioms added/removed


As can be seen there is a marked difference between the two RDF
formats (RDF/XML and Turtle) and the dedicated OWL serializations
(Manchester and Functional), which have roughly similar diffability to
OBO format.

In fact the diff size for the RDF formats is both constant and large
regardless of the number of axioms changed. This appears to be due to
non-determinism when serializing axiom annotations.

This analysis only considers a single ontology, and a single version of the OWL API.

Discussion and Conclusions

Based on these results, it would appear to be a huge mistake to ever
manage an RDF serialization of OWL in a VCS. Using Manchester or
Functional gives superior diffability, with the size of the diff
proportional to the number of axioms changed. OBO format offers human
readability of diffs as well, but this format is limited in
expressivity.

These recommendations are consistent with the size of the file in each format.

The following numbers are for Uberon:

  • obo 11M
  • omn 28M
  • ofn 37M
  • owl 53M
  • ttl 58M

However, one issue here is that RDF-level tools may not accept a
dedicated OWL serialization such as ofn or omn. Most RDF libraries
will, however, accept RDF/XML or Turtle.

The ontology manager is then faced with a quandary – cut themselves
off from a segment of the semantic web and have diffs that are
manageable (if not readable) or live with enormous spurious diffs for
the benefits of SW integration.

The best solution would appear to be to manage source versions in a
diffable format, and release in a more voluminous RDF/semweb
format. This is not so different from software management – the users
consume a compiled version of the software (jars, object files, etc)
and the software is maintained as diffable source. It’s generally
considered bad practice to check derived products into a VCS.

However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.

The approach I would recommend is therefore to adapt the RDF/XML
generator of the OWL API such that it is deterministic, or to write an
RDF roundtripper that always produces a deterministic
serialization. This should be coupled with ongoing efforts to add
human-readable class labels as comments to enhance readability of diffs.
Ideally the recommended deterministic serialization order would be formally
specified, such that different software (and different versions of the same
software) could adhere to it.
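
A toy simulation shows why a fixed serialization order matters. It is a sketch under simplifying assumptions: each axiom is a single made-up line of text, “deterministic” simply means sorted, and non-determinism is modelled as a random shuffle. None of this reflects how the OWL API actually orders its output.

```python
import difflib
import random


def changed_lines(old_lines, new_lines):
    """Count added or removed lines between two lists of lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


def serialize(axioms, deterministic, rng=None):
    """Write one axiom per line, either sorted or in arbitrary order."""
    lines = list(axioms)
    if deterministic:
        return sorted(lines)
    (rng or random).shuffle(lines)
    return lines


# Toy ontology: 1000 made-up axioms, then a single axiom removed.
before = {f"SubClassOf(:C{i} :C{i // 2})" for i in range(1, 1001)}
after = before - {"SubClassOf(:C500 :C250)"}

rng = random.Random(0)
stable = changed_lines(serialize(before, True), serialize(after, True))
unstable = changed_lines(
    serialize(before, False, rng), serialize(after, False, rng)
)
print(stable)    # 1: only the removed axiom shows up
print(unstable)  # much larger, even though only one axiom changed
```

With the sorted output the diff tracks the real change; with the shuffled output the diff size is dominated by reordering.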

At the same time, we need to be working on analogs of maven and
package management systems in the ontology world.

 

Footnote:

Some ongoing efforts to mavenize ontologies:

Updates:

 

 

 

 

 


Querying for connections between the GO and FMA

Can we query for connections between FMA and GO? This should be
possible by using a combination of

  • GO
  • Uberon
  • FMA
  • Axioms linking GO and Uberon (x-metazoan-anatomy)
  • Axioms linking FMA and Uberon (uberon-to-fma)

This may seem like more components than are necessary. However,
remember that GO is a multi-species ontology, and “heart development”
in GO covers not only vertebrate hearts, but also (perhaps
controversially) drosophila “hearts”. In contrast, the FMA class for
“heart” represents a canonical adult human heart. This is why we have
to go via Uberon, which covers similar taxonomic territory to GO. The
uberon class called “heart” covers all hearts.

GO to metazoan anatomical structures

http://purl.obolibrary.org/obo/go/extensions/x-metazoan-anatomy.owl contains axioms of the form:


'heart morphogenesis' EquivalentTo 'anatomical structure morphogenesis' and
'results in morphogenesis of' some uberon:heart

(note that sub-properties of ‘results in developmental progression of’
are used here)

Generic metazoan anatomy to FMA

http://purl.obolibrary.org/obo/uberon/bridge/uberon-bridge-to-fma.owl contains axioms of the form:


fma:heart EquivalentTo uberon:heart and part_of some 'Homo sapiens'

GO to FMA

Note that there is no existential dependence between go ‘heart
development’ and fma:heart. This is as it should be – if there were no
human hearts then there would still be heart development
processes. This issue is touched on in Chimezie Ogbuji’s presentation at DILS 2012.

This lack of existential dependence has consequences for querying
connections. An OWL query for:

?p SubClassOf ‘results in developmental progression of’ some ?u

will return GO-Uberon connections only.

We must perform a join in order to get what we want:

?p SubClassOf ‘results in developmental progression of’ some ?u,
?a SubClassOf ?u,
?a part_of some ‘Homo sapiens’
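
To make the shape of the join concrete, here is a minimal Python sketch over toy data. The class names and links are illustrative only and are not extracted from GO, Uberon or FMA; the sketch also assumes the subclass pairs have already been computed by a reasoner.

```python
# ?p SubClassOf 'results in developmental progression of' some ?u
progression = {
    ("GO heart development", "uberon:heart"),
    ("GO fin morphogenesis", "uberon:fin"),
}
# ?a SubClassOf ?u (assumed to be reasoner output, toy data)
subclass_of = {
    ("fma:heart", "uberon:heart"),
}
# ?a part_of some 'Homo sapiens'
part_of_human = {"fma:heart"}


def go_to_fma(progression, subclass_of, part_of_human):
    """Join the three patterns on the shared variables ?u and ?a."""
    return sorted(
        (p, a)
        for p, u in progression
        for a, parent in subclass_of
        if parent == u and a in part_of_human
    )


print(go_to_fma(progression, subclass_of, part_of_human))
# [('GO heart development', 'fma:heart')]
```

The fin morphogenesis row drops out because nothing in the toy FMA data is a subclass of the fin class.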

Actually executing this query is not straightforward. Ideally we would
have a way of using OWL syntax, such as the above. To get complete
results, either EL++ or RL reasoning is required. In the next post I’ll present some possible options for issuing this query.

ubermouth

Jim Balhoff has written a nice image depiction plugin for Protege4. Here it is in action showing uberon’s mouth.


screenshot of uberon/depictions.owl using image depiction plugin

The plugin assumes that images are represented as individuals of type foaf:depicts some <Class>. For example:

Individual: wc:thumb/0/06/Mouth_illustration-Otis_Archives.jpg/180px-Mouth_illustration-Otis_Archives.jpg
Types: foaf:depicts some :UBERON_0000165

The plugin is available from github. You can try it on the uberon depictions owl file, http://purl.obolibrary.org/obo/uberon/depictions.owl.

Images are stored in a somewhat hacky way in uberon right now – as xrefs. There is a hacky way to translate them into the correct OWL – in future they will be stored directly with explicit OWL semantics. We will also include additional metadata about the image; for example (with IDs replaced by labels):

Individual: wc:180px-Mouth_illustration-Otis_Archives.jpg
Types: depicts some ('mouth' and part_of some 'Homo sapiens')
Annotations: description "Medical illustration of a human mouth by Duncan Kenneth Winter. Part of an unpublished manuscript on medical illustration written by Winter."

Individual: uberon/images/lamprey_sucker_rosava_3238889218.jpg
Types: depicts some ('mouth' and part_of some Petromyzontida)

Jim’s plugin makes use of the reasoner, so these species-specific depictions would show up in the generic uberon “mouth” class (unfortunately Elk0.2 doesn’t support individuals, and a fast reasoner like Elk is required for Uberon – however, Elk0.3, due very soon, should support individuals).

Many of the images in uberon were derived automatically by dbpedia SPARQL queries and may not have been verified. Whilst probably SFW, some of the depictions may be a little racy, so exercise caution whilst poking around the nether regions! The images in wikipedia are obviously human centric – it would be nice to have more sources for other animals. If anyone knows any sources that would be easy to mark up let me know.

Elk disjoint hack

Elk is a blindingly fast EL++ reasoner. Unfortunately, it doesn’t yet support the full EL++ profile – in particular it lacks disjointness axioms. This is unfortunate, as these kinds of axioms are incredibly useful for integrity checking. See the methods section of the Uberon paper for some details on how partwise disjointness axioms were created.

However, Elk does support intersection and equivalence. This means we should be able to perform a translation:

DisjointClasses(x1, x2, …, xn) ⇒
EquivalentClasses(owl:Nothing IntersectionOf(xi xj)) for all i<j<=n
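
As a rough sketch of the translation (not the OWLTools implementation), the pairwise expansion can be written in Python as below. Axioms are emitted as plain strings in a functional-style notation, and the class names are made up for illustration.

```python
from itertools import combinations


def disjoints_to_equivalents(classes):
    """Expand DisjointClasses(x1 ... xn) into one axiom per unordered pair."""
    return [
        f"EquivalentClasses(owl:Nothing ObjectIntersectionOf({a} {b}))"
        for a, b in combinations(classes, 2)
    ]


for axiom in disjoints_to_equivalents([":reasoner", ":animal", ":plant"]):
    print(axiom)
```

A DisjointClasses axiom over n classes yields n(n-1)/2 equivalence axioms, so the three classes here give three axioms.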

I asked about this on the Elk mailing list; see “Satisfiability checking and DisjointClasses axioms”.

The problem is that whilst Elk supports intersection and equivalence, it doesn’t support Nothing. This means that there may be corner cases in which it doesn’t work.

Proper disjointness support may be coming in the next version of Elk, but it’s been a few months so I decided to go ahead and implement the above translation in OWLTools (also available in Oort).

If we have an ontology such as foo.owl:

Ontology: <http://example.org/x.owl>

Class: :reasoner
Class: :animal
  DisjointWith: :reasoner

Class: :elk
  SubClassOf: :reasoner, :animal

We can translate it using owltools:

owltools foo.owl --translate-disjoints-to-equivalents -o file://`pwd`/foo-x.owl

Remember, ordering of arguments is significant in owltools. Make sure you translate *after* the ontology is loaded.

We can then load this into Protege and reason over it using Elk. As expected, “elk” is unsatisfiable.

You can also do the checking directly in owltools:

owltools foo.owl --translate-disjoints-to-equivalents --run-reasoner -r elk -u

The “-u” option will check for unsatisfiable classes and exit with a nonzero code if any are found, allowing this to be used within a CI system like Jenkins (see this previous post).

You can also use this transform within Oort (command line version only):

ontology-release-runner --translate-disjoints-to-equivalents --reasoner elk foo.owl

Remember, there are corner cases where this translation will not work. Nevertheless, this can be useful as part of an “early warning” system, backed up by slower guaranteed checks running in the background with HermiT or some other reasoner.

Perhaps the ontologies I work with have a simpler structure, but so far I have found this strategy to be successful, identifying subtle part-disjointness problems, and not giving any false positives. There don’t appear to be any scalability problems, with Elk being its usual zippy self even when uberon is loaded with ncbitaxon/taxslim and taxon constraints translated into Nothing-axioms (~3000 disjointness axioms).

 

Taxon constraints in OWL

A number of years ago, the Gene Ontology database included such curiosities as:

  • A slime mold gene that had a function in fin morphogenesis
  • Chicken genes that were involved in lactation

These genes would be pretty fascinating, if they actually existed. Unfortunately, these were all annotation errors, arising from a liberal use of inference by sequence similarity.

We decided to adopt a formalism specified by Wacek Kusnierczyk[1], in which we placed taxon constraints on classes in the ontology, and used these to detect annotation errors[2].

The taxon constraints make use of two relations, only_in_taxon and never_in_taxon.

 

You can see examples of usage in GO either in QuickGO (e.g. lactation), or by opening the x-taxon-importer.owl ontology in Protege. This ontology is used in the GO Jenkins environment to detect internal inconsistencies in the ontology.

The same relations are also in use in another multi-species ontology, Uberon[3].

 

In uberon, the constraints are used for ontology consistency checking, and to provide taxon subsets – for example, aves-basic.owl, which excludes classes such as mammary gland, pectoral fin, etc.

Semantics of the shortcut relations

In the Deegan et al paper we described a rule-based procedure for using the taxon constraint relations. This has the advantage of being scalable over large taxon ontologies and large gene association sets. But a better approach is to encode the constraints directly as OWL axioms and use a reasoner. For this we need to choose a particular way of representing a taxonomy.

Both relations make use of a class-based representation of a taxonomy such as ncbitaxon.owl or a subset such as taxslim.owl.

We can treat the taxon constraint relations as convenient shortcut relations which ‘expand’ to OWL axioms that capture the intended semantics in terms of a standard ObjectProperty “in_organism”. For now we leave in_organism undefined, but the basic idea is that for anatomical structures and cell components “in_organism” is the part_of parent that is an organism, whereas for processes it is the organism that encodes the gene products that execute the process.

In fact there are two ways to expand to the “in_organism” class axioms:

The more straightforward way:

?X only_in_taxon ?Y ===> ?X SubClassOf in_organism only ?Y
?X never_in_taxon ?Y ===> ?X SubClassOf in_organism only not ?Y

To achieve the desired entailments, it is necessary for sibling taxa to be declared disjoint (e.g. Eubacteria DisjointWith Eukaryota). Note that these disjointness axioms are not declared in the default NCBITaxon translation.

A different way which has the advantage of staying within the OWL2-EL subset:

?X only_in_taxon ?Y ===> ?X SubClassOf in_organism some ?Y
?X never_in_taxon ?Y ===> ?X DisjointWith in_organism some ?Y

This requires all sibling nodes (A,B) in the NCBI taxonomy to have a
General Axiom:

in_organism some ?A DisjointWith in_organism some ?B

These general axioms are automatically generated and available in taxslim-disjoint-over-in-taxon.owl
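
Both expansions are mechanical, so they are easy to sketch in Python. The functions below are hypothetical helpers that produce the axioms as plain strings in the same notation used above. They are not how the published taxslim-disjoint-over-in-taxon.owl file was generated.

```python
from itertools import combinations


def expand_direct(subject, relation, taxon):
    """Expansion with universal restrictions (needs disjoint sibling taxa)."""
    if relation == "only_in_taxon":
        return f"{subject} SubClassOf in_organism only {taxon}"
    if relation == "never_in_taxon":
        return f"{subject} SubClassOf in_organism only not {taxon}"
    raise ValueError(f"unknown relation: {relation}")


def expand_el(subject, relation, taxon):
    """Expansion that stays within the OWL2-EL subset."""
    if relation == "only_in_taxon":
        return f"{subject} SubClassOf in_organism some {taxon}"
    if relation == "never_in_taxon":
        return f"{subject} DisjointWith in_organism some {taxon}"
    raise ValueError(f"unknown relation: {relation}")


def sibling_general_axioms(siblings):
    """One general axiom per unordered pair of sibling taxa."""
    return [
        f"in_organism some {a} DisjointWith in_organism some {b}"
        for a, b in combinations(siblings, 2)
    ]


print(expand_el("lactation", "only_in_taxon", "Mammalia"))
print(sibling_general_axioms(["Eubacteria", "Eukaryota"])[0])
```

The lactation constraint is used here purely as an example input; the sibling pair is the one mentioned above.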

Taxon groupings

GO also makes use of taxon groupings – these include new classes such as “prokaryotes” which are defined using UnionOf axioms. They are available in go-taxon-groupings.owl.

Taxon modules

One of the uses of taxon constraints is to build taxon-specific subsets of ontologies. This will be covered in a future post.

References

  1. Kusnierczyk, W. (2008). Taxonomy-based partitioning of the Gene Ontology. Journal of Biomedical Informatics.
  2. Deegan (née Clark), J. I., Dimmer, E. C., and Mungall, C. J. (2010). Formalization of taxon-based constraints to detect inconsistencies in annotation and ontology development. BMC Bioinformatics 11, 530.
  3. Mungall, C. J., Torniai, C., Gkoutos, G. V., Lewis, S. E., and Haendel, M. A. (2012). Uberon, an integrative multi-species anatomy ontology. Genome Biology 13, R5. http://genomebiology.com/2012/13/1/R5