The perils of managing OWL in a version control system
March 30, 2014
Background
Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like GitHub provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.
VCSs also integrate well with Continuous Integration systems.
For example, a CI system can be configured to run a series of checks and even publish, triggered by a git commit/push.
OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ASCII-level
diffs were as human readable as possible.
OBO Format is a deprecated format – I recommend groups switch to using
one of the W3C concrete forms of OWL. However, this comes with one
caveat – if the source (editors') version of an ontology is switched
from obo to any other OWL serialization, then human-readable diffs are
lost. Additionally, the non-deterministic serialization of the
ontology results in spurious diffs that not only hamper
human-readability, but also cause bottlenecks in VCS. As an example,
releasing a version of the Uberon ontology can consume over an hour
simply performing SVN operations.
The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.
Methods
I measured the “diffability” of different OWL formats by taking a
mid-size ontology incorporating a wide range of OWL constructs
(Uberon) and measuring the size of diffs between two ontology versions
in relation to the change in the number of axioms.
Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output by
the unix diff command (counted using “wc -l”).
This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, ... up to
2^16. This was repeated ten times.
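The measurement loop described above can be sketched roughly as follows. This is a minimal illustration rather than the script actually used: it assumes a unix `diff` on the PATH, and it writes one axiom string per line as a hypothetical stand-in for a real OWL serializer such as the ones invoked through OWLTools. The function names are made up for this sketch.

```python
import random
import subprocess
import tempfile
from pathlib import Path


def diff_line_count(old_path, new_path):
    """Rough equivalent of `diff old new | wc -l`."""
    # diff exits with status 1 when the files differ, so check=True is not used.
    result = subprocess.run(
        ["diff", str(old_path), str(new_path)],
        capture_output=True,
        text=True,
    )
    return len(result.stdout.splitlines())


def removal_counts(max_exponent):
    """Return 1, 2, 4, ... up to 2**max_exponent."""
    return [2 ** k for k in range(max_exponent + 1)]


def run_experiment(axioms, max_exponent, repeats, seed=0):
    """Map each removal count to the diff sizes seen over `repeats` trials."""
    rng = random.Random(seed)
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        old_path = Path(tmp) / "old.txt"
        new_path = Path(tmp) / "new.txt"
        # Hypothetical serializer: one axiom per line, in the order given.
        old_path.write_text("\n".join(axioms) + "\n")
        for n in removal_counts(max_exponent):
            sizes = []
            for _ in range(repeats):
                removed = set(rng.sample(range(len(axioms)), n))
                kept = [a for i, a in enumerate(axioms) if i not in removed]
                new_path.write_text("\n".join(kept) + "\n")
                sizes.append(diff_line_count(old_path, new_path))
            results[n] = sizes
    return results


if __name__ == "__main__":
    toy_axioms = [f"SubClassOf(:C{i} :D{i})" for i in range(1000)]
    for n, sizes in run_experiment(toy_axioms, 6, 3).items():
        print(n, sizes)
```

Because this toy serializer is deterministic, the diff size here grows with the number of axioms removed (the removed lines plus one header line per hunk). A real run would replace the one-line-per-axiom writer with each OWL format in turn.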
The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.
Results
The results can be downloaded HERE, and are plotted in the following
figure.
As can be seen there is a marked difference between the two RDF
formats (RDF/XML and Turtle) and the dedicated OWL serializations
(Manchester and Functional), which have roughly similar diffability to
OBO format.
In fact the diff size for RDF formats is both constant and large
regardless of the number of axioms removed. This appears to be due to
non-determinism when serializing axiom annotations.
This analysis only considers a single ontology, and a single version of the OWL API.
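To illustrate why an unstable output order swamps the real change, here is a small simulation. It is a sketch under stated assumptions, not a model of the OWL API: `stable_writer` and `unstable_writer` are hypothetical toy serializers, the axioms are placeholder strings, and changed lines are counted with Python's difflib rather than the unix diff command.

```python
import difflib
import random


def changed_lines(old, new):
    """Number of lines added or removed between two lists of lines."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def stable_writer(axioms, rng):
    """Toy serializer that always emits axioms in sorted order."""
    return sorted(axioms)


def unstable_writer(axioms, rng):
    """Toy serializer whose output order differs on every save."""
    lines = list(axioms)
    rng.shuffle(lines)
    return lines


def measure(writer, n_removed, rng, total=500):
    """Diff size after dropping `n_removed` of `total` placeholder axioms."""
    axioms = [f"axiom-{i:04d}" for i in range(total)]
    before = writer(axioms, rng)
    after = writer(axioms[n_removed:], rng)
    return changed_lines(before, after)


if __name__ == "__main__":
    rng = random.Random(42)
    for n in (1, 8, 64):
        print(n, measure(stable_writer, n, rng), measure(unstable_writer, n, rng))
```

With the stable writer the count equals the number of axioms removed. With the shuffling writer it should stay large however few axioms are removed, which is the same qualitative pattern as the RDF formats in the figure.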
Discussion and Conclusions
Based on these results, it would appear to be a huge mistake to ever
manage an RDF serialization of OWL in a VCS. Using Manchester or
Functional gives superior diffability, with the size of the diff
proportional to the number of axioms changed. OBO format offers human
readability of diffs as well, but this format is limited in
expressivity.
These recommendations are consistent with the size of the file in each format.
The following numbers are for Uberon:
- obo 11M
- omn 28M
- ofn 37M
- owl 53M
- ttl 58M
However, one issue here is that RDF-level tools may not accept a
dedicated OWL serialization such as ofn or omn. Most RDF libraries
will, however, accept RDF/XML or Turtle.
The ontology manager is then faced with a quandary – cut themselves
off from a segment of the semantic web and have diffs that are
manageable (if not readable) or live with enormous spurious diffs for
the benefits of SW integration.
The best solution would appear to be to manage source versions in a
diffable format, and release in a more voluminous RDF/semweb
format. This is not so different from software management – the users
consume a compiled version of the software (jars, object files, etc)
and the software is maintained as diffable source. It’s generally
considered bad practice to check derived products into a VCS.
However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.
The approach I would recommend is therefore to adapt the RDF/XML
generator of the OWL API such that it is deterministic, or to write an
RDF roundtripper that always produces a deterministic
serialization. This should be coupled with ongoing efforts to add
human-readable class labels as comments to enhance readability of diffs.
Ideally the recommended deterministic serialization order would be formally
specified, such that different software (and different versions of the same
software) could adhere to it.
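As a rough illustration of the idea (not the OWL API's renderer, and not a proposed standard), a deterministic writer only needs a fixed sort key. The sketch below assumes axioms are available as (subject id, axiom text) pairs. The `EX:` identifiers, the labels and the trailing `#` comment format are all made up for the example.

```python
def render_canonical(axioms, labels):
    """Render (subject_id, axiom_text) pairs in a fixed order.

    Each line gets the subject's label as a trailing comment when one is
    known. The comment format is hypothetical, not part of any OWL syntax.
    """
    lines = []
    # Sorting on (subject, text) makes the output independent of input order.
    for subject, text in sorted(set(axioms)):
        label = labels.get(subject)
        suffix = f"  # {label}" if label else ""
        lines.append(text + suffix)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder identifiers and labels, invented for illustration only.
    axioms = [
        ("EX:0000002", "SubClassOf(EX:0000002 EX:0000001)"),
        ("EX:0000003", "SubClassOf(EX:0000003 EX:0000001)"),
        ("EX:0000002", "DisjointClasses(EX:0000002 EX:0000003)"),
    ]
    labels = {"EX:0000002": "example organ", "EX:0000003": "example tissue"}
    print(render_canonical(axioms, labels), end="")
```

Any order in which the same axioms are held in memory gives byte-identical output, so a diff only shows lines that really changed. A real RDF serialization would also need stable handling of blank nodes, which this sketch does not attempt.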
At the same time, we need to be working on analogs of Maven and
package management systems in the ontology world.
Footnote:
Some ongoing efforts to mavenize ontologies:
Updates:
- The OWLAPI developers are making changes to the renderers for all OWL syntaxes that preserve ordering. Follow this issue in GitHub.
- Shahim Essaid is writing some great tools for working with ontologies in GitHub