infrastructure · OBO-Format · Ontologies · OWLAPI · Tools · Uberon

The perils of managing OWL in a version control system

Background

Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like github provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.

VCSs also integrate well with Continuous Integration (CI) systems.
For example, a CI system can be configured to run a series of checks, and even publish, triggered by a git commit/push.

OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ascii-level
diffs were as human readable as possible.

OBO Format is a deprecated format – I recommend groups switch to using
one of the W3C concrete forms of OWL. However, this comes with one
caveat – if the source (editors) version of an ontology is switched
from obo to any other OWL serialization, then human-readable diffs are
lost. Additionally, the non-deterministic serialization of the
ontology results in spurious diffs that not only hamper
human-readability, but also cause bottlenecks in VCS. As an example,
releasing a version of the Uberon ontology can consume over an hour
simply performing SVN operations.

The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.

Methods

I measured the “diffability” of different OWL formats by taking a
mid-size ontology incorporating a wide range of OWL constructs
(Uberon) and measuring the size of diffs between two ontology
versions in relation to the change in
the number of axioms.

Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output by the
unix diff command, as counted by “wc -l”.

This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, .. up to
2^16. This was repeated ten times.
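In outline, the measurement loop can be sketched as follows. This is a minimal Python sketch, not the actual experimental code: `serialize` stands in for an OWL API writer for one of the formats above, and the real diff sizes came from the unix diff command rather than difflib.

```python
import difflib
import random

def diff_size(old_text: str, new_text: str) -> int:
    """Approximate `diff old new | wc -l` by counting added/removed
    lines in a zero-context unified diff."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), n=0, lineterm="")
    return sum(1 for line in diff
               if line[:1] in "+-" and not line.startswith(("---", "+++")))

def run_trial(serialize, ontology_axioms, n_removed, rng):
    """Remove n_removed randomly chosen axioms, re-serialize, and
    measure the diff against the original serialization."""
    before = serialize(ontology_axioms)
    kept = list(ontology_axioms)
    for ax in rng.sample(kept, n_removed):
        kept.remove(ax)
    after = serialize(kept)
    return diff_size(before, after)
```

For an ideal one-axiom-per-line deterministic serialization, removing n axioms yields a diff of exactly n lines; the point of the experiment is to see how far real OWL writers fall short of that ideal.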

The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.

Results

The results can be downloaded HERE, and are plotted in the following
figure.

 

Plot showing size of diffs in relation to number of axioms added/removed

As can be seen, there is a marked difference between the two RDF
formats (RDF/XML and Turtle) and the dedicated OWL serializations
(Manchester and Functional), the latter having roughly similar
diffability to OBO format.

In fact the diff size for the RDF formats is both large and roughly
constant, regardless of the number of axioms changed. This appears to
be due to non-determinism when serializing axiom annotations.

This analysis only considers a single ontology, and a single version of the OWL API.

Discussion and Conclusions

Based on these results, it would appear to be a huge mistake to ever
manage an RDF serialization of OWL in a VCS. Using Manchester or
Functional syntax gives superior diffability, with the size of the
diff proportional to the number of axioms changed. OBO format offers
human-readable diffs as well, but is limited in expressivity.

These recommendations are consistent with the size of the file in each format.

The following numbers are for Uberon:

  • obo 11M
  • omn 28M
  • ofn 37M
  • owl 53M
  • ttl 58M

However, one issue here is that RDF-level tools may not accept a
dedicated OWL serialization such as ofn or omn. Most RDF libraries
will, however, accept RDF/XML or Turtle.

The ontology manager is then faced with a quandary – cut themselves
off from a segment of the semantic web and have diffs that are
manageable (if not readable) or live with enormous spurious diffs for
the benefits of SW integration.

The best solution would appear to be to manage source versions in a
diffable format, and release in a more voluminous RDF/semweb
format. This is not so different from software management – the users
consume a compiled version of the software (jars, object files, etc.)
and the software is maintained as diffable source. It’s generally
considered bad practice to check derived products into a VCS.

However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.

The approach I would recommend is therefore to adapt the RDF/XML
generator of the OWL API such that it is deterministic, or to write an
RDF roundtripper that always produces a deterministic
serialization. This should be coupled with ongoing efforts to add
human-readable class labels as comments to enhance the readability of diffs.
Ideally the recommended deterministic serialization order would be formally
specified, such that different software (and different versions of the same
software) could adhere to it.
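As a rough sketch of what such a deterministic writer would need to do (a hypothetical function, not part of the OWL API), it is enough to emit axioms one per line in a canonical sort order, optionally appending label comments:

```python
def serialize_deterministically(axioms, labels=None):
    """Sketch: emit one axiom string per line in a fixed sort order,
    so repeated serializations of the same ontology are byte-identical
    and diffs stay proportional to the number of changed axioms.
    `labels` optionally maps IRIs to human-readable labels appended
    as comments."""
    labels = labels or {}
    out = []
    for ax in sorted(set(axioms)):  # canonical order, duplicates collapsed
        # Crude substring match; a real writer would resolve IRIs properly.
        labs = [lab for iri, lab in sorted(labels.items()) if iri in ax]
        out.append(ax + (" # " + ", ".join(labs) if labs else ""))
    return "\n".join(out) + "\n"
```

The key property is order-independence: two serializations of the same axiom set are identical no matter what order the in-memory API returned them in.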

At the same time, we need to be working on analogs of maven and
package management systems in the ontology world.


Footnote:

Some ongoing efforts to mavenize ontologies:

Updates:


25 thoughts on “The perils of managing OWL in a version control system”

  1. Hi Michel,

    1. I agree we should do this. It’s not hard in a theoretical way but it would be a decent chunk of work that someone needs to do, requiring a bit of rewriting, since it all works by iteration over sets at the moment.
    2. I will check out ecco, hadn’t heard of that, thanks. In our experience it’s not that hard to grok the meaning of most diffs just by looking at the ascii diff, provided it’s a diffable syntax like obo (adding labels as comments to omn would go some way to achieving this).

    1. Let me clarify – this post is primarily about the engineering and maintenance of ontologies. OBO format is deprecated for this purpose, because we no longer develop the only tool dedicated to this format, OBO-Edit. I encourage everyone developing ontologies to switch to OWL, but with a few caveats including (1) Protege lacks many of the features present in OE resulting in a serious hindrance to productivity for certain kinds of editing (2) Your diffs become incomprehensible, which from a software engineering perspective is unacceptable. For (1) we are addressing this with new plugins, and this post addresses (2).

      The next Protege build will likely incorporate the official obo parser/writer, allowing obo to be used in conjunction with Protege, but once (1) and (2) are resolved there will be no need to maintain in obo.

      OBO format will likely always be available for all ontologies so long as developers keep asking for it, despite attempts to get them to switch their pipelines to OWL (in GO we still get requests for the dag format developed in the 1990s, which .obo supplanted 12 or so years ago).

  2. Chris – two thoughts, from opposing ends of the problem.

    (1) There is more than one algorithm for determining the diff, and not every algorithm will be best or even only good at diff’ing every type of source. For example, Github has managed to produce a diff between images, something that unix diff will certainly be quite bad at. Git as a VCS allows specifying different algorithms; what if we had a diff algorithm that could tell the difference in axioms between two input ontologies, in a way that could be plugged into a VCS like git. We could then meaningfully diff ontologies in many formats. So rather than trying to adapt ontology formats to the limits of default diff algorithms, we could try to create better fitting diff algorithms.

    (2) To come at this from the exact opposite end, if we are considering adapting the ontology format to the limitations of diff algorithms as an acceptable path, why not go all the way. For example, have you considered maintaining ontologies in a literate programming format, such as the one created by Phil Lord, or the scowl syntax created by Jim Balhoff. I think there could be a lot gained by that, too.

    1. Hi Hilmar,

      (1) Neat, I wasn’t aware of this – is this what you’re referring to

      https://github.com/blog/1772-diffable-more-customizable-maps

      I was aware of the ability to plug in your own diff algorithm locally, but it would be lovely to have the support be server side.

      Do you know how this works under the hood? Is this just a UI embellishment or is git using a different algorithm to calculate the object files?

      How would we go about incorporating a custom algorithm within github (and not just on the client side)? There are OWL and RDF diff implementations aplenty; the challenge is how to integrate them.

      (2) The problem here is tool support. I personally would love to create and maintain ontologies this way, but our domain knowledge experts don’t share my love of emacs.

      I do think there is something suboptimal in how we approach ontology development. Really we should have high level modeling done in a kind of Balhoffian or Lordian way, producing small but powerful ontologies supplemented by axioms maintained in domain-expert appropriate tools (atlases, spreadsheets, graph editors) but that’s the subject of a post for another day…

      1. Pluggable diffs are not all they are cracked up to be, I think. While this article is mostly looking at diffs as a user interface, they are also part of the critical infrastructure of the VCS. Ask yourself how many ontology development teams actually use fork-merge semantics of git or equivalent? With diffs like Chris is showing you are heading for conflict every time.

        Which heads straight back to the user interface — fixing a conflict in an XML serialisation of OWL is really a very unpleasant experience.

      2. Phil – excellent point about resolving conflicts, I can’t believe I omitted this. I think that working in obo format spoils you here. Teams editing another OWL format typically have a locking system – a developer sends an email to the mailing list announcing their lock, performs their edits, commits, sends an unlock message – thus avoiding gnarly conflicts.

        GO has historically never had such a system (using .obo), with multiple editors working simultaneously. This is one of the blockers to moving forward and taking more advantage of OWL. It’s not unusual for experimental work to be done on branches in GO – here care has to be taken in merging back, but this is true of any syntax.

      3. Phil – you are correct with the merge issue, but git does have pluggable merge strategies as well.

        Chris – I haven’t had the time yet to understand the plugin interfaces in git. Could be a really nice summer internship project if a git developer were to mentor it.

      4. Chris — I’m very familiar with the “obi is locked” emails for that ontology. I mean, RCS was a fantastic tool in its time, but recreating this form of workflow with SVN — well, it can’t be the way to go.

        drycafe — “you are correct with the merge issue, but git does have pluggable merge strategies as well”.

        Probably it does — but then will the tools that use git all support this? I mean, I use git command line rarely, and magit (cause I am an Emacs junkie) often. And magit shows me the output of diff so I can stage chunks and stuff. Likewise merge tools.

        We can adapt software tools to fit ontology development as is. Or we can adapt ontology development to fit (many) software tools. Is a line-orientated, deterministically serialized syntax really too much to ask?

        Apologies if this has come in the wrong thread — in my hands wordpress is not giving a reply button on comments more than 2 deep.

      5. Chris — “The problem here is tool support. I personally would love to create and maintain ontologies this way, but our domain knowledge experts don’t share my love of emacs.”

        This is a big problem of course. I mean, the whole point of Tawny is to enable a programmatic approach to ontology development and that includes the tools. Great for me (also an Emacs geek, but any IDE will do); not so good for people who don’t program in the first place. Git is a seriously good tool; but would you seriously expect normal humans to use it?

        The flippant answer to this, of course, is that if domain experts want to be involved in building large computationally amenable models, they have to learn the technology. Or, another answer is that it’s only designed for programmers.

        I think, though, there is a middle way.

        Chris — “Really we should have high level modeling done in a kind of Balhoffian or Lordian way, producing small but powerful ontologies supplemented by axioms maintained in domain-expert appropriate tools (atlases, spreadsheets, graph editors) but that’s the subject of a post for another day”

        Which is this. I have a (trivial, proof-of-concept) part in tawny where i18n happens with properties files — so your translator needs to go nowhere near your ontology. In the ideal world, we’d use a format that plugs directly into computational translation tools.

        And this is, I think, part of the solution. You would use tawny to define the centre of the ontology, as well as any patterns you needed. Then, you would use the programmatic features of tawny (i.e. clojure/java) to read in spreadsheets, properties files, any of a range of file formats or databases to make the ontology. For smaller ontologies, you could run protege as a visualizer to see it was all working. For really big ontologies, you could host the whole lot on a big machine somewhere, while developers just dealt with bits at a time. So, you would need a programmer on the team, but not everyone would need these skills.

  3. Very interesting to get some stats on this. The constant diff size of ttl and OWL surprised me, I have to say. It’s pretty bad.

    I think, however, there is an important caveat with Manchester syntax: that it doesn’t quite support the full expressivity of OWL is, I think, not a major issue because OWL is too complicated anyway; but I have found the tools for reading Manchester syntax are not as good as those for the XML representation, and they fail sometimes. For example, my tawnyified version of the Pizza ontology does not (or did not, I haven’t checked recently) round-trip when saved and loaded as OMN (it comes out inconsistent due to a misplaced disjoint axiom). Nowadays, I use OMN as a visualisation syntax to see that Tawny has worked well (along with Protege, which I can now use as a visualiser for a live Tawny REPL).

    I had a brief look at Ontomaven and it’s a nice enough idea. One of the things, however, that comes for free with Tawny is maven support; I can publish my ontologies into a maven repository, with version numbers, and full dependency resolution. Continuous integration comes for free as well. And it’s nice because it’s not ontology-specific; I can continuously integrate an ontology with tawny, hermit and the OWL API.

    Having said that, one thing that I think is currently wrong with ontology development is the idea that integration is necessarily a good thing. We need fewer dependencies between ontologies, not more.

    1. One thing I should say is that Uberon probably represents a worst-case scenario. I imagine if we repeated the experiment on ontologies that did not use axiom annotation then the diff size would be more proportional.

      Yes, Manchester doesn’t have full expressivity (you and I are on a W3C team to resolve this but we’ve been pretty quiescent of late). I agree that all of OWL is probably too much for any one ontology, but the parts missing from Manchester are actually pretty useful for us (encoding spatial disjointness axioms via GCIs – see https://github.com/obophenotype/uberon/wiki/Part-disjointness-Design-Pattern ).

      How would you see Tawny integrating with existing well-established ontologies? I can see many advantages to re-engineering many ontologies from the ground-up but this is quite a daunting task.

      Re: dependencies. I don’t think we can avoid the dependencies the way the ontologies are currently divided (with GO occupying a large vertical slice of biology), but maybe ontologies can be better modularized to avoid these dependencies.

      1. “(you and I are on a W3C team to resolve this but we’ve been pretty quiescent of late).”

        You have a better memory than I! To be honest, since I wrote Tawny, my interest in OMN has dropped a bit.

        Chris — “I agree that all of OWL is probably too much for any one ontology, but the parts missing from Manchester are actually pretty useful for us ”

        GCIs, yes, probably needed. The annotations on everything and anything — which OMN supports but not all tools do (like my omn-mode.el for instance), probably less so.

        Chris — “How would you see Tawny integrating with existing well-established ontologies? I can see many advantages to re-engineering many ontologies from the ground-up but this is quite a daunting task.”

        Restarting is impractical in many cases, yes, I agree.

        I don’t see moving to Tawny happening in a single workflow. There are a number
        of options. A couple are:

        – big bang: take the OWL, render it to Tawny (tawny will do this for you, sort of), then slowly refactor to use programmatic abstract as necessary. We (rather, Jennifer Warrander) have done this with SIO for instance. It’s not effort free, but it’s quite do-able.

        – Leave the dependencies, extend in Tawny. We (Michael Bell in this case) have taken this approach with the music ontology with our Tawny Overtone project. Tawny behaves quite well in these circumstances and can make an OWL dependency look exactly the same as a dependency written in Tawny in the first place.

        – Use Tawny as a side issue — that is have tawny perform some function that is secondary to the main ontology development. Unit tests are an obvious example, internationalisation is another.

        – Use Tawny as part of the morass (rather like Alan R’s stuff in OBI), and similar to the comment I posted above. So develop bits of your ontology with tawny, bits with protege, and maybe some in excel. This might help with modularization also.

        Chris — “dependencies. I don’t think we can avoid the dependencies the way the ontologies are currently divided”

        I guess we shouldn’t muddy the waters with this discussion here; it’s a good pub discussion!

  4. I think saying VCSs are a bad idea for OWL ontologies is somewhat of a shortcut. What you are instead showing in the above is that the diff feature of VCSs is not adapted to OWL ontologies.
    I don’t think we should have the OWLAPI order axioms – considering the spec doesn’t enforce this, it is IMO opening the door to non-guaranteed behaviour (even more so if you switch between editing tools)
    I agree that the ability to have diffs is a nice feature – however as mentioned above those are not really used for merging but rather as a logging tool (i.e., which changes were done since the last released version of the ontology). Several tools have been developed for that purpose – off the top of my head I can think of Bubastis at the EBI, http://www.ebi.ac.uk/efo/bubastis/
    The typical number of editors for a given project should also be considered: sending the “file is locked” email is something I would prefer not to do, but considering at most 3 people are active editors for a file, why go through the trouble of possible conflicts? Even if the diff was perfect, the emails would be sent (dealing with the diff in a VCS is also painful)
    I concur with your assessment that most editors are not tool friendly – if the expectation is to have something mainstream it will need to be embedded in the tools currently being used (e.g. Protege) – while Phil’s Tawny and Alan’s lisp are great, “normal” people won’t use them.

    We should look at what functionalities we want when we say “we need a diff” – if it is the ability to produce change reports as documentation for the next release, then there are tools doing that. If it is the ability to resolve conflicts (and potentially merge) I would try and see if current curators would be interested in such a feature (and whether it’s a worthwhile endeavour, considering the complexity of the task)

    1. I think it’s OWL syntaxes that need to adapt, not VCSs 🙂

      There is no problem with having a recommended ordering of axioms. OBO Format was built this way from the start. Parsers must accept any ordering, writers are encouraged to follow the ordering. Of all the features of OBO, this was one of the ones we got right. As Michel points out it would be easy to adapt or clone existing writers to follow a standard ordering. It’s just a matter of specifying this (not necessarily a W3C spec, could just be a less formal recommendation) and implementing it.

      I have nothing against people coordinating edits via email. This is always good practice when embarking on a big refactor in any source code. But to require this for every mundane isolated edit – as Phil says, it’s harking back to RCS. It’s not the 90s any more, software engineers would never put up with this, but as ontology engineers working in OWL we’re forced to settle for less.

    2. The reason that I disagree with the idea that VCS diffs are not adapted to OWL is to note that VCS diffs are adapted to Java. And Perl. And Python. You can put todo lists in them, or maintain your household budget. But, for OWL, it breaks. Of course, OWL is not alone in this; Word, or Excel also break for the same reason. In all cases, you have the same issue. These file formats are not sharable source formats. This is part of the reason Word (and WebProtege) have to have bespoke collaborative support built into them. The question is, why can we maintain complex computational knowledge (in the form of programs) in a diff-able format, but not OWL.

      The reason to go through the pain of a possible conflict is that it enables fork-and-merge semantics. This is so useful that it more than makes up for the risk of conflict, because (in general) these rarely happen. But not with OWL. And this makes a big difference. While I have worked with OBI, I have never once checked in, because the versioning was complicated. When I did the maven port of HermiT, it took weeks before I could make the first change, because the VCS was hidden. When I started Tawny, I started to make “drive-by” patches to the OWL API very quickly because it was easy. In the latter case, I just did it; no emails, nothing. And this really is the norm these days.

      I agree with you that “normal” people will not want to use Tawny. It is aimed at people who have at least some experience of programming or who are willing to learn. Having said that, in the ontology community, programmers and those willing to learn are not that few and far between, nor is it that difficult. I was normal once, till I learned. But there has to be a good reason to learn. Perhaps, Tawny provides this.

    1. Let’s see… at the end of the day, in practice, I do N-triples | sort | diff a lot. It’s pretty crude, but it works well for “RDF”. When you have lots of blank nodes (and restrictions) things may become more complex.

      1. Hi Andrea – unfortunately this will not work well for any graph with blank nodes, and most bio-ontologies are blank node heavy when rendered as RDF.

      2. @cmungall Indeed. When you start to have restrictions, this is an issue. In practice, I tend to minimise blank nodes and generate their serialisation directly, so I have more control. But I’m working in unusual contexts, where basically I have RDF, on top of which I lay fragments of OWL. Could talk more… but that would need to be offline 😉
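The sort-and-diff trick discussed in this thread can be sketched as follows (an illustrative Python rendering of the shell pipeline; as noted above, it breaks down once blank nodes appear, because their generated labels differ between serializations):

```python
import difflib

def ntriples_diff(nt_a: str, nt_b: str):
    """Order-independent comparison of two N-Triples documents,
    equivalent in spirit to `sort a.nt | diff - <(sort b.nt)`.
    Returns the added/removed triple lines."""
    a = sorted(line for line in nt_a.splitlines() if line.strip())
    b = sorted(line for line in nt_b.splitlines() if line.strip())
    diff = difflib.unified_diff(a, b, n=0, lineterm="")
    return [line for line in diff
            if line[:1] in "+-" and not line.startswith(("---", "+++"))]
```

Two files containing the same triples in different orders compare as equal; but two serializations of the same graph that use different blank-node labels (say _:b1 vs _:genid1) will not, which is exactly the restriction-heavy case mentioned above.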

  5. Hi all

    In Oxford we carried out extensive research in this area (http://www.cs.ox.ac.uk/isg/tools/ContentCVS/paperDKE-DATAK-1291.pdf) and developed a concurrent versioning plugin for Protege (http://www.cs.ox.ac.uk/isg/tools/ContentCVS/).

    We also presented a demo some time ago at SWAT4LS: http://www.swat4ls.org/2009/progr.php (http://www.cs.ox.ac.uk/files/4555/Demo3.pdf).

    We did not find the time to update ContentCVS, but we have plans to do it soon in one of the projects we are working in, so feedback about ContentCVS or application-specific requirements would be very welcome.

    In a similar line, the guys from Manchester have also been working in logic-based versioning of ontologies (I think Michel already spotted this):
    http://owl.cs.manchester.ac.uk/research/diff/

    Best
    Ernesto

    1. Hi Ian and Ernesto,

      ContentCVS seems very interesting. Pardon me if I’m missing something but it seems this system is an alternative to other VCSs such as git, svn and cvs? Or does it somehow integrate with them?

      One thing I think is important to point out is that requirements typically go beyond having a bare-bones VCS. Something like git is very feature rich (though most ontology developers don’t need much more than commit, pull and push). More importantly, there are now a range of hosting solutions – a while back everyone used sourceforge, many switched to google code, with github now being favored (and bitbucket still having its loyal fans). Once you have grown used to some of the benefits of managing your source in something like github it would be hard to go back to a less featureful system. Examples include:

      * rich web interface, including easy browsing of commit history and commits
      * a great bug tracking system *integrated with the VCS* allowing you to close issues with commit messages
      * free, no-hassle hosting
      * integrated continuous integration system (Travis), and ease of integration into external CI systems (e.g. Jenkins)
      * ease of forking
      * lots of other fun things

      If you have a plan for integrating your system with a feature-rich hosting site like github then I am super-interested.

      If it would entail building a parallel stack then perhaps less so.

      A 3rd route would be integrating with something like OntoHub which may be promising (but currently looks like a much less rich version of github).

      OR: we could just define variants of existing W3C OWL syntaxes that are optimized for diffing and reuse existing tools and hosting solutions. Years of experience with OBO Format show this works well.

      Hope this helps,
      Chris

      1. Hi Chris

        ContentCVS was developed as a proof of concept and it created its own VCS; but it could potentially be integrated with other VCSs, or at least the underlying ideas could be.

        If we finally extend ContentCVS I would also advocate for building it on top of github or similar hosting systems which already provide the basic infrastructure.

        Regarding the extension of OWL syntaxes, I think OWL already provides enough infrastructure in terms of annotations on both entities and axioms that could be used to guide the versioning.
        One could create a “versioning” meta-ontology to do so. I think there are some efforts in the literature (e.g. http://www.jbiomedsem.com/content/4/1/37)

        Best
        Ernesto
