Creating an ontology project, an update

In a previous post, I recommended some standard ways of managing the various portions of an ontology project using a version control system like GitHub.

Since writing that post, I’ve written a new utility that makes this task even easier. With the ontology-starter-kit you can generate all your project files and get set up for creating your first release in minutes. This script takes into account some changes since the original post two years ago:

  • Travis-CI has become the de-facto standard continuous integration system for performing unit tests on any project managed in GitHub (for more on CI see this post). The starter-kit will give you a default Travis setup; a rough sketch of what such a job runs appears after this list.
  • Managing your metadata and PURLs on the OBO Library has changed to a GitHub-based system.
  • ROBOT has emerged as a simpler way of managing many aspects of a release process, particularly managing your external imports.
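
As a rough illustration (an assumption on my part, since the starter-kit's generated Travis configuration and make targets may differ in detail), such a CI job essentially re-runs the ontology build and checks on every push:

# sketch of a commit-triggered CI check for an ontology repo;
# the actual generated config and make target may differ
cd src/ontology && make test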

Getting started

To get started, clone or download cmungall/ontology-starter-kit

Currently, you will need:

  • perl
  • make
  • git (command line client)

For best results, you should also download owltools, oort and robot (in the future we’ll have a more unified system).

You can obtain all these by running the install script:

./INSTALL.sh

This should be run from within the ontology-starter-kit directory.

Then, from within that directory, you can seed your ontology:

./seed-my-ontology-repo.pl  -d ro -d uberon -u obophenotype -t cnidaria-ontology cnido

 

This assumes that you are building some kind of extension to Uberon, using the Relation Ontology (OBO Library ontology IDs must be used here), that you will be placing this in the https://github.com/obophenotype/ organization, that the repo name will be cnidaria-ontology, and that IDs will be of the form CNIDA:nnnnnnn.

After running, the repository will be created in the target/cnidaria-ontology folder, relative to where you are. You can move this out to somewhere more convenient.

The script is chatty, and it informs you of how it is copying the template files from the template directory into the target directory. It will create your initial source setup, including a makefile, and then it will use that makefile to create an initial release, going so far as to init the git repo and add and commit files (unless overridden). It will not go as far as creating a repo for you on github, but it provides explicit instructions on what you should do next:


EXECUTING: git status
# On branch master
nothing to commit, working directory clean
NEXT STEPS:
0. Examine target/cnidaria-ontology and check it meets your expectations. If not blow it away and start again
1. Go to: https://github.com/new
2. The owner MUST be obophenotype. The Repository name MUST be cnidaria-ontology
3. Do not initialize with a README (you already have one)
4. Click Create
5. See the section under '…or push an existing repository from the command line'
E.g.:
cd target/cnidaria-ontology
git remote add origin git@github.com:obophenotype/cnido.git
git push -u origin master

Note also that it generates a metadata directory for you, with .md and .yml files you can use for your project on obolibrary (of course, you need to request your ontology ID space first, but you can go ahead and make a pull request with these files).

Future development

The overall system may no longer be necessary in the future, if we get a complete turnkey ontology release system with capabilities similar to analogous tools in software development such as maven.

For now, the Makefile approach is the most flexible, and is widely understood by many software developers, but a long-standing obstacle has been the difficulty of setting up the Makefile for a new project. The starter kit provides a band-aid here.

If required, it should be possible to set up alternate templates for different styles of project layouts. Pull requests on the starter-kit repository are welcome!

 

 


The perils of managing OWL in a version control system

Background

Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like github provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.

VCS systems also integrate well with Continuous Integration systems.
For example, a CI system can be configured to run a series of checks and even publish, triggered by a git commit/push.
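
As a minimal sketch of such a check (assuming ROBOT, mentioned earlier, is installed; the file name is illustrative, and any reasoner-backed validator would do):

# a logic check a CI job might run on each push; robot reason fails
# with a non-zero exit status if the ontology is incoherent
robot reason --reasoner ELK --input my-ont-edit.owl --output /tmp/reasoned.owl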

OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ascii-level
diffs were as human readable as possible.

OBO Format is a deprecated format – I recommend groups switch to using
one of the W3C concrete forms of OWL. However, this comes with one
caveat – if the source (editors) version of an ontology is switched
from obo to any other OWL serialization, then human-readable diffs are
lost. Additionally, the non-deterministic serialization of the
ontology results in spurious diffs that not only hamper
human-readability, but also cause bottlenecks in VCS. As an example,
releasing a version of the Uberon ontology can consume over an hour
simply performing SVN operations.

The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.

Methods

I measured the “diffability” of different OWL formats by taking a
mid-size ontology incorporating a wide range of OWL constructs
(Uberon) and measuring the size of diffs between two ontology versions in relation to the change in the number of axioms.

Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output using
the unix diff command (“wc -l”).

This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, .. up to
2^16. This was repeated ten times.
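
In outline, the procedure was roughly as follows (a sketch only: remove-random-axioms.sh is a hypothetical stand-in for the OWLTools step actually used):

#!/bin/sh
# sketch of the measurement loop; remove-random-axioms.sh is a
# hypothetical helper standing in for the actual OWLTools invocation
for fmt in obo ofn owl ttl omn; do
  n=1
  while [ $n -le 65536 ]; do             # 1, 2, 4, ... 2^16 axioms
    ./remove-random-axioms.sh uberon.$fmt $n > reduced.$fmt
    echo "$fmt $n $(diff uberon.$fmt reduced.$fmt | wc -l)"
    n=$((n * 2))
  done
done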

The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.

Results

The results can be downloaded HERE, and are plotted in the following
figure.

 

Plot showing size of diffs in relation to number of axioms added/removed

As can be seen, there is a marked difference between the two RDF formats (RDF/XML and Turtle) and the dedicated OWL serializations (Manchester and Functional), which have roughly similar diffability to OBO format.

In fact the diff size for RDF formats is both constant and large regardless of the number of axioms changed. This appears to be due to non-determinism when serializing axiom annotations.

This analysis only considers a single ontology, and a single version of the OWL API.

Discussion and Conclusions

Based on these results, it would appear to be a huge mistake to ever manage an RDF serialization of OWL in a VCS. Using Manchester or Functional syntax gives superior diffability, with the size of the diff proportional to the number of axioms changed. OBO format offers human-readable diffs as well, but this format is limited in expressivity.

These recommendations are consistent with the size of the file in each format.

The following numbers are for Uberon:

  • obo 11M
  • omn 28M
  • ofn 37M
  • owl 53M
  • ttl 58M

However, one issue here is that RDF-level tools may not accept a dedicated OWL serialization such as ofn or omn. Most RDF libraries will, however, accept RDF/XML or Turtle.

The ontology manager is then faced with a quandary – cut themselves off from a segment of the semantic web and have diffs that are manageable (if not readable), or live with enormous spurious diffs for the benefits of SW integration.

The best solution would appear to be to manage source versions in a diffable format, and release in a more voluminous RDF/semweb format. This is not so different from software management – the users consume a compiled version of the software (jars, object files, etc) and the software is maintained as diffable source. It’s generally considered bad practice to check derived products into a VCS.
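
A minimal sketch of that pattern, assuming ROBOT is available (file names are illustrative):

# keep the editors' file in a diffable format under version control;
# generate the voluminous RDF/XML at release time and don't check it in
robot convert --input my-ont-edit.obo --output release/my-ont.owl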

However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.

The approach I would recommend is therefore to adapt the RDF/XML generator of the OWL API such that it is deterministic, or to write an RDF roundtripper that always produces a deterministic serialization. This should be coupled with ongoing efforts to add human-readable class labels as comments to enhance readability of diffs.
Ideally the recommended deterministic serialization order would be formally
specified, such that different software (and different versions of the same
software) could adhere to it.
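
Pending such a fix, one crude workaround is to canonicalize to a line-oriented format and sort before committing. A sketch, assuming Apache Jena’s riot tool is installed; note that blank-node labels can still vary between runs, so this is not fully deterministic:

# N-Triples puts one triple per line, so sorting yields a stable order
# (modulo blank-node labels, which may be renamed between runs)
riot --output=ntriples my-ont.owl | sort > my-ont.nt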

At the same time, we need to be working on analogs of maven and
package management systems in the ontology world.

 

Footnote:

Some ongoing efforts to mavenize ontologies:


Creating an ontology project

UPDATE: see the latest post on this subject.

 

This article describes how to manage all the various files created as part of an ontology project. This assumes some familiarity with the unix command line. The article does not describe how to create the actual content for your ontology. For this I recommend the material from the NESCent course on building anatomy ontologies, put together by Melissa Haendel and Matt Yoder, as well as the OBO Foundry tutorial from ICBO 2013.

Some of the material here overlaps with the page on making a google code project for an ontology.

Let’s make a jelly project

For our purposes here, let’s say you’re a Cnidaria biologist and you’re all ready to start building an ontology that describes the anatomy and traits of this phylum. How do you go about doing this?

Portuguese man-of-war (Physalia physalis)

Well, you could just fire up Protege and start building the thing, keeping the owl file on your desktop, periodically copying the file to a website somewhere. But this isn’t ideal. How will you track your edits? How will you manage releases? What about imports from other ontologies (you do intend to import parts of other ontologies, don’t you? If the answer is “no” go back and read the course material above!).

It’s much better to start off on the right foot, keeping all your files organized according to a common directory layout, and making use of simple and sensible practices from software engineering.

OORT

As part of your release process you’ll make use of OWLTools and OORT, which can be obtained from the OWLTools google code repository.

Make sure you have OWLTools-Oort/bin/ in your PATH.
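
For example (assuming a checkout under $HOME/owltools; adjust the path to wherever you put it):

# make the Oort wrapper scripts visible to your shell, e.g. in ~/.bashrc
export PATH="$PATH:$HOME/owltools/OWLTools-Oort/bin"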

The first thing to do is to create a directory on your machine for managing all your files – as an ontology developer you will be managing a collection of files, not just the core ontology file itself. We’ll call this directory the “ontology project”.

To make things easy, owltools comes with a handy script called create-ontology-project to create a stub ontology project. This script is distributed with OWLTools but is available for download here:

http://owltools.googlecode.com/svn/trunk/OWLTools-Oort/bin/create-ontology-project

The first thing to do is select your ontology ID (namespace). This *must* be the same as the ID space you intend to use. So if your URIs/IDs are to be CNIDO_0000001 and so on, the ontology ID *must* be “cnido”. Note that whilst your IDs will be in SHOUTY CAPITALS, the actual ontology ID itself is all gentle lowercase, even the first letter. This is actually part of OBO Foundry ID policy.

Running the script

Now, type this on the command line:


create-ontology-project cnido

(you will need to add this to your path – it’s in the OWLTools-Oort/bin directory).

You will see the following output:


SUCCESS!
Directory Listing:
-----------------
cnido
cnido/doc
cnido/doc/README.txt
cnido/images
cnido/images/README.txt
cnido/LICENSE.txt
cnido/README.txt
cnido/src
cnido/src/ontology
cnido/src/ontology/catalog-v001.xml
cnido/src/ontology/CHANGES
cnido/src/ontology/diffs
cnido/src/ontology/diffs/Makefile
cnido/src/ontology/imports
cnido/src/ontology/imports/README.txt
cnido/src/ontology/cnido-edit.owl
cnido/src/ontology/cnido-idranges.owl
cnido/src/ontology/Makefile
cnido/tools
cnido/tools/README.txt

followed by:

What now?
* Create a git or svn project. E.g.:
cd cnido
git init
git add -A
git commit -m 'initial commit'
* Now visit github.com and create project cnido-ontology
* Edit ontology and create first release
cd cnido/src/ontology
make initial-build
* Create a jenkins job

What next?

You may not need all the stub files that are created, but it’s a good idea to have them there from the outset, as you may need them in future.

I recommend your first step is to follow the instructions above to (1) initiate a local git repository by typing the 4 commands above and (2) publish this on github. You will need to go to github.com, create an account, create a project, and select “create a project from an existing repository”. (More on this later.)

Once this is done, you can start modifying the stub files.

The top level README provides a high level overview of your project. You should describe the content and use cases here. You can edit this in a normal text editor — alternatively, if you intend to use github (recommended) then you can wait until you commit and push this file and then edit it via the github web interface.

You will probably spend most of your time in the src/ontology directory, editing cnido-edit.owl.

id-ranges

If you intend to have multiple people editing, then the cnido-idranges.owl file will be essential. You can edit this directly in Protege (but it may actually be easier to edit the file in a text editor). Assign each editor an ID range here (just follow the existing file as an example). Note that currently Protege does not read this file, so this just serves as formal documentation.

In future, id-ranges may be superseded by urigen servers, but for now they provide a useful way of avoiding collisions.

Documentation

If you use github or another hosting solution like google code, you can use their wiki system. You should keep any files associated with the documentation (word docs, presentations, etc) in the doc/ folder. You can link to them directly from the wiki.

Images

You can ask the OBO admins to help you set up a PURL redirect such that URLs of the form

http://purl.obolibrary.org/obo/cnido/images/

will redirect to your images/ directory, which is where you will place any pictures of jellyfish or anemone body parts that you want to be associated with classes in the ontology (assuming you have the rights to do this). I recommend Jim Balhoff’s depictions plugin.

Imports

Managing your imports can be a difficult task and deserves its own article.

For now you can browse the ctenophore-ontology project to see an example of a setup.

This setup uses the OWLAPI for imports, but others prefer to make use of OntoFox.

Releases

You can use OORT to create your release files. The auto-generated Makefile stub should be sufficient to manage a basic release pipeline. In the src/ontology directory, type this:


make all

This should make both cnido.obo and cnido.owl – these will be the files the rest of the world sees. cnido-edit is primarily seen by you and your fellow cnidarian-obsessed editors.

Caveats

Depending on the specific needs of your project, some of the defaults and stubs provided by the create-ontology-project script may not be ideal for you. Or you may simply prefer to create the directory structure manually (it’s not very hard) – this is of course fine. The script is provided primarily to help you get started; hopefully it will prove useful.

Finally, if you know any cnidarian biologists interested in contributing to an ontology, let me know as we are lacking detailed coverage in existing ontologies!

 

Ontologies and Continuous Integration

Wikipedia describes continuous integration (http://en.wikipedia.org/wiki/Continuous_integration) as follows:

In software engineering, continuous integration (CI) implements continuous processes of applying quality control — small pieces of effort, applied frequently. Continuous integration aims to improve the quality of software, and to reduce the time taken to deliver it, by replacing the traditional practice of applying quality control after completing all development.

This description could – or should – apply equally well to ontology engineering, especially in contexts such as the OBO Foundry, where ontologies are becoming increasingly interdependent.

Jenkins is a web-based environment for running integration checks. Sebastian Bauer, in Peter Robinson’s group, had the idea of adapting Jenkins for performing ontology builds rather than software builds (in fact he used Hudson, but the differences between Hudson and Jenkins are minimal). He used OORT as the tool to build the ontology — Oort takes in one or more ontologies in obo or owl, runs some non-logical and logical checks (via your choice of reasoner) and then “compiles” downstream ontologies in obo and owl formats. Converting to obo takes care of a number of stylistic checks that are non-logical and wouldn’t be caught by a reasoner (e.g. no class can have more than one text definition).
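
For example, a Jenkins build step might invoke Oort’s command-line wrapper directly (the flags shown are illustrative; consult ontology-release-runner’s own help for your version):

# hypothetical Oort invocation as a CI build step; flags illustrative
ontology-release-runner --reasoner hermit --outdir staging my-ont-edit.owl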

We took this idea and built our own Jenkins ontology build environment, adding ontologies that were of relevance to the projects we were working on. This turned out to be extraordinarily easy – Jenkins is very easy to install and configure, and help is always just a mouse click away.

Here’s a screenshot of the main Jenkins dashboard. Ontologies have a blue ball if the last build was successful, a red ball otherwise. The weather icon is based on the “outlook” – lots of successful builds in a row gets a sunny icon. Every time an ontology is committed to a repository it triggers a build (we try to track the edit version of the ontology rather than the release version, so that we can provide direct feedback to the ontology developer). Each job can be customized – for example, if ontology A depends on ontology B, you might want to trigger a build of A whenever a new version of B is committed, allowing you to be forewarned if something in B breaks A.

main jenkins view

Below is a screenshot of the configuration settings for the go-taxon build – this is used to check if there are violations of the GO taxon constraints (dx.doi.org/10.1016/j.jbi.2010.02.002). We also include an external ontology of disjointness axioms (for various reasons it’s hard to include this in the main GO ontology). You can include any shell commands you like – in principle it would be possible to write a Jenkins plugin for building ontologies using Oort, but for now you have to be semi-familiar with the command line and the Oort command line options:

config
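
The shell step for such a job boils down to merging the edit version of the ontology with the external disjointness axioms and running a reasoner, with unsatisfiable classes signalling taxon violations. A rough sketch (the owltools flags are illustrative; consult its help):

# merge the support ontology and reason; unsatisfiable classes indicate
# taxon constraint violations (flags illustrative)
owltools go-edit.obo external-disjoints.owl --merge-support-ontologies --run-reasoner -r elk -u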

Often when a job fails the Oort output can be a little cryptic – generally the protocol is to do detailed investigation using Protege and a reasoner like HermiT to track down the exact problem.

The basic idea is very simple, but works extremely well in practice. Whilst it’s generally better to have all checks performed directly in the editing environment, this isn’t always possible where multiple interdependent ontologies are concerned. The Jenkins environment we’ve built has proven popular with ontology developers, and we’d be happy to add more ontologies to it. It’s also fairly easy to set up yourself, and I’d recommend doing this for groups developing or using ontologies in a mission-critical way.

UPDATE: 2012-08-07

I uploaded some slides on ontologies and continuous integration to slideshare.

UPDATE: 2012-11-09

The article Continuous Integration of Open Biological Ontology Libraries is available on the Bio-Ontologies SIG KBlog site.