owljs – a javascript library for OWL hacking

owljs ia a javascript library for doing stuff with OWL. It’s available from github:

https://github.com/cmungall/owljs

Whilst it attempts to following CommonJS, you currently have to use RingoJSĀ  (a Rhino engine) as it makes use of JVM calls to the OWLAPI

owl plus rhino equals fun

 

Why javascript?

Why javascript you may ask? Isn’t that a hacky language run in browsers? In fact javascript is increasingly used on the server side as well as in browsers, as can be seen in the success of node.js. With Java 8 incorporating Nashorn as a scripting engine, it looks like javascript on the server side is here to stay.

Why not just use java? Java can be very verbose and is not the ideal language for writing short ontology processing scripts, especially with the OWL API.

There are a number of other languages better suited to scripting and declarative programming in general, many of which run on the JVM. This includes

  • Groovy – a popular choice for interfacing with the OWL API
  • The Armed Bear flavor of Common Lisp, as used in LSW2.
  • Clojure, a variant of lisp, as used in Phil Lord’s powerful Tawny-OWL framework.
  • Scala, a superbly elegant functional programming language used to great effect in Jim Balhoff’s beautifully elegant scowl.
  • Iron Python – a popular choice for interfacing with the Brain. And of course, Python is the de facto language for bioinformatics these days

There are also offerings in non-JVM languages such as my own posh – in addition most languages provide some kind of RDF library, but this can often be low level for working in OWL.

I decided to write a javascript library for a number of reasons. Our group already produces a lot of javascript code, most of which can be run on the server. For example, the golr libraries used in the AmiGO 2 codebase are CommonJS, as are those used for the Monarch API. Thse APIs all access ontologies through services (and can thus be run on a non-JVM javascript engine like node), and we would not make these APIs depend on a JVM. However, the ability to go the other way is useful – in a powerful ontology processing environment that offers all the features of the OWL API, being able to access all kinds of bioinformatics data through ready-made javascript APIs.

Another reason is that JSON is ubiquitous, and having your data format be a subset of the language has some major advantages.

Plus, after an initial period of ambivalence, I have grown to really like javascript. It’s just functional enough to do some cool things.

What can you do with it?

I hope to provide some specific examples later on this blog. For now, take a look at the docs on github. Major features are:

Stay tuned for more information!

 

 

 

 

The perils of managing OWL in a version control system

Background

Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like github provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.

VCS systems also integrate well with Continuous Integration systems.
For example, a CI system can be configured to run a series of checks and even publish, triggered by a git commit/push.

OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ascii-level
diffs were as human readable as possible.

OBO Format is a deprecated format – I recommend groups switch to using
one of the W3C concrete forms of OWL. However, this comes with one
caveat – if the source (editors) version of an ontology is switched
from obo to any other OWL serialization, then human-readable diffs are
lost. Additionally, the non-deterministic serialization of the
ontology results in spurious diffs that not only hamper
human-readability, but also cause bottlenecks in VCS. As an example,
releasing a version of the Uberon ontology can consume over an hour
simply performing SVN operations.

The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.

Methods

I measured the “diffability” of different OWL formats by taking a
mid-size ontology incorporating a wide range of OWL constructs
(Uberon) and measuring
size of diffs between two ontology versions in relation to the change in
the number of axioms.

Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output using
the unix diff command (“wc -l”).

This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, .. up to
2^16. This was repeated ten times.

The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.

Results

The results can be downloaded HERE, and are plotted in the following
figure.

 

Plot showing size of diffs in relation to number of axioms added/removed

Plot showing size of diffs in relation to number of axioms added/removed

As can be seen there is a marked difference between the two RDF
formats (RDF/XML and Turtle) and the dedicated OWL serializations
(Manchester and Functional), which have roughly similar diffability to
OBO format.

In fact the diff size for RDF formats is both constant and large
regardless of the size of the diff. This appears to be due to
non-determinism when serializing axiom annotations.

This analysis only considers a single ontology, and a single version of the OWL API.

Discussion and Conclusions

Based on these results, it would appear to be a huge mistake to ever
manage an RDF serialization of OWL in a VCS. Using Manchester or
Functional gives superior diffability, with the number of axiom
changed proportional to size of the diff. OBO format offers human
readability of diffs as well, but this format is limited in
expressivity.

These recommendations are consistent with the size of the file in each format.

The following numbers are for Uberon:

  • obo 11M
  • omn 28M
  • ofn 37M
  • owl 53M
  • ttl 58M

However, one issue here is that RDF-level tools may not accept a
dedicated OWL serialization such as ofn or omn. Most RDF libraries
will however, accept RDF/XML or Turtle.

The ontology manager is then faced with a quandary – cut themselves
off from a segment of the semantic web and have diffs that are
manageable (if not readable) or live with enormous spurious diffs for
the benefits of SW integration.

The best solution would appear to be to manage source versions in a
diffable format, and release in a more voluminous RDF/semweb
format. This is not so different from software management – the users
consume a compile version of the software (jars, object files, etc)
and the software is maintained as diffable source. It’s generally
considered bad practice to check in derived products into a VCS.

However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.

The approach I would recommend is therefore to adapt the RDF/XML
generator of the OWL API such that it is deterministic, or to write an
RDF roundtripper that always produces a determinstic
serialization. This should be coupled with ongoing efforts to add
human-readable class labels as comments to enhance readability of diffs.
Ideally the recommended deterministic serialization order would be formally
specified, such that different software (and different versions of the same
software) could adhere to it.

At the same time, we need to be working on analogs of maven and
package management systems in the ontology world.

 

Footnote:

Some ongoing efforts ito mavenize ontologies:

Updates: