A developer-friendly JSON exchange format for ontologies

OWL2 ontologies can be rendered using a number of alternate concrete forms / syntaxes:

  • Manchester Syntax
  • Functional Syntax
  • OWL-XML
  • RDF/XML
  • RDF/Turtle

All of the above are official W3C recommendations. If you aren’t that familiar with these formats and the differences between them, the W3C OWL Primer is an excellent starting point. While all of the above are semantically equivalent ways to serialize OWL (with the exception of Manchester, which cannot represent some axiom types), there are big pragmatic differences in the choice of serialization. For most developers, the most important differentiating factor is support for their language of choice.

Currently, the only language I am aware of with complete support for all serializations is Java, in the form of the OWLAPI. This means that most heavy-duty ontology applications use Java or a JVM language (see previous posts for some examples of JVM frameworks that wrap the OWLAPI).

Almost all programming languages have support for RDF parsers, which is one reason why the default serialization for OWL is usually an RDF one. In theory this makes OWL more accessible. However, RDF can be a very low level way to process ontologies. For certain kinds of operations, such as traversing a simple subClassOf hierarchy, it can be perfectly fine. However, even commonly encountered constructs such as “X SubClassOf part-of some Y” are very awkward to handle, involving blank nodes (see the translation here). When it comes to something like axiom annotations (common in OBO ontologies), things quickly get cumbersome. It must be said though that using an RDF parser is always better than processing an RDF/XML file using an XML parser. This is two levels of abstraction too low; never do this! You will go to OWL hell. At least you will not be in the lowest circle – this is reserved for people who parse RDF/XML using an ad-hoc perl regexp parser.
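
To make the blank-node awkwardness concrete, here is a minimal sketch in plain Python (no RDF library). The ex: identifiers and the existential_edges helper are hypothetical, made up for illustration; the triple pattern itself is the standard RDF encoding of an existential restriction.

```python
# Standard RDF/RDFS/OWL vocabulary IRIs.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL = "http://www.w3.org/2002/07/owl#"

# "X SubClassOf part_of some Y" becomes four triples hanging off a blank node.
# The ex: names are hypothetical identifiers used only for illustration.
triples = [
    ("ex:X", SUBCLASS_OF, "_:b0"),
    ("_:b0", RDF_TYPE, OWL + "Restriction"),
    ("_:b0", OWL + "onProperty", "ex:part_of"),
    ("_:b0", OWL + "someValuesFrom", "ex:Y"),
    # A plain named-class parent needs only one triple.
    ("ex:X", SUBCLASS_OF, "ex:Z"),
]


def existential_edges(triples):
    """Recover (subject, property, filler) edges from restriction blank nodes."""
    on_property = {s: o for s, p, o in triples if p == OWL + "onProperty"}
    some_values = {s: o for s, p, o in triples if p == OWL + "someValuesFrom"}
    edges = []
    for s, p, o in triples:
        # Only subClassOf triples whose object is a someValuesFrom restriction.
        if p == SUBCLASS_OF and o in on_property and o in some_values:
            edges.append((s, on_property[o], some_values[o]))
    return edges


print(existential_edges(triples))
```

Even in this toy case the consumer has to join three triples through the blank node to get back a single edge, which is the step a graph-level format can do once, up front.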

Even in JVM languages, an OWL-level abstraction can be less than ideal for some of the operations people want to do on a biological ontology. These operations include:

  • construct and traverse a graph constructed from SubClassOf axioms between either pairs of named classes, or named-class to existential restriction pairs
  • create an index of classes based on a subset of lexical properties, such as labels and synonyms
  • generate a simple term info page for showing in a web application, with common fields like definition prominently shown, with full attribution for all axioms
  • extract some subset of the ontology

It can be quite involved doing even these simple operations using the OWLAPI. This is not to criticize the OWLAPI – it is an API for OWL, and OWL is in large part a syntax for writing set-theoretic expressions constraining a world of models. This is a bit of a cognitive mismatch for a hierarchy of lexical objects, or a graph-based organization of concepts, which is the standard abstraction for ontologies in Bioinformatics.

There are some libraries that provide useful convenience abstractions – this was one of the goals of OWLTools, as well as The Brain. I usually recommend a library such as one of these for bioinformaticians wishing to process OWL files, but it’s not ideal for everyone. It introduces yet another layer, and still leaves out non-JVM users.

For cases where we want to query over ontologies already loaded in a database or registry, there are some good abstraction layers – SciGraph provides a bioinformatician-friendly graph level / Neo4J view over OWL ontologies. However, sometimes it’s still necessary to have a library to parse an ontology fresh off the filesystem with no need to start up a webservice and load in an ontology.

What about OBO format?

Of course, many bioinformaticians are blissfully unaware of OWL and just go straight to OBO format, a format devised by and originally for the Gene Ontology. And many of these bioinformaticians seem reasonably content to continue using this – or at least lack the activation energy to switch to OWL (despite plenty of encouragement).

One of the original criticisms of Obof was its lack of formalism, but now Obof has a defined mapping to OWL, and that mapping is implemented in the OWLAPI. Protege can load and save Obof just as if it were any other OWL serialization, which it effectively is (without the W3C blessing). It can only represent a subset of OWL, but that subset is actually a superset of what most consumers need. So what’s the problem in just having Obof as the bioinformaticians’ format, and ontologists using OWL for heavy duty ontology lifting?

There are a few:

  • It’s ridiculously easy to create a hacky parser for some subset of Obof, but it’s surprisingly hard to get it right. Many of the parsers I have seen are implemented based on the usual bioinformatics paradigm of ignoring the specs and inferring a format based on a few examples. These have a tendency to proliferate, as it’s easier to write your own than to figure out whether someone else’s fits your needs. Even with the better ones, there are always edge cases that don’t conform to expectations. We often end up having to normalize Obof output in certain ways to avoid breaking crappy parsers.
  • The requirement to support Obof leads to cases of tails wagging the dog, whereby ontology producers will make some compromise to avoid alienating a certain subset of users
  • Obof will always support the same subset of OWL. This is probably more than what most people need, but there are frequently situations where it would be useful to have support for one extra feature – perhaps blank nodes to support one level of nesting in an expression.
  • The spec is surprisingly complicated for what was intended to be a simple format. This can lead to traps.
  • The mapping between CURIE-like IDs and semantic web URIs is awkwardly specified and leads to no end of confusion when the semantic web world wants to talk to the bio-database world. Really we should have reused something like JSON-LD contexts up front. We live and learn.
  • Really, there should be no need to write a syntax-level parser. Developers expect something layered on XML or JSON these days (more so the latter).

What about JSON-LD?

A few years ago I asked on the public-owl-dev list if there were a standard JSON serialization for OWL. This generated some interesting discussion, including a suggestion to use JSON-LD.

I still think that this is the wrong level of abstraction for many OWL ontologies. JSON-LD is great and we use it for many instance-level representations, but it suffers from the same issue that all RDF layerings of OWL face: they are too low level for certain kinds of OWL axioms. Also, JSON-LD is a bit too open-ended for some developers, as graph labels are mapped directly to JSON keys, making it hard to map.

Another suggestion on the list was to use a relatively straightforward mapping of something like functional/abstract syntax to JSON. This is a great idea and works well if you want to implement something akin to the OWL API for non-JVM languages. I still think that such a format is important for increasing uptake of OWL, and hope to see this standardized.

However, we’re still back at the basic bioinformatics use case, where an OWL-level abstraction doesn’t make so much sense. Even if we get an OWL-JSON, I think there is still a need for an “OBO-JSON”, a JSON that can represent OWL constructs, but with a mapping to structures that correspond more closely to the kinds of operations like traversing a TBox-graph that are common in life sciences applications.

A JSON graph-oriented model for ontologies

After kicking this back and forth for a while we have a proposal for a graph-oriented JSON model for OWL, tentatively called obographs. It’s available at https://github.com/geneontology/obographs

The repository contains the start of documentation on the structural model (which can be serialized as JSON or YAML), plus Java code to translate an OWL ontology to obograph JSON or YAML.

Comments are more than welcome, here or in the tracker. But first some words concerning the motivation here.

The overall goal was to make it easy to do the 99% of things that bioinformatics developers usually do, but without throwing the 1% under the bus. Although it is not yet a complete representation of OWL, the initial design leaves room to extend things in this direction.

One consequence of this is that the central object is an existential graph (I’ll get to that term in a second). We call this subset Basic OBO Graphs, or BOGs, roughly corresponding to the OBO-Basic subset of OBO Format. The edge model is pretty much identical to every directed graph model out there: a set of nodes and a set of directed labeled edges (more on what can be attached to the edges later). Here is an example of a subset of two connected classes from Uberon:

"nodes" : [
    {
      "id" : "UBERON:0002102",
      "lbl" : "forelimb"
    }, {
      "id" : "UBERON:0002101",
      "lbl" : "limb"
    }
  ],
  "edges" : [
    {
      "subj" : "UBERON:0002102",
      "pred" : "is_a",
      "obj" : "UBERON:0002101"
    }
  ]
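
With nodes and edges in this shape, a common operation such as collecting ancestors needs only a few lines. The following is a sketch, not part of obographs itself; the ancestors function name is an assumption, and the graph literal simply repeats the fragment above.

```python
from collections import defaultdict

# The Uberon fragment from above, as a Python dict.
graph = {
    "nodes": [
        {"id": "UBERON:0002102", "lbl": "forelimb"},
        {"id": "UBERON:0002101", "lbl": "limb"},
    ],
    "edges": [
        {"subj": "UBERON:0002102", "pred": "is_a", "obj": "UBERON:0002101"},
    ],
}


def ancestors(graph, start, preds=("is_a",)):
    """Return all nodes reachable from `start` over edges with the given predicates."""
    parents = defaultdict(set)
    for edge in graph["edges"]:
        if edge["pred"] in preds:
            parents[edge["subj"]].add(edge["obj"])
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(ancestors(graph, "UBERON:0002102"))
```

Passing a wider preds tuple lets the same traversal follow other relationship types as well as is_a.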

So what do I mean by existential graph? This is the graph formed by SubClassOf axioms that connect named classes to either named classes or simple existential restrictions. Here is the mapping (shown using the YAML serialization – if we exclude certain fields like dates then JSON is a straightforward subset, so we can use YAML for illustrative purposes):

Class: C
  SubClassOf: D

==>

edges:
 - subj: C
   pred: is_a
   obj: D

Class: C
  SubClassOf: P some D

==>

edges:
 - subj: C
   pred: P
   obj: D

These two constructs correspond to is_a and relationship tags in Obof. This is generally sufficient as far as logical axioms go for many applications. The assumption here is that these axioms are complete to form a non-redundant existential graph.

What about the other logical axiom and construct types in OWL? Crucially, rather than following the path of a direct RDF mapping and trying to cram all axiom types into a very abstract graph, we introduce new objects for increasingly exotic axiom types – supporting the 1% without making life difficult for the 99%. For example, AllValuesFrom expressions are allowed, but these don’t get placed in the main graph, as typically these do not get operated on in the same way in most applications.

What about non-logical axioms? We use an object called Meta to represent any set of OWL annotations associated with an edge, node or graph. Here is an example (again in YAML):

  - id: "http://purl.obolibrary.org/obo/GO_0044464"
    meta:
      definition:
        val: "Any constituent part of a cell, the basic structural and functional\
          \ unit of all organisms."
        xrefs:
        - "GOC:jl"
      subsets:
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goantislim_grouping"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gosubset_prok"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goslim_pir"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gocheck_do_not_annotate"
      xrefs:
      - val: "NIF_Subcellular:sao628508602"
      synonyms:
      - pred: "hasExactSynonym"
        val: "cellular subcomponent"
        xrefs:
        - "NIF_Subcellular:sao628508602"
      - pred: "hasRelatedSynonym"
        val: "protoplast"
        xrefs:
        - "GOC:mah"
    type: "CLASS"
    lbl: "cell part"

Meta objects can also be attached to edges (corresponding to OWL axiom annotations), or at the level of a graph (corresponding to ontology annotations). Oh, but we avoid the term annotation, as that always trips up people not coming from a deep semweb/OWL background.

As can be seen, commonly used OBO annotation properties get their own top-level tag within a meta object, but other annotations go into a generic object.
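
Because synonyms sit under a fixed key, building a lexical index is straightforward. Here is a sketch in Python, assuming the node has already been loaded into a dict with the same keys as the YAML above; the lexical_index function and its synonym_preds parameter are hypothetical names.

```python
# A cut-down version of the "cell part" node shown above.
node = {
    "id": "http://purl.obolibrary.org/obo/GO_0044464",
    "lbl": "cell part",
    "meta": {
        "synonyms": [
            {"pred": "hasExactSynonym", "val": "cellular subcomponent"},
            {"pred": "hasRelatedSynonym", "val": "protoplast"},
        ]
    },
}


def lexical_index(nodes, synonym_preds=("hasExactSynonym",)):
    """Map lower-cased labels and selected synonyms to the set of node ids using them."""
    index = {}
    for n in nodes:
        terms = []
        if "lbl" in n:
            terms.append(n["lbl"])
        # Nodes without a meta object or synonyms are simply skipped here.
        for syn in n.get("meta", {}).get("synonyms", []):
            if syn["pred"] in synonym_preds:
                terms.append(syn["val"])
        for term in terms:
            index.setdefault(term.lower(), set()).add(n["id"])
    return index


index = lexical_index([node])
print(sorted(index))
```

By default only exact synonyms are indexed, so "protoplast" is left out; widening synonym_preds changes that.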

BOGs and ExOGs

What about the 1%? Additional fields can be used, turning the BOG into an ExOG (Expressive OBO graph).

Here is an example of a construct that is commonly used in OBOs, primarily used for the purposes of maintaining an ontology, but increasingly used for doing more advanced discovery-based inference:

Class: C
EquivalentTo: G1 and ... and Gn and (P1 some D1) and ... and (Pm some Dm)

Where all variables refer to named entities (C, Gi and Di are classes, Pi are Object Properties).

We translate to:

 nodes: ...
 edges: ...
 logicalDefinitionAxioms:
  - definedClassId: C
    genusIds: [G1, ..., Gn]
    restrictions:
    - propertyId: P1 
      fillerId: D1
    - ...
    - propertyId: Pm 
      fillerId: Dm

Note that the above transform is not expressive enough to capture all equivalence axioms. Again the idea is to have a simple construct for the common case, and fall-through to more generic constructs.
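
Going the other way is just as mechanical. This sketch renders a logical definition object back into the Manchester-like form above; the key names come from the YAML example, while the to_manchester helper is a hypothetical name, not something provided by the obographs code.

```python
def to_manchester(ldef):
    """Render a logicalDefinitionAxioms entry as a Manchester-style equivalence."""
    parts = list(ldef.get("genusIds", []))
    for restriction in ldef.get("restrictions", []):
        parts.append(
            "({} some {})".format(restriction["propertyId"], restriction["fillerId"])
        )
    return "Class: {}\nEquivalentTo: {}".format(
        ldef["definedClassId"], " and ".join(parts)
    )


ldef = {
    "definedClassId": "C",
    "genusIds": ["G1"],
    "restrictions": [{"propertyId": "P1", "fillerId": "D1"}],
}
print(to_manchester(ldef))
```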

Identifiers and URIs

Currently all the examples in the repo use complete URIs, but this is in progress. The idea is that the IDs commonly used in bioinformatics databases (e.g. GO:0008150) can be supported, but the mapping to URIs can be made formal and unambiguous through the use of an explicit JSON-LD context, and one or more default contexts. See the prefixcommons project for more on this. See also the prefixes section of the ROBOT docs.
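
The mapping a context provides can be pictured as a simple prefix table. The sketch below is an assumption about how a client might apply one, not the obographs or JSON-LD implementation; PREFIXES, expand and contract are illustrative names, and the two entries follow the usual OBO PURL pattern.

```python
# A tiny prefix table of the kind a JSON-LD context would supply.
PREFIXES = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}


def expand(curie, prefixes=PREFIXES):
    """Expand a CURIE such as GO:0008150 to a full URI; unknown prefixes pass through."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        return curie
    return prefixes[prefix] + local


def contract(uri, prefixes=PREFIXES):
    """Contract a URI to a CURIE, preferring the longest matching base."""
    for prefix, base in sorted(prefixes.items(), key=lambda kv: -len(kv[1])):
        if uri.startswith(base):
            return prefix + ":" + uri[len(base):]
    return uri


print(expand("GO:0008150"))
print(contract("http://purl.obolibrary.org/obo/GO_0044464"))
```

Keeping the table explicit is the point: the same document can be read with database-style IDs or with URIs, and the conversion is never guessed.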

Documentation and formal specification

There is as yet no formal specification. We are still exploring possible shapes for the serialization. However, the documentation and examples provided should be sufficient for developers to grok things fairly quickly, and for OWL folks to get a sense of where we are going. Here are some things potentially useful for now:

Tools

The GitHub repo also houses a reference implementation in Java, plus an OWL to JSON converter script (reverse is not yet implemented). The Java implementation can be used as an object model in its own right, but the main goal here is to make a serialization that is easy to use from any language.

Even without a dedicated API, operations are easy with most languages. For example, in python to create a mapping of ids to labels:

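import json

# Load an obographs JSON document
with open('foo.json', 'r') as f:
    gdoc = json.load(f)

# Map each node id to its label (not every node has one)
lmap = {}
for g in gdoc['graphs']:
    for n in g['nodes']:
        lmap[n['id']] = n.get('lbl')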

Admittedly this particular operation is relatively easy with rdflib, but other operations become more awkward (not to mention the disappointingly slow performance of rdflib).

There are a number of applications that already accept obographs. The central graph representation (the BOG) corresponds to a bbop-graph. This is the existential graph representation we have been using internally in GO and Monarch. The SciGraph API sends back bbop-graph objects as default.

Some additional new pieces of software supporting obographs:

  • noctua-reasoner – a javascript reasoner supporting a subset of OWL-RL, intended for client-side reasoning in browsers
  • obographviz – generation of dot files (and pngs etc) from obographs, allowing many of the same customizations as blipkit

Status

At this stage I am interested in comments from a wider community, both in the bioinformatics world, and in the semweb world.

Hopefully the former will find it useful, and will help wean people off of oboformat (to help this, ontology release tools like ROBOT and OWLTools already or will soon support obograph output, and we can include a json file for every OBO Library ontology as part of the central OBO build).

And hopefully the latter will not be offended too much by the need to add yet another format into the mix. It may even be useful to some parts of the OWL/semweb community outside bioinformatics.

Creating an ontology project, an update

In a previous post, I recommended some standard ways of managing the various portions of an ontology project using a version control system like GitHub.

Since writing that post, I’ve written a new utility that makes this task even easier. With the ontology-starter-kit you can generate all your project files and get set up for creating your first release in minutes. This script takes into account some changes since the original post two years ago:

  • Travis-CI has become the de-facto standard continuous integration system for performing unit tests on any project managed in GitHub (for more on CI see this post). The starter-kit will give you a default travis setup.
  • Managing your metadata and PURLs on the OBO Library has changed to a GitHub-based system:
  • ROBOT has emerged as a simpler way of managing many aspects of a release process, particularly managing your external imports

Getting started

To get started, clone or download cmungall/ontology-starter-kit

Currently, you will need:

  • perl
  • make
  • git (command line client)

For best results, you should also download owltools, oort and robot (in the future we’ll have a more unified system)

You can obtain all these by running the install script:

./INSTALL.sh

This should be run from within the ontology-starter-kit directory

Then, from within that directory, you can seed your ontology:

./seed-my-ontology-repo.pl  -d ro -d uberon -u obophenotype -t cnidaria-ontology cnido

This assumes that you are building some kind of extension to uberon, using the relation ontology (OBO Library ontology IDs must be used here), that you will be placing this in the https://github.com/obophenotype/ organization, that the repo name is obophenotype/cnidaria-ontology, and that IDs will be of the form CNIDO:nnnnnnn

After running, the repository will be created in the target/cnidaria-ontology folder, relative to where you are. You can move this out to somewhere more convenient.

The script is chatty, and it tells you how it is copying the template files from the template directory into the target directory. It will create your initial source setup, including a makefile, and then it will use that makefile to create an initial release, going so far as to init the git repo, add and commit files (unless overridden). It will not go as far as to create a repo for you on github, but it provides explicit instructions on what you should do next:


EXECUTING: git status
# On branch master
nothing to commit, working directory clean
NEXT STEPS:
0. Examine target/cnidaria-ontology and check it meets your expectations. If not blow it away and start again
1. Go to: https://github.com/new
2. The owner MUST be obophenotype. The Repository name MUST be cnidaria-ontology
3. Do not initialize with a README (you already have one)
4. Click Create
5. See the section under '…or push an existing repository from the command line'
E.g.:
cd target/cnidaria-ontology
git remote add origin git@github.com:obophenotype/cnido.git
git push -u origin master

Note that it also generates a metadata directory for you, with .md and .yml files you can use for your project on obolibrary (of course, you need to request your ontology ID space first, but you can go ahead and make a pull request with these files).

Future development

The overall system may no longer be necessary in the future, if we get a complete turnkey ontology release system with capabilities similar to analogous tools in software development such as maven.

For now, the Makefile approach is most flexible, and is widely understood by many software developers, but a long standing obstacle has been the difficulty in setting up the Makefile for a new project. The starter kit provides a band-aid here.

If required, it should be possible to set up alternate templates for different styles of project layouts. Pull requests on the starter-kit repository are welcome!

owljs – a javascript library for OWL hacking

owljs is a javascript library for doing stuff with OWL. It’s available from github:

https://github.com/cmungall/owljs

Whilst it attempts to follow CommonJS, you currently have to use RingoJS (a Rhino engine) as it makes use of JVM calls to the OWLAPI.

owl plus rhino equals fun

Why javascript?

Why javascript you may ask? Isn’t that a hacky language run in browsers? In fact javascript is increasingly used on the server side as well as in browsers, as can be seen in the success of node.js. With Java 8 incorporating Nashorn as a scripting engine, it looks like javascript on the server side is here to stay.

Why not just use java? Java can be very verbose and is not the ideal language for writing short ontology processing scripts, especially with the OWL API.

There are a number of other languages better suited to scripting and declarative programming in general, many of which run on the JVM. These include:

  • Groovy – a popular choice for interfacing with the OWL API
  • The Armed Bear flavor of Common Lisp, as used in LSW2.
  • Clojure, a variant of lisp, as used in Phil Lord’s powerful Tawny-OWL framework.
  • Scala, a superbly elegant functional programming language used to great effect in Jim Balhoff’s beautifully elegant scowl.
  • Iron Python – a popular choice for interfacing with the Brain. And of course, Python is the de facto language for bioinformatics these days

There are also offerings in non-JVM languages such as my own posh – in addition most languages provide some kind of RDF library, but this can often be low level for working in OWL.

I decided to write a javascript library for a number of reasons. Our group already produces a lot of javascript code, most of which can be run on the server. For example, the golr libraries used in the AmiGO 2 codebase are CommonJS, as are those used for the Monarch API. These APIs all access ontologies through services (and can thus be run on a non-JVM javascript engine like node), and we would not make these APIs depend on a JVM. However, the ability to go the other way is useful – in a powerful ontology processing environment that offers all the features of the OWL API, being able to access all kinds of bioinformatics data through ready-made javascript APIs.

Another reason is that JSON is ubiquitous, and having your data format be a subset of the language has some major advantages.

Plus, after an initial period of ambivalence, I have grown to really like javascript. It’s just functional enough to do some cool things.

What can you do with it?

I hope to provide some specific examples later on this blog. For now, take a look at the docs on github. Major features are:

Stay tuned for more information!

The perils of managing OWL in a version control system

Background

Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like github provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.

VCSs also integrate well with Continuous Integration systems.
For example, a CI system can be configured to run a series of checks and even publish, triggered by a git commit/push.

OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ascii-level
diffs were as human readable as possible.

OBO Format is a deprecated format – I recommend groups switch to using
one of the W3C concrete forms of OWL. However, this comes with one
caveat – if the source (editors) version of an ontology is switched
from obo to any other OWL serialization, then human-readable diffs are
lost. Additionally, the non-deterministic serialization of the
ontology results in spurious diffs that not only hamper
human-readability, but also cause bottlenecks in VCS. As an example,
releasing a version of the Uberon ontology can consume over an hour
simply performing SVN operations.

The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.

Methods

I measured the “diffability” of different OWL formats by taking a
mid-size ontology incorporating a wide range of OWL constructs
(Uberon) and measuring the
size of diffs between two ontology versions in relation to the change in
the number of axioms.

Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output by
the unix diff command (counted with “wc -l”).
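
A rough way to reproduce this kind of measurement in Python, assuming each serialization is available as text. This sketch uses difflib from the standard library and counts only added and removed lines, so the numbers will not match the output of diff piped to wc -l exactly. The axioms are toy examples, not taken from Uberon, and diff_size is a hypothetical helper name.

```python
import difflib


def diff_size(old_text, new_text):
    """Count added plus removed lines between two serialized ontology versions."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return sum(
        1
        for line in difflib.ndiff(old_lines, new_lines)
        if line.startswith(("+ ", "- "))
    )


v1 = "SubClassOf(:A :B)\nSubClassOf(:B :C)\nSubClassOf(:C :D)\n"

# One axiom removed, order otherwise preserved: the diff is one line.
v2_stable = "SubClassOf(:A :B)\nSubClassOf(:C :D)\n"

# Same axioms as v1, but written out in a different order:
# nothing changed logically, yet the diff is non-empty.
v1_reordered = "SubClassOf(:C :D)\nSubClassOf(:B :C)\nSubClassOf(:A :B)\n"

print(diff_size(v1, v2_stable))
print(diff_size(v1, v1_reordered))
```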

This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, .. up to
2^16. This was repeated ten times.

The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.

Results

The results can be downloaded HERE, and are plotted in the following
figure.

Plot showing size of diffs in relation to number of axioms added/removed

As can be seen there is a marked difference between the two RDF
formats (RDF/XML and Turtle) and the dedicated OWL serializations
(Manchester and Functional), which have roughly similar diffability to
OBO format.

In fact the diff size for RDF formats is both constant and large
regardless of the number of axioms changed. This appears to be due to
non-determinism when serializing axiom annotations.

This analysis only considers a single ontology, and a single version of the OWL API.

Discussion and Conclusions

Based on these results, it would appear to be a huge mistake to ever
manage an RDF serialization of OWL in a VCS. Using Manchester or
Functional gives superior diffability, with the size of the diff
proportional to the number of axioms changed. OBO format offers human
readability of diffs as well, but this format is limited in
expressivity.

These recommendations are consistent with the size of the file in each format.

The following numbers are for Uberon:

  • obo 11M
  • omn 28M
  • ofn 37M
  • owl 53M
  • ttl 58M

However, one issue here is that RDF-level tools may not accept a
dedicated OWL serialization such as ofn or omn. Most RDF libraries
will, however, accept RDF/XML or Turtle.

The ontology manager is then faced with a quandary – cut themselves
off from a segment of the semantic web and have diffs that are
manageable (if not readable) or live with enormous spurious diffs for
the benefits of SW integration.

The best solution would appear to be to manage source versions in a
diffable format, and release in a more voluminous RDF/semweb
format. This is not so different from software management – the users
consume a compiled version of the software (jars, object files, etc)
and the software is maintained as diffable source. It’s generally
considered bad practice to check in derived products into a VCS.

However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.

The approach I would recommend is therefore to adapt the RDF/XML
generator of the OWL API such that it is deterministic, or to write an
RDF roundtripper that always produces a deterministic
serialization. This should be coupled with ongoing efforts to add
human-readable class labels as comments to enhance readability of diffs.
Ideally the recommended deterministic serialization order would be formally
specified, such that different software (and different versions of the same
software) could adhere to it.
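
The core idea of a deterministic serialization can be sketched in a few lines of Python: impose a fixed order before writing. This is an illustration only, not how the OWL API works; the serialize helper and the toy axioms are assumptions, and a real ordering would need to be specified far more carefully than a plain string sort.

```python
import random


def serialize(axioms, deterministic=True):
    """Write one axiom per line, optionally in a fixed (sorted) order."""
    lines = list(axioms)
    if deterministic:
        lines.sort()
    return "\n".join(lines) + "\n"


axioms = ["SubClassOf(:B :C)", "SubClassOf(:A :B)", "SubClassOf(:C :D)"]

# Simulate a tool that hands back the same axioms in a different order.
shuffled = axioms[:]
random.Random(0).shuffle(shuffled)

# With a fixed ordering the two outputs are byte-identical, so the diff is empty.
print(serialize(axioms) == serialize(shuffled))
```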

At the same time, we need to be working on analogs of maven and
package management systems in the ontology world.

Footnote:

Some ongoing efforts to mavenize ontologies:

Updates:

Creating an ontology project

UPDATE: see the latest post on this subject.

This article describes how to manage all the various files created as part of an ontology project. This assumes some familiarity with the unix command line. The article does not describe how to create the actual content for your ontology. For this I recommend the material from the NESCENT course on building anatomy ontologies, put together by Melissa Haendel and Matt Yoder, as well as the OBO Foundry tutorial from ICBO 2013.

Some of the material here overlaps with the page on making a google code project for an ontology.

Let’s make a jelly project

For our purposes here, let’s say you’re a Cnidaria biologist and you’re all ready to start building an ontology that describes the anatomy and traits of this phylum. How do you go about doing this?

Portuguese man-of-war (Physalia physalis)

Well, you could just fire up Protege and start building the thing, keeping the owl file on your desktop, periodically copying the file to a website somewhere. But this isn’t ideal. How will you track your edits? How will you manage releases? What about imports from other ontologies (you do intend to import parts of other ontologies, don’t you? If the answer is “no” go back and read the course material above!).

It’s much better to start off on the right foot, keeping all your files organized according to a common directory layout, and making use of simple and sensible practices from software engineering.

OORT

As part of your release process you’ll make use of OWLTools and OORT, which can be obtained from the OWLTools google code repository.

Make sure you have OWLTools-Oort/bin/ in your PATH

The first thing to do is to create a directory on your machine for managing all your files – as an ontology developer you will be managing a collection of files, not just the core ontology file itself. We’ll call this directory the “ontology project”.

To make things easy, owltools comes with a handy script called create-ontology-project to create a stub ontology project. This script is distributed with OWLTools and is also available for download here:

http://owltools.googlecode.com/svn/trunk/OWLTools-Oort/bin/create-ontology-project

The first thing to do is select your ontology ID (namespace). This *must* be the same as the ID space you intend to use. So if your URIs/IDs are to be CNIDO_0000001 and so on, the ontology ID *must* be “cnido”. Note that whilst your IDs will be in SHOUTY CAPITALS, the actual ontology ID itself is all ~~gentle lowercase~~, even the first letter. This is actually part of OBO Foundry ID policy.
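
The relationship between the lowercase ontology ID and the uppercase class IDs can be sketched as follows. The class_id and class_uri helpers are hypothetical names; the seven-digit padding follows the CNIDO_0000001 example, and the base URL is the usual OBO PURL prefix.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def class_id(ontology_id, number, width=7):
    """Build a class ID such as CNIDO_0000001 from a lowercase ontology ID."""
    if ontology_id != ontology_id.lower():
        raise ValueError("ontology ID should be all lowercase: %r" % ontology_id)
    return f"{ontology_id.upper()}_{number:0{width}d}"


def class_uri(ontology_id, number):
    """Build the full PURL for a class in the given ontology."""
    return OBO_BASE + class_id(ontology_id, number)


print(class_id("cnido", 1))
print(class_uri("cnido", 1))
```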

Running the script

Now, type this on the command line:


create-ontology-project cnido

(you will need to add this to your path – it’s in the OWLTools-Oort/bin directory).

You will see the following output:


SUCCESS!
Directory Listing:
-----------------
cnido
cnido/doc
cnido/doc/README.txt
cnido/images
cnido/images/README.txt
cnido/LICENSE.txt
cnido/README.txt
cnido/src
cnido/src/ontology
cnido/src/ontology/catalog-v001.xml
cnido/src/ontology/CHANGES
cnido/src/ontology/diffs
cnido/src/ontology/diffs/Makefile
cnido/src/ontology/imports
cnido/src/ontology/imports/README.txt
cnido/src/ontology/cnido-edit.owl
cnido/src/ontology/cnido-idranges.owl
cnido/src/ontology/Makefile
cnido/tools
cnido/tools/README.txt

followed by:

What now?
* Create a git or svn project. E.g.:
cd cnido
git init
git add -A
git commit -m 'initial commit'
* Now visit github.com and create project cnido-ontology
* Edit ontology and create first release
cd cnido/src/ontology
make initial-build
* Create a jenkins job

What next?

You may not need all the stub files that are created, but it’s a good idea to have them there from the outset, as you may need them in future.

I recommend your first step is to follow the instructions above to (1) initiate a local git repository by typing the 4 commands above and (2) publish this on github. You will need to go to github.com, create an account, create a project, and select “create a project from an existing repository” (more on this later).

Once this is done, you can start modifying the stub files.

The top level README provides a high level overview of your project. You should describe the content and use cases here. You can edit this in a normal text editor; alternatively, if you intend to use github (recommended), you can wait until you commit and push this file and then edit it via the github web interface.

You will probably spend most of your time in the src/ontology directory, editing cnido-edit.owl.

id-ranges

If you intend to have multiple people editing, then the cnido-idranges.owl file will be essential. You can edit this directly in Protege (but it may actually be easier to edit the file in a text editor). Assign each editor an ID range here (just follow the existing file as an example). Note that currently Protege does not read this file, so this just serves as formal documentation.

In future, id-ranges may be superseded by urigen servers, but for now they provide a useful way of avoiding collisions.
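As a rough illustration of what the ranges buy you, here is a minimal Python sketch that checks a set of editor ranges for overlaps and formats an ID from a number. The editor names, the range values and the 7-digit width are assumptions made up for the example; the sketch does not read cnido-idranges.owl.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (start, end)

# Made-up editors and ranges, for illustration only.
EDITOR_RANGES: Dict[str, Range] = {
    "alice": (1, 999),
    "bob": (1000, 1999),
    "carol": (1500, 2499),  # deliberately collides with bob
}


def find_overlaps(ranges: Dict[str, Range]) -> List[Tuple[str, str]]:
    """Return pairs of editors whose inclusive ID ranges overlap."""
    ordered = sorted(ranges.items(), key=lambda item: item[1][0])
    overlaps = []
    for i, (name_a, (_, end_a)) in enumerate(ordered):
        for name_b, (start_b, _) in ordered[i + 1:]:
            # Ranges are sorted by start, so b overlaps a iff b starts before a ends.
            if start_b <= end_a:
                overlaps.append((name_a, name_b))
    return overlaps


def format_id(prefix: str, number: int, width: int = 7) -> str:
    """Format a numeric ID as PREFIX_0000001 (the width is an assumption)."""
    return f"{prefix}_{number:0{width}d}"


print(find_overlaps(EDITOR_RANGES))
print(format_id("CNIDO", EDITOR_RANGES["alice"][0]))
```

A check like this could run before each release, so that a mistyped range is caught before two editors mint the same ID.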

Documentation

If you use github or another hosting solution like Google Code, you can use their wiki system. You should keep any files associated with the documentation (Word docs, presentations, etc) in the doc/ folder. You can link to them directly from the wiki.

Images

You can ask the OBO admins to help you set up a purl redirect such that URLs of the form

http://purl.obolibrary.org/obo/cnido/images/

will redirect to your images/ directory, which is where you will place any pictures of jellyfish or anemone body parts that you want to be associated with classes in the ontology (assuming you have the rights to do this). I recommend Jim Balhoff’s depictions plugin.

Imports

Managing your imports can be a difficult task and deserves its own article.

For now you can browse the ctenophore-ontology project to see an example of a setup, in particular:

This setup uses the OWLAPI for imports, but others prefer to make use of OntoFox.

Releases

You can use OORT to create your release files. The auto-generated Makefile stub should be sufficient to manage a basic release pipeline. In the src/ontology directory, type this:


make all

This should make both cnido.obo and cnido.owl – these will be the files the rest of the world sees. cnido-edit is primarily seen by you and your fellow cnidarian-obsessed editors.

Caveats

Depending on the specific needs of your project, some of the defaults and stubs provided by the create-ontology-project script may not be ideal for you. Or you may simply prefer to create the directory structure manually, which is not very hard and is of course fine. The script is provided primarily to help you get started; hopefully it will prove useful.

Finally, if you know any cnidarian biologists interested in contributing to an ontology, let me know as we are lacking detailed coverage in existing ontologies!

 

GO annotation origami: Folding and unfolding class expressions

With the introduction of Gene Association Format (GAF) v2, curators are no longer restricted to pre-composed GO terms – they can use a limited form of anonymous OWL Class Expressions of the form:

GO_Class AND (Rel_1 some V_1) AND (Rel_2 some V_2)

The set of relationships is specified in column 16 of the GAF file.

However, many tools are not capable of using class expressions – they discard the additional information leaving only the pre-composed GO_Class.

Using OWLTools it is possible to translate a GAF-v2 set of associations and an ontology O to an equivalent GAF-v1 set of associations plus an analysis ontology O-ext. The analysis ontology O-ext contains the set of anonymous class expressions folded into named classes, together with equivalence axioms, and pre-reasoned into a hierarchy using Elk.

See http://code.google.com/p/owltools/wiki/AnnotationExtensionFolding

For example, given a GO annotation of a gene ‘geneA’:

gene: geneA
annotation_class:  GO:0006915 ! apoptosis
annotation_extension: occurs_in(CL:0000700) ! dopaminergic neuron

The folding process will generate a class with a non-stable URI, automatic label and equivalence axiom:

Class: GO/TEMP_nnnn
  Annotations: label "apoptosis and occurs_in some dopaminergic neuron"
  EquivalentTo: 'apoptosis' and occurs_in some 'dopaminergic neuron'
  SubClassOf: 'neuron apoptosis'

This class will automatically be placed in the hierarchy using the reasoner (e.g. under ‘neuron apoptosis’). For the reasoning step to achieve optimal results, the go-plus-dev.owl version should be used (see new GO documentation). A variant of this step is to perform folding to find a more specific subclass than the one used for direct annotation.
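The label and equivalence axiom in the example above can be reproduced with a minimal Python sketch. This is not how OWLTools implements folding: the LABELS table, the function names and the assumption that column 16 is a comma-separated list of relation(ID) entries are all illustrative, and the reasoning step that places the new class in the hierarchy is not shown.

```python
import re
from typing import Dict, List, Tuple

# Illustrative labels only; a real pipeline would read them from the ontology.
LABELS: Dict[str, str] = {
    "GO:0006915": "apoptosis",
    "CL:0000700": "dopaminergic neuron",
}

# Assumed shape of one column-16 entry: relation(ID)
_EXTENSION = re.compile(r"^\s*(\w+)\(([^()]+)\)\s*$")


def parse_extension(column16: str) -> List[Tuple[str, str]]:
    """Parse 'rel(ID),rel(ID)' into (relation, filler ID) pairs."""
    pairs = []
    for part in column16.split(","):
        match = _EXTENSION.match(part)
        if match is None:
            raise ValueError(f"unparseable extension: {part!r}")
        pairs.append((match.group(1), match.group(2).strip()))
    return pairs


def fold(go_class: str, column16: str) -> Dict[str, str]:
    """Build the automatic label and Manchester-style expression for a folded class."""
    pairs = parse_extension(column16)
    genus = LABELS[go_class]
    label_parts = [genus] + [f"{rel} some {LABELS[filler]}" for rel, filler in pairs]
    expr_parts = [f"'{genus}'"] + [f"{rel} some '{LABELS[filler]}'" for rel, filler in pairs]
    return {
        "label": " and ".join(label_parts),
        "equivalent_to": " and ".join(expr_parts),
    }


folded = fold("GO:0006915", "occurs_in(CL:0000700)")
print(folded["label"])
print(folded["equivalent_to"])
```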

The reverse operation – unfolding – is also possible. For optimal results, this relies on Equivalent Classes axioms declared in the ontology, so make sure to use the go-plus-dev.owl. Here an annotation to a pre-composed complex term (e.g. neuron apoptosis) is replaced by an annotation to a simpler GO term (e.g. apoptosis) with column 16 filled in (e.g. occurs_in(neuron)).
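Unfolding can be sketched the same way. The EQUIVALENCES table below is a hand-written stand-in for the equivalence axioms a real run would take from the ontology, and it uses labels rather than IDs to keep the example readable; the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Stand-in for equivalence axioms: pre-composed term -> (genus, [(relation, filler)]).
# Hand-written for illustration; a real run would read these from the ontology.
EQUIVALENCES: Dict[str, Tuple[str, List[Tuple[str, str]]]] = {
    "neuron apoptosis": ("apoptosis", [("occurs_in", "neuron")]),
}


def unfold(term: str) -> Tuple[str, str]:
    """Return (simpler term, column-16 string).

    Terms without an equivalence axiom are returned unchanged with an empty column 16.
    """
    if term not in EQUIVALENCES:
        return term, ""
    genus, differentiae = EQUIVALENCES[term]
    column16 = ",".join(f"{rel}({filler})" for rel, filler in differentiae)
    return genus, column16


print(unfold("neuron apoptosis"))  # prints: ('apoptosis', 'occurs_in(neuron)')
```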

The folding operation allows legacy tools to take some advantage of GO annotation extensions by generating an ‘analysis ontology’ (care must be taken in how this is presented to the user, if at all). Ideally more tools will use OWL as the underlying ontology model and be able to handle c16 annotations directly, ultimately requiring less pre-coordination in the GO.

 

Perl library for OWL hacking

I would recommend using a JVM language plus the OWL API for doing programmatic processing of OWL.

NOT perl.

If you really insist on perl, and you don’t mind insane magical AUTOLOAD-heavy modules with no documentation:

https://github.com/cmungall/owlhack

Unlike many modules, this doesn’t attempt to map some RDF monster into OWL axioms. It takes in a very simple JSON format and provides a very slim layer on top of that. Unfortunately there isn’t a standard JSON for OWL, so owlhack uses a custom translation as provided by OWLTools. This is a very generic axiom-oriented lispy rendering of OWL functional syntax.
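To give a feel for why an axiom-oriented JSON rendering is easy to work with, here is a small sketch (in Python rather than perl). The JSON shape below is an assumption invented for illustration: each axiom is a list whose first element is the axiom type, with nested lists for class expressions. The actual format emitted by OWLTools and consumed by owlhack may differ, and the function names are hypothetical.

```python
import json
from typing import Any, Iterator, List, Tuple

# ASSUMED shape, for illustration only; the real OWLTools/owlhack JSON may differ.
DOC = json.loads("""
[
  ["SubClassOf", "tentacle", ["ObjectSomeValuesFrom", "part_of", "polyp"]],
  ["SubClassOf", "polyp", "body"],
  ["AnnotationAssertion", "rdfs:label", "polyp", "polyp"]
]
""")


def axioms_of_type(axioms: List[List[Any]], axiom_type: str) -> Iterator[List[Any]]:
    """Yield the arguments of every axiom whose head matches axiom_type."""
    for axiom in axioms:
        if axiom and axiom[0] == axiom_type:
            yield axiom[1:]


def named_subclass_edges(axioms: List[List[Any]]) -> List[Tuple[str, str]]:
    """Return (sub, super) pairs where both sides are named classes."""
    edges = []
    for args in axioms_of_type(axioms, "SubClassOf"):
        if len(args) == 2 and all(isinstance(arg, str) for arg in args):
            edges.append((args[0], args[1]))
    return edges


print(named_subclass_edges(DOC))  # prints: [('polyp', 'body')]
```

Because every axiom is just a nested list, constructs such as “X SubClassOf part_of some Y” need no blank-node handling: the existential restriction is simply the third element of the axiom.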

Currently I’m using this module for tasks such as generating ad-hoc chunks of markdown derived from the ontology. The resulting md can then be pasted into github tracker postings, or used to generate html.

There’s also a “sed” script that comes with the library that’s useful for performing perl “s/” operations on annotation values.

It’s all a bit hacky, kind of an OWL replacement for https://github.com/cmungall/obo-scripts

Caveat emptor!