A lightweight ontology registry system
August 27, 2015
For a number of years, I have been one of the maintainers of the registry that underpins the list of ontologies at the Open Biological Ontologies Foundry/Library (http://obofoundry.org). I also built some of the infrastructure that creates nightly builds of each ontology, verifying it and providing versions in both obo format and owl.
The original system grew organically and was driven by an ultra-simple file called “ontologies.txt”, stored on Google Code. This grew to be supplemented by a collection of other files for maintaining the list of issue trackers, together with additional metadata to maintain the central OBO builds. The imminent demise of Google Code and the general creakiness and inflexibility of the old system have prompted the search for a new solution. I wanted something that would make it much easier for ontology providers to update their information, but at the same time allow the central OBO group to vet and correct entries. We needed something more sophisticated than a flat key-value list, yet not overly complex. We also wanted something compatible with semantic web standards (i.e. to have an RDF file with a description of every ontology in it, using standard vocabularies and ontologies for the properties and classes). We also wanted it to look a bit nicer than the old site, which was looking decidedly 2000-and-late.
The legacy OBOFoundry site, looking dated and missing key information
What are some of the options here?
- A centralized wiki, with a page for each ontology, and groups updating their entry on the wiki
- Each group embeds the metadata about the ontology in a website they maintain. This is then periodically harvested by the central registry. Options for embedding the metadata include microdata and RDFa
- Each group maintains their own metadata in or alongside their ontology in rdf/owl, and this is periodically harvested
- Piggy back off of an existing registry, e.g. BioPortal
- A bespoke registry system, designed from the ground up, with its own relational database underpinning it, etc
These are all good solutions in the appropriate context, but none fitted our requirements precisely. Wikis are best for unstructured or loosely structured narrative text, but attempts to embed structured information inside wikis have been less than satisfactory. The microdata/RDFa approach is interesting, but not practical for us. Microdata is inherently limited in terms of extensibility, and RDFa is complex for many users. Additionally it requires both that groups produce their own web sites (many rely on the OBO Foundry to do this for them), and that we both harvest the metadata and relinquish control. As mentioned previously, it is useful for the OBO repository administrators to have certain fields be filled in centrally (sometimes for policy reasons, sometimes technical). The same concerns apply to the fully decentralized approach, in which every group maintains the metadata directly as part of the ontology, and we harvest this.
Existing registries are built for their own requirements. A bespoke registry system is attractive in many ways, as this can be highly customized, but this can be expensive and we lacked the resources for this.
Solution: GitHub pages and “YAML-LD”
I initially prototyped a solution making use of the GitHub pages framework, driven by YAML files. This can be considered a kind of bespoke system, contradicting what I said above. But rather than roll the entire framework, the system is really just some templates glueing together some existing systems. GitHub support for social coding and YAML helped a lot. The system was very quick to develop and it soon morphed into the actual system to replace the old OBO site.
YAML is a markup language that superficially resembles the tag-value stanza format we were previously using, but crucially allows for nesting. Here is an example of a snippet of YAML for a cephalopod ontology:
    id: ceph
    title: Cephalopod Ontology
    contact:
      email: firstname.lastname@example.org
      label: Chris Mungall
    description: An anatomical and developmental ontology for cephalopods
    taxon:
      id: NCBITaxon:6605
      label: Cephalopod
Note that certain tags have ‘objects’ as their fields, e.g. contact and taxon.
We stick to the subset of YAML that can be represented in JSON, and we can thus define a JSON-LD context, allowing for a direct translation to RDF, which is nice. This part is still being finalized, but the basic idea is that keys like ‘title’ will be mapped to dc:title, and the taxon CURIE will be expanded to the full PURL for cephalopoda.
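To make the mapping idea concrete, here is a minimal Python sketch of the two translations mentioned above: renaming keys to property IRIs and expanding a CURIE to a PURL. The CONTEXT and PREFIXES tables and both function names are my own assumptions for illustration, not the finalized OBO context, and a real conversion would go through a proper JSON-LD processor rather than hand-written code.

```python
# Hypothetical mapping tables; the real context was still being finalized.
CONTEXT = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "description": "http://purl.org/dc/elements/1.1/description",
}
PREFIXES = {
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'PREFIX:LOCAL' to a full IRI; return the input if the prefix is unknown."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefixes:
        return curie
    return prefixes[prefix] + local_id


def expand_keys(record, context=CONTEXT):
    """Rename top-level keys to property IRIs where the context defines one."""
    return {context.get(key, key): value for key, value in record.items()}


print(expand_curie("NCBITaxon:6605"))
print(expand_keys({"id": "ceph", "title": "Cephalopod Ontology"}))
```

Keys without an entry in the table (such as `id` here) are passed through unchanged, which keeps the sketch forgiving while the set of mapped properties is still in flux.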
The basic idea is to manage each ontology's metadata as a separate YAML file in a GitHub repository. GitHub features nice builtin YAML rendering, and files can be edited via the GitHub web interface, which is YAML-aware.
The list of metadata files is here. Note that these are markdown files (the .md stands for markdown, not metadata). YAML can actually be embedded in Markdown, so each file is a mini-webpage for the ontology with the metadata embedded right in there. This is in some ways similar to the microdata/RDFa approach but IMHO much more elegant.
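As a rough illustration of how the embedded metadata can be separated from the page text, the Python sketch below splits a file into its YAML block and its Markdown body. It assumes the YAML block sits at the top of the file between two `---` lines, as in Jekyll front matter; the function name and the example document are hypothetical. The returned YAML text would then be handed to a YAML parser.

```python
def split_front_matter(text):
    """Return (yaml_text, body) for a Markdown document with a leading YAML block.

    Assumes the block is delimited by '---' lines (Jekyll-style front matter).
    If no complete block is found, the YAML part is empty and the body is the input.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            yaml_text = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return yaml_text, body
    return "", text  # opening delimiter without a closing one


# Hypothetical example file content.
EXAMPLE = """---
id: ceph
title: Cephalopod Ontology
---
Free-form *Markdown* describing the ontology.
"""

yaml_text, body = split_front_matter(EXAMPLE)
print(yaml_text)
print(body)
```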
Each markdown file is rendered attractively through the GitHub interface – for example, here is the md file for the environment ontology, rendered using the builtin GitHub md renderer. Note the yaml block contains structured data and the rest of the file can contain any mixture of markdown and HTML which is rendered on the page. We can do better than this using GitHub pages. Using a simple static site generator and templating system (Jekyll/liquid) we can render each page using our own CSS with our own format. For example here is ENVO again, but rendered using Jekyll. Note that we aren’t even running our own webserver here, this is all a service provided for us, in keeping with our desire to keep things lightweight and resource-light.
The entire system consists of a few HTML templates plus a single python script that derives an uber-metadata file that powers the central table (visible on the front page).
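The aggregation step is conceptually simple. The following Python sketch is not the actual script; it only shows the general shape under my own assumptions: per-ontology records (already parsed into dictionaries) are combined into one document, sorted by id, and serialized as JSON. The `build_registry` name, the `ontologies` key and the sample records are hypothetical.

```python
import json


def build_registry(records):
    """Combine per-ontology metadata dicts into one document, sorted by id."""
    ontologies = sorted(records, key=lambda record: record["id"])
    return {"ontologies": ontologies}


# Illustrative records standing in for the parsed metadata files.
records = [
    {"id": "envo", "title": "Environment Ontology"},
    {"id": "ceph", "title": "Cephalopod Ontology"},
]

registry = build_registry(records)
print(json.dumps(registry, indent=2))
```

Sorting by id keeps the generated file stable from one run to the next, so that diffs in version control reflect real metadata changes rather than file ordering.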
Where the system really shines is the distributed and social editing model. All of this comes for free when hosted on GitHub (in theory GitLab or some other sites should work). Anyone can come along and fork the OBOFoundry.github.io github repository into their own userspace and make edits – they can even do this without leaving their web browser (see the Edit button on the bottom left of every OBO ontology page).
What’s to stop some vandal trashing the registry? Crucially, any edits made by a non-owner remain in their own fork until they issue a Pull Request. After that, someone from OBO will come along and either merge in the pull request, or close it (giving a reason why they did not merge, of course). The version control system maintains a full audit trail of this, premature merges can be rolled back, etc.
The task of the OBO team is made easier thanks to Travis-CI, a Continuous Integration system integrated into GitHub. I configured the OBOFoundry github site with a Travis configuration file that instructs Travis to check every pushed commit using an automated test suite – this ensures that people editing their yaml files don’t make syntax errors, or omit crucial metadata.
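The kind of check such a test suite might run can be sketched in a few lines of Python. This is not the real OBO test suite: the list of required fields and the rule about the contact entry are assumptions chosen to match the cephalopod example, and the function name is hypothetical.

```python
# Assumed for illustration; not the actual OBO metadata policy.
REQUIRED_FIELDS = ("id", "title", "contact")


def validate(record):
    """Return a list of human-readable problems found in one metadata record."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    contact = record.get("contact")
    if contact and not (isinstance(contact, dict) and contact.get("email")):
        errors.append("contact must be a mapping with an email")
    return errors


good = {
    "id": "ceph",
    "title": "Cephalopod Ontology",
    "contact": {"email": "firstname.lastname@example.org", "label": "Chris Mungall"},
}
print(validate(good))
print(validate({"id": "ceph"}))
```

In a CI setting, a wrapper script would run this over every metadata file and exit with a non-zero status if any list of errors is non-empty, which is what causes the check on the pull request to fail.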
Screenshot of GitHub pull request, showing a passed Travis check
I have previously written about the use of Continuous Integration in ontology development – although CI was developed primarily for software engineering products, it works surprisingly well for ontologies and for metadata. This is perhaps not surprising if we consider these to be engineered artefacts, in the same way that software is.
The whole end-to-end process is documented in this FAQ entry on the site.
The system has been working extremely well and is popular among the ontology groups that contribute their expertise to OBO – before official launch of the new site, we had 31 closed pull requests. Whereas previously a member of the OBO team would have to coordinate with the ontology provider to enter the metadata (a time consuming process prone to errors and backlogs), now the provider has the ability to enter information themselves, with the benefit of validation from Travis and the OBO team.
The new site has many other improvements over the last one. It’s now possible to distinguish between the ontology sensu the umbrella entity vs individual ontology products or editions. For example, the various editions of Uberon (basic, core, composite metazoan) can each be individually registered and validated. There are also a growing number of properties that can be associated with the ontology, from a twitter handle to logos to custom browsers. Hopefully some of these features will be useful to the OBO community. Of course, the overall look could still be massively improved easily by someone with some web design chops (it’s very bland generic bootstrap at the moment). But this isn’t really the point of this post, which is more about the application of a certain set of technologies to allow a balance between centralization and distributed editing that suits the needs of the OBO Foundry. Leveraging existing services like GitHub pages, Travis and the GitHub fork-and-pull-request model allows us to get more mileage for less effort.
The future of metadata
The new OBO site was inspired in many ways by the system developed by my colleague Jorrit Poelen for the Global Biotic Interactions database (GloBI), in which simple JSON metadata files describing each interaction dataset are provided in individual GitHub repositories. A central system periodically harvests these into a large searchable index, where different datasets are integrated. This is not so different from common practice among software developers, who provide metadata for their project in the form of pom.xml files and package.json files (not out of their love of metadata, but more because this provides a useful service or is necessary for working in a wider ecosystem, and integrating with other software components). As James Malone points out, it makes far more sense to simply pull this existing metadata rather than force developers to register in a monolithic rigid centralized registry. If there are incentives for providers of any kind of information artefacts (software, ontologies, datasets) to provide richer metadata at source in large already-existing open repositories such as GitHub then it does away with the need to build separately funded large monolithic registries. The new OBO system and the GloBI approach are demonstrating some of these incentives for ontologies and datasets. The current OBO system still has a large centralized aspect, due in part to the nature of the OBO Foundry, but in future may become more distributed.