OntoTip: Clearly document your design decisions

This is one post in a series of tips on ontology development, see the parent post for more details.

When building a bio-ontology, we frequently make design decisions regarding how we choose to model a particular aspect of the domain. Developing ontologies is not simply a matter of collecting terms and textual definitions, nor is it a matter of recording observations. It involves modeling decisions that reflect how we want to slice and dice the different generalizations of biological phenomena. These modeling decisions frequently involve trade-offs between different use cases and other factors such as the complexity of the ontology. Sometimes these modeling decisions are made by individual ontology editors; sometimes they are made by a larger group, such as a committee or a content meeting combining domain experts and ontologists. Making these design decisions transparent is really important for making your ontology more usable and more sustainable.

Model of a generalized eukaryotic cell. Bio-ontologists build models of biological entities and phenomena, such as the internal structure and components of a cell. This is guided both by the underlying ground-truth reality, and design decisions about how to carve up different parts, where to draw boundaries, and how best to generalize over variation in nature.

A note for people reading this from outside the OBO community: bio-ontologies are frequently large, involving thousands or tens of thousands of classes, with varying levels of axiomatization. They are commonly used by biocurators for annotation of biological data, and entities such as genes or genomic features. Here ‘annotation’ means creating some kind of association between an entity of interest and an ontology class, typically with provenance and evidence information. These ontologies are usually built by biocurators with biology backgrounds, with broad knowledge of a particular domain. Some of the concerns may be different compared to some of the more ‘data model’ oriented ontologies found outside the life sciences and biomedicine, but some of the concerns may be the same.

Some examples of design decisions:

  • For a biological process or pathway, deciding on the starts and ends of a process (which constrains what the parts of a process can be). For example, does a signaling pathway start with the binding between a receptor activity and a ligand? Does it end with the activity of a transcription factor?
  • For brain regions, how should we draw the boundaries? E.g. does the hippocampus include the dentate gyrus? Do we include different classes to accommodate different groups’ boundary preferences (thus introducing complexity in both nomenclature and ontology structure) or choose to follow a particular standard or preferences of an individual group or researcher (potentially alienating or limiting the applicability of your ontology to these other groups)?
  • How do we represent the relationship between the PNS and the CNS? Do we allow overlap, or do we model as spatially disjoint? There are distinct consequences of each choice that may not be clear from the outset.
  • How should we represent structures such as a vertebra, which can exist in both cartilage form and bony form? (with variation potentially on an ontogenic/developmental axis, and potentially on a phylogenetic axis, e.g. sharks have cartilaginous skeletons). If we bake in assumptions drawn from fully formed humans (i.e. that vertebra is a subClassOf bone), this limits applicability to either developmental biology use cases, or comparative anatomy. In Uberon, we have an endochondral element design pattern, with a triad of structures: the composition-agnostic superclass, and bony and cartilaginous subclasses. This ensures maximum applicability of the ontology, with annotators choosing the subclass that is appropriate to their organism/time stage. However, it comes at some cost of nomenclature complexity, inflation of classes, and potential for annotators to accidentally select the wrong class.
  • How should the different subtypes of skeletal tissue be modeled, where divisions can be along a continuum rather than discrete groups? How should the different skeletal elements be related to the tissue that composes them? Should we have distinct classes for ‘bone tissue’ and ‘bone element’?
  • How should environmental processes such as deforestation be linked to environmental physical entities such as forests? What relations should connect these, and what should the logical axioms for both look like?
  • How do we handle chemical entities such as citric acid and citrate which are formally chemically distinct, yet may be interchangeable from a biological perspective? See Hill et al.
  • Which upper ontology classes should be used (if any)? In order to represent the lumen of a subcellular organelle, do we model this as an immaterial entity (thus forcing this class to be in a different subclass hierarchy from the other parts, such as the membrane), or in the same material entity hierarchy? Some ontologies such as OGMS and IDO make use of a lot of different BFO classes, other disease ontologies use fewer (note there will be another post on this topic…)
Example of the endochondral pattern in Uberon. The vertebra exists in up to 3 states: pre-cartilage, cartilage, and bone (with the latter absent in cartilaginous fish). A generic “vertebral element” class captures the vertebra in a composition-agnostic grouping. Subclasses are defined using OWL equivalence axioms, e.g. ‘vertebra cartilage element’ = ‘vertebral element’ and ‘composed primarily of’ some ‘cartilage tissue’.
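
To make the triad concrete, here is a minimal Python sketch that writes out equivalence axioms of the kind described in the caption as plain text in a Manchester-like syntax. Only the cartilage labels come from the caption; the bony subclass label, the ‘bone tissue’ filler and the function names are assumptions for illustration, and none of this is Uberon tooling.

```python
from typing import Dict, List


def composition_subclass(element: str, label: str, tissue: str) -> str:
    """Return one equivalence axiom as Manchester-like plain text."""
    return (
        f"'{label}' EquivalentTo: "
        f"'{element}' and 'composed primarily of' some '{tissue}'"
    )


def endochondral_axioms(element: str, subclasses: Dict[str, str]) -> List[str]:
    """Build one axiom per composition-specific subclass of the generic element."""
    return [
        composition_subclass(element, label, tissue)
        for label, tissue in subclasses.items()
    ]


axioms = endochondral_axioms(
    "vertebral element",
    {
        # Labels taken from the figure caption.
        "vertebra cartilage element": "cartilage tissue",
        # Assumed label and filler, for illustration only.
        "vertebra bony element": "bone tissue",
    },
)

for axiom in axioms:
    print(axiom)
```

The generic class carries no composition commitment; each subclass differs from it only in the filler of ‘composed primarily of’, which is what lets annotators pick the form that matches their organism or stage.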

Whatever the ontology and whatever the design decision you and your fellow editors make, I can guarantee that someone will not like that decision; or more frequently, fail to understand it. This often results in confusion and annoyance in trying to use an ontology. Annotators may be presented with two similar-sounding classes, and may not know the background and nuanced reasons you modeled it that way. This can result in frustration, and in inconsistency in how the ontology is applied (with some annotators opting for class A, and some for class B). Sometimes this inconsistency is not noticed until years later, after substantial resources have been devoted to annotation. The resulting corpus is far less useful because of this inconsistency in usage. This is something you want to take immediate prospective steps to avoid.

Documenting design decisions in an easy to comprehend way is also vital for maintainability of an ontology. Maybe you are lucky to have mad ontology skillz and have thought deeply and very hard about your domain, and have an elaborate internal model of how everything fits together, and you can easily slot terms into the perfect place in the ontology. If this is the case, pause reading this for now and read up about the Bus Factor. This is a concept originally drawn from software engineering: basically, if you get hit by a bus, then no one will be able to carry on the development of your ontology since all the key knowledge is in your head. I should stress this is a metaphorical bus, there is no actual bus driving around mowing down ontologists (although some may find it tempting).

If you document all design decisions it makes it easier for people to come on board and make edits to the ontology in ways that don’t introduce incoherencies. It makes it easier for annotators to understand your intent, reducing frustration, and making it less likely that the ontology is applied inconsistently.

Note that when I am talking about documentation here, I mean documentation in addition to well-formed textual definitions. The topic of writing good text definitions is deserving of its own post, and indeed a future post in this series will be dedicated entirely to definitions. While including good definitions is a necessary condition of a well-documented ontology, it’s a mistake to assume it’s sufficient. This is particularly true for ontologies that incorporate a lot of nuanced fine-grained distinctions. While these can seem like intricate Swiss watches to the designers, they may resemble Rube-Goldberg contraptions to some users.

How an ontology looks to its designer
How an ontology sometimes looks to users

Hopefully this has convinced you (or you were already convinced). So how should you go about documenting these decisions?

There is no one correct way, but I will provide some of my own recommendations here. I should also note that ontologies I work on often fall short of some of these. I will provide examples, from various different ontologies.

Manage your ontology documentation as embedded or external documents as appropriate

Documentation can either be embedded in the ontology, or external.

Example of embedded and external documentation. The box surrounded by dotted lines denotes the OWL file. The figure shows (1) annotations on classes with URL values that point to external docs; (2) annotations on classes with human-readable text as values; (3) a documentation axiom on a class, where the axiom is annotated with a URL; (4) design pattern YAML, which lives outside the OWL file but is managed as a text file in GitHub (the DP framework generates axioms, which can be auto-annotated with documentation axioms); (5) external documentation, e.g. on a wiki, which contains more detailed narrative formatted text, with images, figures, examples, etc.

Embedded documentation is documentation contained in the ontology itself, usually as annotation assertion axioms, such as textual definitions, rdfs:comments, etc. (Note here I am using “annotation” in the OWL sense, rather than the sense of annotating data and biological entities using ontologies).

Embedded documentation “follows the ontology around”, e.g. it is present when people download the OWL, and it should be visible in ontology browsers (although not all browsers show all annotation properties).

Embedded documentation is somewhat analogous to inline documentation in software development, but a key difference is that ontologies are not encapsulated in the same way; inline documentation in software is typically only visible to developers, not users. (The analogy fits better when thinking about inline documentation for APIs that gets exposed to API users). It is possible for an ontology to include embedded documentation that is only visible to ontology developers, by including a step in the ontology release workflow for stripping out internal definitions. See the point below about eliminating jargon.
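
As a rough illustration of such a release step, the sketch below filters developer-only annotations out of a simple in-memory representation before release. The property name `ex:editor_note_internal`, the class ID and the dictionary layout are all hypothetical; a real workflow would operate on the OWL file with whatever release tooling the project already uses.

```python
from typing import Dict, List, Tuple

# (annotation property, value) pairs attached to a class.
Annotation = Tuple[str, str]

# Hypothetical property for editor-only notes; not a real OBO annotation property.
INTERNAL_PROPERTIES = {"ex:editor_note_internal"}


def strip_internal(
    annotations: Dict[str, List[Annotation]],
) -> Dict[str, List[Annotation]]:
    """Return a copy of the annotations without developer-only properties."""
    return {
        cls: [(prop, value) for prop, value in anns if prop not in INTERNAL_PROPERTIES]
        for cls, anns in annotations.items()
    }


# The "edit" version keeps everything, including notes meant for developers.
edit_version = {
    "EX:0000001": [
        ("rdfs:comment", "Composition-agnostic grouping class."),
        ("ex:editor_note_internal", "Revisit after the next content meeting."),
    ],
}

# The "release" version is what users and browsers would see.
release_version = strip_internal(edit_version)
print(release_version)
```

The edit version is left untouched, so developers keep their notes while users only see the annotations intended for them.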

External documentation is documentation that exists outside the ontology. It may be managed alongside the ontology as text documents inside your GitHub repo, and version controlled (you are using version control for your ontology, aren’t you? If not, stop now and go back to the first post in this series!). Alternatively, it may be managed outside the repo, as a google doc, or in an external wiki. If you are using google docs to manage your documentation, then standard google doc management practice applies: keep everything well-organized in a folder rather than headless; use meaningful document titles (e.g. don’t call a doc “meeting”). Make all your documentation world-readable, and allow comments from a broad section of your community. If you are using mediawiki then categories are very useful, especially if the ontology documentation forms a part of a larger corpus or project documentation. Another choice is systems like readthedocs or mkdocs. If for some unfathomable reason you want to use Word docs, then obviously you should be storing these in version control or in the cloud somewhere (it’s easier to edit Word docs via google docs now), not on your hard drive or in an email attachment. For external documentation I would recommend something where it is easy to provide a URL that takes you to the right section of the text. A Word doc is less suited to this.

You could also explore various minipublication mechanisms. You could publish design documents using Zenodo, and get a DOI for them. This has some nice features such as easy tracking of different versions of a document, and providing more explicit attribution than something like a google doc. Sometimes actual peer-reviewed manuscripts can help serve as documentation; for example, the Vertebrate Skeleton Anatomy Ontology paper was written after an ontology content meeting involving experts in comparative skeletal anatomy and expert ontologists. However, peer-reviewed manuscripts are hard to write (and often take a long time to get reviewed for ontology papers). Even non-peer-reviewed manuscripts can be more time-intensive to write than less formal documentation. Having a DOI is not essential; it’s more important to focus on the documentation content itself and not get too tied to mechanism.

I personally like using markdown format for narrative text. It is easy to manage under version control, it is easy for people to learn and edit, the native rendering in GitHub is nice, it can be easily converted to formats like HTML using pandoc, and it works in systems like readthedocs, as well as GitHub tickets. Having a standard format allows for easy portability of documentation. Whatever system you are using, avoid ‘vendor lock-in’. It should be easy to migrate your documentation to a new system. We learned this the hard way when googlecode shut down: the wiki export capabilities turned out not to capture everything, which we only discovered later on.

One advantage of external docs is that they can be more easily authored in a dedicated document authoring environment. If you are editing embedded documentation as long chunks of text using Protege, then you have limited support for formatting the text or embedding images, and there is no guarantee about how formatting will be rendered in different systems.

However, the decoupling of external docs from the ontology itself can lead to things getting out of sync and getting stale. Keeping things in sync can be a maintenance burden. There is no ideal solution to this but it is something to be aware of.

An important class of documentation is structured templated design pattern specification files, such as DOSDPs or ROBOT templates. This will be a topic of a future post. The DOSDP YAML file is an excellent place to include narrative text describing a pattern, and the rationale for that pattern (see for example the carcinoma DP documentation in Mondo). These could be considered embedded, with the design pattern being a “metaclass” in the ontology, but it’s probably easier to consider these as external documentation. (in the future we hope to have better tools for compiling a DP down into human-friendly markdown or HTML, stay tuned).
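
To give a flavor of what compiling a pattern down to markdown could look like, here is a small Python sketch. The dictionary is a hand-written stand-in for a pattern file: its keys and values are illustrative assumptions and are not guaranteed to match the DOSDP schema, and `pattern_to_markdown` is a hypothetical helper, not part of any existing DOSDP tool.

```python
# Hand-written stand-in for a design pattern file (illustrative keys only).
pattern = {
    "pattern_name": "endochondral_element",
    "description": "A composition-agnostic element with cartilage and bone subclasses.",
    "rationale": "Keeps the ontology usable for developmental and comparative anatomy.",
    "equivalent_to": "'%s' and 'composed primarily of' some '%s'",
}


def pattern_to_markdown(p: dict) -> str:
    """Render the narrative and logical parts of a pattern as markdown text."""
    lines = [
        f"## {p['pattern_name']}",
        "",
        p["description"],
        "",
        f"**Rationale:** {p['rationale']}",
        "",
        f"**Logical definition:** `{p['equivalent_to']}`",
    ]
    return "\n".join(lines)


print(pattern_to_markdown(pattern))
```

Keeping the rationale next to the logical template means the human-readable page and the generated axioms come from a single source, which reduces the chance of them drifting apart.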

Another concept from software development is literate programming. Here the idea is that the code is embedded in narrative text/documentation, rather than vice versa. This can be applied to ontology development as this paper from Phil Lord and Jennifer Warrender demonstrates. I think this is an interesting idea, but it still remains hard to implement for ontologies that rely on a graphical ontology editing environment like Protege, rather than coding an ontology using a system like Tawny-OWL.

Provide clear links from sections of the ontology to relevant external documentation

When should documentation be inlined/embedded, and when should it be managed externally? There is no right answer, but as a rule of thumb I would keep embedded docs to a few sentences per unit of documentation, with anything larger being managed externally. With external docs it’s easier to use formatting, embed figures, etc. Wherever you choose to draw the line, it’s important to embed links to external documentation in the ontology. It’s all very well having reams of beautiful documentation, but if it’s hard to find, or it’s hard to navigate from the relevant part of the ontology to the appropriate section of the documentation, then it’s less likely to be read by the people who need to read it. Ideally everyone would RTFM in detail, but in practice you should assume that members of your audience are incredibly busy and thus appreciate being directed to portions that most concern them.

The URLs you choose to serve up external documentation from should ideally be permanent. Anecdotally, many URLs embedded in ontologies are now dead. You can use the OBO PURL system to mint PURLs for your documentation.
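
One cheap check along these lines is to scan annotation values for documentation links that are not PURLs, so they can be reviewed before they go stale. The sketch below assumes annotation values are available as plain strings; the example URLs are hypothetical, and the check only looks at the URL prefix without fetching anything.

```python
import re
from typing import Iterable, List

# Base of OBO PURLs; anything else is flagged for review.
OBO_PURL_PREFIX = "http://purl.obolibrary.org/obo/"

# Deliberately simple URL matcher: stops at whitespace, quotes and angle brackets.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def non_purl_links(annotation_values: Iterable[str]) -> List[str]:
    """Return URLs in the given values that do not start with the OBO PURL prefix."""
    flagged = []
    for value in annotation_values:
        for url in URL_PATTERN.findall(value):
            if not url.startswith(OBO_PURL_PREFIX):
                flagged.append(url)
    return flagged


values = [
    # Hypothetical PURL-style documentation link.
    "See http://purl.obolibrary.org/obo/example/docs/skeleton for details.",
    # Hypothetical wiki link that bypasses the PURL system.
    "Background notes: https://wiki.example.org/Appendages",
]

print(non_purl_links(values))
```

A flagged link is not necessarily wrong, but it is a candidate for a PURL so that the ontology does not need editing when the documentation moves.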

Links to external documentation can be embedded in the ontology using annotation properties such as rdfs:seeAlso.

For example, the Uberon class for the appendicular skeleton has a seeAlso link to the Uberon wiki page on the appendages and the appendicular skeleton.

Add documentation to individual axioms where appropriate

As well as class-level annotation, individual axioms can be annotated with URLs, giving an additional level of granularity. This can be very useful, for example to show why a particular synonym was chosen, or why a particular part-of link is justified.

A useful pattern is annotating embedded documentation with a link to external documentation that provides more details.
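
For readers who have not seen what this looks like under the hood, the sketch below prints the RDF (Turtle) shape of an rdfs:comment whose axiom is annotated with an rdfs:seeAlso link, using the standard owl:Axiom reification vocabulary. The class ID, comment text and URL are hypothetical, prefix declarations are omitted, and the helper assumes a single-line comment; in practice an editor such as Protege writes this structure for you.

```python
def annotated_comment_turtle(subject: str, comment: str, doc_url: str) -> str:
    """Return Turtle for a comment plus an axiom annotation linking to external docs.

    Prefix declarations (rdfs:, owl:, and the subject's prefix) are omitted.
    """
    # Escape backslashes and double quotes so the literal stays valid Turtle.
    literal = '"' + comment.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return "\n".join(
        [
            # The embedded documentation itself.
            f"{subject} rdfs:comment {literal} .",
            # The annotation on that axiom, pointing at the longer write-up.
            "[] a owl:Axiom ;",
            f"   owl:annotatedSource {subject} ;",
            "   owl:annotatedProperty rdfs:comment ;",
            f"   owl:annotatedTarget {literal} ;",
            f"   rdfs:seeAlso <{doc_url}> .",
        ]
    )


ttl = annotated_comment_turtle(
    "ex:C1",  # hypothetical class
    "Modeled as spatially disjoint from its sibling.",  # hypothetical comment
    "https://example.org/docs#disjointness",  # hypothetical URL
)
print(ttl)
```

The short comment travels with the ontology, while the link on the axiom takes the reader to the section of the external documentation that explains the decision in full.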

Unfortunately not all browsers render annotations on axioms, but this is something that can hopefully be resolved soon.

A “legacy documentation pattern” that you will see in some ontologies like GO is to annotate an annotation assertion with a CURIE-style identifier that denotes a content meeting. For example, the class directional locomotion has its text definition axiom annotated with a dbxref “GOC:mtg_MIT_16mar07”. In browsers like AmiGO, OLS, and OntoBee this shows up as the string “GOC:mtg_MIT_16mar07”. Obviously this is pretty impenetrable to the average user, and this should actually link to the relevant wiki page. We are actively working to fix this!
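
One way such identifiers could be made clickable is to expand them with a prefix map, as in the minimal sketch below. The base URL is a placeholder rather than the real location of the GO wiki, and `expand_dbxref` is a hypothetical helper, not an existing GO or browser function.

```python
from typing import Dict, Optional

# Hypothetical prefix map: the base URL is a placeholder, not the real GO wiki.
PREFIX_MAP = {"GOC": "https://wiki.example.org/go/"}


def expand_dbxref(curie: str, prefix_map: Dict[str, str] = PREFIX_MAP) -> Optional[str]:
    """Expand a CURIE-style dbxref into a URL, or return None if it cannot be expanded."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        return None
    return prefix_map[prefix] + local_id


print(expand_dbxref("GOC:mtg_MIT_16mar07"))
```

Returning None for unknown prefixes lets a browser fall back to showing the raw string instead of producing a broken link.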

Don’t wait: document prospectively

The best time to document is as you are editing the ontology (or even beforehand). Documenting retrospectively is harder.

And remember, good documentation is often a love letter to your future self.

Run ontology content meetings and clearly document key decisions

The Gene Ontology (GO) has a history of running face-to-face ontology content meetings, usually based around a particular biological topic. During these meetings ontology developers experienced with the GO, annotators, and subject matter experts get together to thrash out new areas of the ontology, or improve existing areas. Many other ontologies do this too — for example, see table 1 from the latest HPO NAR paper.

Organization | Location | Focus
--- | --- | ---
Undiagnosed Diseases Network (UDN); Stanford Center for Inherited Cardiovascular Diseases (SCICD) | Stanford University, CA, USA (March 2017) | Cardiology
European Reference Network for Rare Eye Disease (ERN-EYE) | Mont Sainte-Odile, France (October 2017) | Ophthalmology
National Institute of Allergy and Infectious Disease (NIAID) | National Institutes of Health, Bethesda, MD, USA (May and July 2018) | Allergy and immunology
Neuro-MIG European network for brain malformations (www.neuro-mig.org) | St Julians, Malta; Lisbon, Portugal (February 2018; September 2018) | Malformations of cortical development (MCD)
European Society for Immunodeficiencies (ESID) and the European Reference network on rare primary immunodeficiency, autoinflammatory and autoimmune diseases (ERN-RITA) | Vienna, Austria (September 2018) | Inborn errors of immunity

Community workshops and collaborations aimed at HPO content expansion and refinement (from Köhler et al 2019).

One thing that is lacking is a shared set of guidelines across OBO for running a successful content meeting. Whatever format you choose, it is important to take good notes, make summaries of these, and link them to the relevant areas of the ontology.

A situation you want to avoid is ten years down the line needing to refactor some crucial area of the ontology, and having some vague recollection that you modeled as X rather than Y because that was the preference of the experts, but having no documentation on exactly why they preferred things this way.

Don’t present the user with impenetrable jargon

It is easy for groups of ontology developers to lapse into jargon, whether it is domain-specific jargon, ontology-jargon, or jargon related to their ontology development processes (is there a jargon ontology?).

As an example of ontology jargon, see some classes from BFO such as generically dependent continuant.

b is a generically dependent continuant = Def. b is a continuant that g-depends_on one or more other entities. (axiom label in BFO2 Reference: [074-001]) [http://purl.obolibrary.org/obo/bfo/axiom/074-001 ]

Although BFO is intended to be hidden from average users, it frequently ‘leaks’, for example through ontology imports.

Jargon can be useful as an efficient way for experts to communicate, but as far as possible this should be minimized, with the intended audience clearly labeled, and supplemental documentation for average users.

Pay particular attention to documenting key abstractions (and simplify where possible)

Sometimes ontology developers like to introduce abstractions that give them the ability to introduce finer-grained distinctions necessary for some use case. An example is the endochondral element example introduced earlier. This can introduce complexity into an ontology, so it’s particularly important that these abstractions are well-documented.

One thing that consistently causes confusion in users who are not steeped in a particular mindset is proliferation of similar-seeming classes under different BFO categories. For example, having classes for a disease-as-disposition, a disorder-as-material-entity, a disease-course-as-process, a disease-diagnosis-as-data-item, etc. You can’t assume your users have read the BFO manual. It’s really important to document both your rationale for introducing these duplicative classes, and provide easy to consume documentation about how to select the appropriate class for different purposes.

Or perhaps you don’t actually need all of those different upper level categories in your ontology at all? This is the subject of a future post…

Sometimes less is more

More is not necessarily better. If a user has to wade through philosophical musings in order to get to the heart of the matter, they are less likely to actually read the docs.

Additionally, creating too much documentation can create a maintenance burden for yourself. This is especially true if the same information is communicated in multiple different places in the documentation.

Achieving the right balance can be hard. If you are too concise then the danger is the user has insufficient context.

Perfection is the enemy of the good: something is better than nothing

Given enough resources, everything would be perfectly documented, and the documentation would always be in sync. However, this is not always achievable. Rather than holding off until the documentation is perfect, it’s better to just put what you have out there.

Perhaps the current state of documentation is a google doc packed with unresolved comments in the margins, or a confusing GitHub ticket with lots of unthreaded comments. It’s important that this can be easily navigated to from relevant sections of the ontology, by at least other ontology developers. I would also advocate for inlining links to this documentation from inside the ontology; this can be clearly labeled as being links to internal documentation so as not to violate the no-jargon principle.

Overall it is both hard and time-consuming to write optimal documentation. When I look back at documentation I have written I often feel I haven’t done a great job, I use jargon too much, or crucial nuances are not well communicated. But we are still learning as a community what the best practices are here, and most of us are drastically under-resourced for ontology development, so all we can do is our best and hope to learn and improve.