The Open World Assumption Considered Harmful

A frequent source of confusion with ontologies, and more generally with any kind of information system, is the Open World Assumption. This trips up inexperienced users, but as I will argue in this post, information providers could do much more to help these users. But first, an explanation of the terms:

With the Open World Assumption (OWA) we do not make any assumptions based on the absence of statements. In contrast, with the Closed World Assumption (CWA), if something is not explicitly stated to be true, it is assumed to be false. As an example, consider a pet-owner database with the following facts:

Fred type Human .
Farrah type Human .

Foofoo type Cat .
Fido type Dog .

Fred owns Foofoo .
Farrah owns Fido.

Depicted as:

depiction of pet owners RDF graph, with triples indicated by arrows. RDF follows the OWA: the lack of a triple between Fred and Fido does not entail that Fred doesn’t own Fido.

Under the CWA, the answer to the question “how many cats does Fred own” is 1. Similarly, for “how many owners does Fido have” the answer is also 1.

RDF and OWL are built on the OWA, where the answer to both questions is: at least 1. We can’t rule out that Fred also owns Fido, or that he owns animals not known to the database. With the OWA, we can answer the question “does Fred own Foofoo” decisively with a “yes”, but if we ask “does Fred own Fido” the answer is “we don’t know”. It’s not asserted or entailed in the database, and neither is the negation.
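To make the contrast concrete, here is a minimal Python sketch (function names and representation are mine, purely for illustration) of how the same facts answer differently under the two assumptions:

```python
# Illustrative sketch of CWA vs OWA query answering over the pet-owner
# triples from the example above (not a real triple store or reasoner).
triples = {
    ("Fred", "type", "Human"),
    ("Farrah", "type", "Human"),
    ("Foofoo", "type", "Cat"),
    ("Fido", "type", "Dog"),
    ("Fred", "owns", "Foofoo"),
    ("Farrah", "owns", "Fido"),
}

def owns_count_cwa(subject):
    """CWA: absent triples are false, so the asserted count is THE count."""
    return sum(1 for s, p, o in triples if s == subject and p == "owns")

def owns_count_owa(subject):
    """OWA: absent triples are merely unknown, so only a lower bound is safe."""
    return f"at least {owns_count_cwa(subject)}"

def owns_owa(subject, obj):
    """OWA three-valued answer: asserted means yes; otherwise unknown
    (an explicit negative axiom would be needed to answer 'no')."""
    return "yes" if (subject, "owns", obj) in triples else "unknown"

owns_count_cwa("Fred")       # 1
owns_count_owa("Fred")       # 'at least 1'
owns_owa("Fred", "Foofoo")   # 'yes'
owns_owa("Fred", "Fido")     # 'unknown'
```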

Ontology formalisms such as OWL are explicitly built on the OWA, whereas traditional database systems have features constructed on the CWA.

OWL gives you mechanisms to add closure axioms, which allow you to be precise about what is known not to be true, in addition to what is known to be true. For example, we can state that Fred does not own Fido, which closes the world a little. We can also state that Fred only owns Cats, which closes the world further, but still does not rule out that Fred owns cats other than Foofoo. We can also use an OWL Enumeration construct to exhaustively list the animals Fred does own, which finally allows us to answer the question “how many animals does Fred own” with a specific number.

OWL ontologies and databases (aka ABoxes) often lack sufficient closure axioms to answer questions involving negation or counts. Sometimes this is simply because it’s a lot of work to add these additional axioms, work that doesn’t always have a payoff given typical OWL use cases. But often it is because of a mismatch between what the database/ontology author thinks they are saying, and what they are actually saying under the OWA. This kind of mismatch of intent is quite common among OWL ontology developers.

Another common trap is reading OWL axioms such as Domain and Range as Closed World constraints, as they might be applied in a traditional database or a CWA object-oriented formalism such as UML.

Consider the following database plus ontology in OWL, where we attempt to constrain the ‘owns’ property so that only humans can be owners:

owns Domain Human
Fido type Dog
Fred type Human
Fido owns Fred

We might expect this to yield some kind of error. Clearly, using our own knowledge of the world, something is amiss here (likely the direction of the final triple has been accidentally inverted). But if we feed this into an OWL reasoner to check for incoherencies (see previous posts on this topic), it will report everything as consistent. However, if we examine the inferences closely, we will see that it has inferred Fido to be both a Dog and a Human. It is only after we have stated explicit axioms that assert or entail that Dog and Human are disjoint that we will see an inconsistency:

OWL reasoner entailing Fido is both a Dog and a Human, with the latter entailed by the Domain axiom. Note the ontology is still coherent, and only becomes incoherent when we add a disjointness axiom
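The reasoner behavior described above can be mimicked in a toy sketch (this is illustrative Python, NOT a real OWL reasoner; the facts mirror the example above). The key point: a Domain axiom infers the type of the subject rather than checking it, so no error surfaces until a disjointness axiom makes the inferred types clash.

```python
# Toy sketch of Domain-axiom inference under the OWA.
facts = {("Fred", "type", "Human"),
         ("Fido", "type", "Dog"),
         ("Fido", "owns", "Fred")}
domain = {"owns": "Human"}   # owns Domain Human
disjoint = set()             # no disjointness axioms yet

def infer(facts):
    """Apply the domain rule: the subject of a property gets the domain class."""
    inferred = set(facts)
    for s, p, o in facts:
        if p in domain:
            inferred.add((s, "type", domain[p]))
    return inferred

def consistent(facts):
    """Check that no individual is inferred to be in two disjoint classes."""
    types = {}
    for s, p, o in infer(facts):
        if p == "type":
            types.setdefault(s, set()).add(o)
    return not any({a, b} <= ts for a, b in disjoint for ts in types.values())

consistent(facts)                 # True: Fido is quietly both Dog and Human
disjoint.add(("Dog", "Human"))    # now assert disjointness
consistent(facts)                 # False: the incoherence finally surfaces
```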

In many cases the OWA is the most appropriate formalism to use, especially in a domain such as the biosciences, where knowledge (and consequently our databases) is frequently incomplete. However, we can’t afford to ignore the fact that the OWA contradicts many user expectations about information systems, and must be pragmatic and take care not to lead users astray.

BioPAX and the Open World Assumption

BioPAX is an RDF-based format for exchanging pathways. It is nominally an RDF/OWL-based standard, with an OWL ontology defining the various classes and properties that can be used in the RDF representation. However, by their own admission the authors of the format were not aware of OWL semantics, and the OWA specifically, as explained in the appendix of the official Level 2 documentation, and further expanded on in a 2005 paper by Alan Ruttenberg, Jonathan Rees, and Joanne Luciano, Experience Using OWL DL for the Exchange of Biological Pathway Information, in particular the section “Ambushed by the Open World Assumption”. This gives particular examples of where the OWA makes things hard that should be easy, such as enumerating the members of a protein complex (we typically know all the members, but the BioPAX RDF representation doesn’t close the world).

BioPAX ontology together with RDF instances from EcoCyc. Triples for the reaction 2-iminopropanoate + H2O → pyruvate + ammonium are shown. The reaction has ‘left’ and ‘right’ properties for reactants such as H2O. These are intended to be exhaustive, but the lack of closure axioms means that we cannot rule out additional reactants for this reaction.

The Ortholog Conjecture and the Open World Assumption

gene duplication and speciation, taken from

In 2011 Nehrt et al made the controversial claim that they had overturned the ortholog conjecture, i.e. they claimed that orthologs were less functionally similar than paralogs. This was in contrast to the received wisdom, i.e. that if a gene duplicates within a species lineage (paralogs) there is redundancy, and one copy is less constrained and free to evolve a new function. Their analysis was based on semantic similarity of annotations in the Gene Ontology.

The paper stimulated a lot of discussion and follow-up studies and analyses. We in the Gene Ontology Consortium published a short response, “On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report”. In this we pointed out that the analysis assumed the CWA (absence of a function assignment means the gene does not have that function), whereas GO annotations should be interpreted under the OWA (we have an explicit way of asserting that a gene does not have a function, rather than relying on absence). Due to bias in GO annotations, paralogs may artificially have higher functional similarity scores, rendering the original analysis insufficient to reject the ortholog conjecture.

The OWA in GO annotations is also addressed in the GO Handbook in the chapter Pitfalls, Biases, and Remedies by Pascale Gaudet. This chapter also makes the point that OWA can be considered in the context of annotation bias. For example, not all genes are annotated at the same level of specificity. The genes that are annotated reflect biases in experiments and publication, as well as what is selected to be curated.

Open World Assumption Considered (Sometimes) Harmful

The OWA is harmful where it grossly misaligns with user expectations.

While a base assumption of OWA is usually required with any curated information, it is also helpful to think in terms of an overriding implicit contract between any information provider and information consumer: any (good) information provider attempts to provide information that is as complete as possible, given resource constraints.

My squid has no tentacles

Let’s take an example: if I provide an ontology that purports to be an anatomical ontology for squid, then it behooves me to make sure the main body parts found in a squid are present.

Chiroteuthis veranii, the long armed squid, By Ernst Haeckel, with two elongated tentacles.

Let’s say my ontology contains classes for some squid body parts such as eye and brain, yet lacks classes for others such as the tentacle. A user may be surprised and disappointed when they search for tentacle and come back empty-handed (or empty-tentacled, if they are a squid user). If this user were to tell me that my ontology sucked, I would be perfectly within my logical rights to retort: “sir, this ontology is in OWL and thus follows the Open World Assumption; as such the absence of a tentacle class in my squid ontology does not entail that squids lack tentacles, for such a claim would be ridiculous. Please refer to this dense interlinked set of documents published by the W3C that requires a PhD in logic to understand, and cease from making unwarranted assumptions about my ontology“.

Yet really the user is correct here. There should be an assumption of reasonable coverage, and I have violated that assumption. The tentacle is a major body part, it’s not like I have omitted some obscure neuroanatomical region. Is there a hard and fast dividing line here? No, of course not. But there are basic common sense principles that should be adhered to, and if they cannot be adhered to, omissions and biases should be clearly documented in the ontology to avoid confusing users.

This example is hypothetical, but I have seen many real cases where biases and omissions in ontologies confusingly lead the user to infer absence where the inference is unwarranted.

Hydroxychloroquine and COVID-19

The Coronavirus Infectious Disease Ontology (CIDO) integrates a number of different ontologies and includes axioms connecting terms or entities using different object properties. An example is the ‘treatment-for’ edge which connects diseases to treatments. Initially the ontology contained only a single treatment axiom, between COVID-19 and Hydroxychloroquine (HCQ). Under the OWA, this is perfectly valid: COVID-19 has been treated with HCQ (there is no implication about whether treatment is successful or not). However, the inclusion of a single edge of this type is at best confusing. A user could be led to believe there was something special about HCQ compared to other treatments, and that the ontology developers had deliberately omitted these. In fact initial evidence for HCQ as a successful treatment has not panned out (despite what some prominent adherents may say). There are many other treatments, many of which are in different clinical trial phases, many of which may prove more effective, yet assertions about these are lacking in CIDO. In this particular case, even though the OWA allows us to legitimately omit information, from a common sense perspective, less is more here: it is better to include no information about treatments at all rather than confusingly sparse information. Luckily the CIDO developers have rapidly addressed this situation.

Ragged Lattices, including species-specific classes

An under-appreciated problem is the confusion ragged ontology lattices can cause users. This can be seen as a mismatch between localized CWA expectations on the part of the user and OWA on the part of the ontology provider. But first an explanation of what I mean by ragged lattice:

Many ontologies are compositional in nature. In a previous post we discussed how the Rector Normalization pattern could be used to automate classification. The resulting multi-parent classification forms a lattice. I have also written about how we should embrace multiple inheritance. One caveat to both of these pieces is that we should be aware of the confusion that can be caused by inconsistently populated (‘ragged’) lattices.

Take for example cell types, which can be classified along a number of orthogonal axes, many intrinsic to the cell itself – its morphological properties, its lineage, its function, or the gene products it expresses. The example below shows the leukocyte hierarchy in CL, largely based on intrinsic properties:

Protege screenshot of the cell ontology, leukocyte hierarchy

Another way to classify cells is by anatomical location. In CL we have a class ‘kidney cell’ which is logically defined as ‘any cell that is part of a kidney’. This branch of CL recapitulates anatomy at the upper levels.

kidney cell hierarchy in CL, recapitulating anatomical classification

So far, perfectly coherent. However, the resulting structure can be confusing to someone not used to thinking in OWL and the OWA. I have seen many instances where a user will go to a branch of CL such as ‘kidney cell‘ and start looking for a class such as ‘mast cell‘. It’s perfectly reasonable for them to look here, as mast cells are found in most organs. However, CL does not place ‘mast cell’ as a subclass of ‘kidney cell’, as this would entail that all mast cells are found in the kidney. And CL has not populated the cross-product of all the main immune cell types with the anatomical structures in which they can be found. The fleshing out of the lattice is inconsistent, leading to confusion caused by violation of an assumed contract (provision of a class “kidney cell” with an incomplete set of cell types underneath it).

This is even more apparent if we introduce different axes of classification, such as the organism taxon in which the cell type is found, e.g. “mouse lymphocyte”, “human lymphocyte”:

inferred hierarchy when we add classes following a taxon design pattern, e.g. mouse cell, mouse lymphocyte. Only a small set of classes in the ontology are mouse specific.

Above is a screenshot of what happens when we introduce classes such as ‘mouse cell’ or ‘mouse lymphocyte’. We see very few classes underneath. Many people indoctrinated/experienced with OWL will not have a problem with this, they understand that these groupings are just for mouse-specific classes, and that the OWA holds, and absence of a particular compositional class, e.g. “mouse neuron” does not entail that mice do not have neurons.

One ontology in which the taxon pattern does work is the protein ontology, which includes groupings like “mouse protein”. PRO includes all known mouse proteins under this, so the classification is not ragged in the same way as the examples above.

There is no perfect solution here. Enforcing single inheritance does not work. Compositional class groupings are useful. However, ontology developers should try to avoid ragged lattices, and where possible populate lattices consistently. We need better tools here, e.g. ways to quantitatively measure the raggedness of our ontologies.
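As a sketch of what such a measure might compute (entirely hypothetical; the class names and the metric itself are inventions for illustration), one could check how much of a compositional cross-product has actually been populated:

```python
# Hypothetical raggedness sketch: how consistently has the cross-product of
# two axes (cell type x anatomical location) been populated as named classes?
cell_types = ["mast cell", "T cell", "B cell"]
locations = ["kidney", "lung", "liver"]

# compositional classes actually present in our (made-up) ontology:
present = {("T cell", "kidney"), ("mast cell", "kidney")}

def lattice_coverage(axis1, axis2, present):
    """Fraction of the cross-product that exists as a named class.
    1.0 = fully populated lattice; low values = ragged."""
    total = len(axis1) * len(axis2)
    return len(present) / total

lattice_coverage(cell_types, locations, present)  # 2/9, i.e. quite ragged
```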

Ontologies and databases should better document biases and assumptions

As providers of information, we need to do a better job of making all assumptions explicit and well-documented. This applies particularly to any curated corpus of knowledge, but in particular to ontologies. Even though hiding behind the OWA is logically defensible, we need to make things more intuitive for users.

It’s not uncommon for an ontology to have excellent, very complete coverage of one aspect of the domain, and to be highly incomplete in another (reflecting the biases/interests either of the developers or of the broader field). In fact I have been guilty of this in ontologies I have built or contributed to. I have seen users become confused when a class they expected to find was not present, when they have been perplexed by the ragged lattice problem, or when an edge they expected to find was not present.

Few knowledge bases can ever be complete, but we can do better at documenting known unknowns or incompletenesses. We can imagine a range of formal computable ways of doing this, but a good start would be some simple standard annotation properties that can be used as inline documentation in the ontology. Branches of the ontology could be tagged in this way, e.g. to indicate that ‘kidney cell’ doesn’t include all cells found in the kidney, only kidney-specific ones; or that developmental processes in GO are biased towards human and model organisms. This system could also be used for Knowledge Graphs and annotation databases, to indicate that particular genes may be under-studied or under-annotated, an extension of the ND evidence type used in GO.

In addition we could do a better job at providing consistent levels of coverage of annotations or classes. There are tradeoffs here, as we naturally do not want to omit anything, but we can do a better job at balancing things out. Better tools are required here for detecting imbalances and helping populate information in a more balanced consistent fashion. Some of these may already exist and I’m not aware of them – please respond in the comments if you are aware of any!

What is the SARS-CoV-2 molecular parts list?

There is a lot we still have to learn about SARS-CoV-2 and the disease it causes in humans. One aspect of the virus that we do know a lot about is its underlying molecular blueprint. We have the core viral genome, and broadly speaking we know the ‘parts list’ of proteins that are translated and spliced from the genome. There is a lot that we still don’t know about the proteins themselves – how variations affect the ability of the virus to infect a host, which molecules bind to these proteins and how that binding impacts their function. But at least we know the basic parts list. Or we think we do. There is the Spike protein (S), which adorns the surface of this virus like a crown, hence the name ‘coronavirus’. There are the 16 ‘non-structural proteins’ formed by cleavage of a viral polyprotein; one such protein is nsp5 which functions as a protease that performs this same cleavage. And there are the accessory proteins, such as the mysterious ORF8. The genomic blueprint and the translated and cleaved products can be illustrated visually:

SARS-CoV-2 genome and protein products. The ORF1a/1ab polyprotein is cleaved into cleavage products (non-structural proteins; nsp1-16). Note that there are two overlapping polyproteins 1a and 1b, only 1ab is shown here for simplicity. Image taken from

Each of these proteins has a variety of different names, for example, nsp3 is also known as PLpro. The genome is small enough that most scientists working on it have memorized the core aliases such that human-to-human communication is relatively unproblematic.

Of course, as we all know, relying on gene and protein symbols for unique identification in a database for machine-machine communication is a recipe for disaster. Symbols are inherently ambiguous, so we assign identifiers to entities in order to disambiguate them. These identifiers can then be adorned with metadata such as symbols, names, aliases, descriptions, functional descriptions and so on.

As everyone working in bioinformatics knows, different databases assign different identifiers for the same entity (by varying definitions of ‘same’), creating the ubiquitous identifier mapping problem and a cottage industry in mapping solutions.

This is a perennial problem for all omics entities such as genes and proteins, regardless of the organism or system being studied. But when it comes to SARS-CoV-2, things are considerably worse.

It turns out that many problems arise from the relatively simple biological phenomenon of cleavage of viral polyproteins. While the molecular biology is not so difficult (one parent protein serves as a source for many derivative proteins), many bioinformatics databases are not designed with this phenomenon in mind. This is fine for scenarios where we can afford to gloss over differences between the immediate products of translation and downstream cleavage products. While cleavage certainly happens in the human genome (e.g. POMC), it’s rare enough to effectively ignore in some contexts (although arguably this causes a lot of problems too). However, the phenomenon of a single translation product producing multiple functionally discrete units is much more common in viruses, which creates issues for many databases when creating a useful ‘canonical parts list’.

The roll-up problem

The first problem is that many databases either ignore the cleavage products or don’t assign them identifiers in the same category as other proteins. This has the effect of ‘rolling up’ all data to the polyprotein. This undercounts the number of true proteins, and does not provision identifiers for distinct functional entities.

For example, NCBI Gene does a fantastic job of assembling the genetic parts lists for genes across all cellular organisms and viruses. Most of the time, the gene is an appropriate unit of analysis, and we can use gene identifiers as proxies for the product transcribed and translated from that gene. In the case of SARS-CoV-2, NCBI mints a gene ID for the polyprotein (e.g. 1ab), but lacks distinct gene IDs for individual cleavage products, even though each arguably fulfills the definition of a discrete gene, and each is a discrete non-overlapping unit with a distinct function. Referring to the figure above, nsp1-10 are all ‘rolled up’ into the 1ab or 1a polyprotein entity.

Now this is perhaps understandable given that the NCBI Gene database is centered on genes (they do provide distinct protein IDs for the products, see later), and the case can be made that we should only have gene IDs for the immediate protein products (e.g. polyproteins, structural proteins, and accessory ORFs).

But the roll-up problem also exists for dedicated protein databases such as UniProt. UniProt mint IDs for polyproteins such as 1ab, but there are no UniProt accessions for nsp1-16. These are ‘rolled up’ into the 1ab entry, as shown in the screenshot:

UniProt entry for viral polyprotein 1ab. Function summaries for distinct proteins (nsp1-3 shown, others below the fold) are rolled up to the polyprotein level

However, UniProt do provide identifiers for the various cleavage products; these are called ‘chain IDs’, and are of the form PRO_nnnnn. For example, an identifier for the nsp3 product is PRO_0000449621. Due to the structure of these IDs they are sometimes called ‘PRO IDs’ (however, they should not be confused with IDs from the Protein Ontology, which are also called ‘PRO IDs’. Confusing, huh?).

UniProt ‘chain’ IDs, with nsp3 highlighted. These do not get distinct accessions and do not get treated in the same way as a ‘full’ accessioned protein entry

Unfortunately these chain IDs are not quite first-class citizens in the protein database world. For example, the fantastic InterProScan pipeline is executed on the polyproteins, not the chain IDs. This means that domain and GO function calls are made at the level of the polyprotein, so to a machine it looks like there is one super-multifunctional protein that acts as a protease, an ADP-ribose binding protein, an inducer of autophagosomes, etc. In one sense this is sort of true, but I don’t think it’s a very useful way of looking at protein function. It is more meaningful to assign the functions at the level of the individual cleavage products. It is possible to propagate the InterProScan-assigned annotations down to the NSPs using the supplied coordinates, but it should not fall on consumers to do this extra processing step.

The not-quite-first-class status of these identifiers also causes additional issues. For example different ways to write the same ID (P0DTD1-PRO_0000449621 vs P0DTD1:PRO_0000449621 vs P0DTD1#PRO_0000449621 vs simply PRO_0000449621), and no standard URL (although UniProt is working on these issues).
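As an illustration of the kind of post-processing this forces on consumers, here is a hypothetical normalizer for the variant spellings listed above (the function and the choice of hyphenated canonical form are mine, though the hyphenated form matches the convention settled on later in this post):

```python
import re

# Sketch: normalize the observed spellings of a UniProt chain ID
# (P0DTD1-PRO_..., P0DTD1:PRO_..., P0DTD1#PRO_..., or bare PRO_...)
# to a single ACCESSION-PRO_NNNNNNN form.
CHAIN_ID = re.compile(r"^(?:(?P<acc>[A-Z0-9]+)[-:#])?(?P<chain>PRO_\d+)$")

def normalize_chain_id(raw, default_acc=None):
    """Return 'ACCESSION-PRO_NNNNNNN' when an accession is available,
    otherwise the bare chain ID."""
    m = CHAIN_ID.match(raw)
    if not m:
        raise ValueError(f"not a chain ID: {raw}")
    acc = m.group("acc") or default_acc
    chain = m.group("chain")
    return f"{acc}-{chain}" if acc else chain

normalize_chain_id("P0DTD1:PRO_0000449621")  # 'P0DTD1-PRO_0000449621'
normalize_chain_id("PRO_0000449621")         # 'PRO_0000449621'
```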

The identical twin identifier problem

An additional major problem is the existence of two distinct identifiers for each of the initial non-structural proteins. Of course, we live with multiple identifiers in bioinformatics all the time, but we generally expect a single database to assign a single identifier for a single entity. Not so!

The problem here stems from the fact that there is a ribosomal frameshift in the translation of the polyprotein in SARS-CoV-2 (again, the biology here is fairly basic), which necessitates two distinct database entries, one for each polyprotein (1ab, aka P0DTD1, and 1a, aka P0DTC1). So far so good. However, while these are truly distinct polyproteins, the non-structural proteins cleaved from them are identical up until the frameshift. Due to an assumption in databases that each cleavage product must have one ‘parent’, IDs are duplicated. This is shown in the following diagram:

Two polyproteins 1ab and 1a (shown as 1a and 1b here; in fact the 1ab pp covers both ORFs). Each of nsp1-10 gets two distinct IDs depending on the ‘parent’, despite sequence identity and, as far as we know, the same function. Diagram courtesy of ViralZone/SIB

While on the surface this may seem like a trivial problem with some easy workarounds, in fact this representation breaks a number of things. First, it artificially inflates the proteome, making it seem there are more proteins than there actually are. A parts list is less useful if it has to be post-processed in ad-hoc ways to get the ‘true’ parts list.

It can also make it difficult to promote the use of standard database identifiers over protein symbols, because an arbitrary decision must be made, and if I make a different arbitrary decision from you, then our data does not automatically integrate. Ironically, using standard protein symbols like ‘nsp3’ may actually be better for database integration than the identifiers designed for that purpose!

And when curating something like a protein interaction database, a pathway database, or an orthology database, or assembling a COVID Knowledge Graph that deals with pairwise interactions, we must either choose arbitrarily or fully populate the cross-product of all pair combos. E.g. if nsp3 in SARS is orthologous to nsp3 in SARS-CoV-2, then we have to make four statements instead of one.
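The combinatorics can be sketched as follows (the ID strings are invented placeholders standing in for the duplicated accessions, not real identifiers):

```python
from itertools import product

# Sketch of the combinatorial cost: if each nsp3 carries two IDs (one per
# polyprotein parent), a single ortholog assertion fans out into four edges.
sars_nsp3_ids = ["SARS_pp1a_nsp3", "SARS_pp1ab_nsp3"]        # hypothetical
sars_cov2_nsp3_ids = ["CoV2_pp1a_nsp3", "CoV2_pp1ab_nsp3"]   # hypothetical

ortholog_edges = [(a, "orthologous_to", b)
                  for a, b in product(sars_nsp3_ids, sars_cov2_nsp3_ids)]

len(ortholog_edges)  # 4 statements where 1 would do
```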

While I have focused on UniProt IDs here, other major resources such as NCBI also have these duplicates in their protein databases, for the same reason.

Kudos to Wikidata and the Protein Ontology

Two resources I have seen that get this right are the Protein Ontology and Wikidata.

The Protein Ontology (aka PR, sometimes known as PRO; NOT to be confused with ‘PRO’ entries in UniProt) includes a single first-class identifier/PURL for each nsp; for example, nsp3 has the CURIE PR:000050272. It has mappings to each of the two sequence-identical PRO chain IDs in UniProt. It also has distinct entries for the parent polyproteins, and meaningful ontologically encoded edges linking the two. A dedicated SARS-CoV-2 protein ontology is also available.

Protein Ontology entry for SARS-CoV-2 nsp3, shown in obo format syntax (obo format is a human-readable concrete syntax for OWL).

Wikidata also does a good job of providing a single canonical identifier that is 1:1 with distinct proteins encoded by the SARS-CoV-2 genome (for example, nsp3 has its own entry). However, it is not as complete. Sadly it does not have mappings to either the Protein Ontology or the UniProt PRO chain IDs (remember: these are different!).

The big disadvantage of Wikidata and the Protein Ontology over the big major sequence databases is that they are not the big major sequence databases. They suffer a curation lag (one employing crowdsourcing, the other manual curation) whereas the main protein databases automate more, albeit at the expense of quirks such as non-first-class IDs and duplicate IDs. Depending on the use case, this may not be a problem. Due to the importance of the SARS-CoV-2 proteome, sufficient resources were able to be marshalled for this effort. But will this scale if we want unique non-duplicate IDs for all proteins in all coronaviruses – including all the different ones infecting bats and other hosts?

A compromise solution

When building KG-COVID-19 we needed to decide which IDs to use as canonical for SARS-CoV-2 genes and proteins. While our system is capable of working with alternate IDs (either normalizing during the KG build stage, or post build as part of a clique-merge step), it is best to avoid these. Mapping IDs can lead to either unintentional roll-ups (information about the cleavage product propagating up to the polyprotein) or worse, fanning-out (rolled up information then spreading to ‘sibling’ proteins); or if 1:1 is enforced the overall system is fragile.

We liked the curation work done by the Protein Ontology, but we knew (1) we needed a system from which we could instantly get IDs for proteins in any other viral genome, and (2) we wanted to be aligned with sources we were ingesting, such as the IntAct curation of the Gordon et al paper, and the Reactome plus GO-CAM curation of viral-host pathways. This necessitated the choice of a major database.

Working with the very helpful UniProt team, in concert with IntAct curators, we were able to establish a convention: of the duplicate entries, we take the ID that comes from the longer polyprotein as the ‘reference’. For example, nsp3 has the following chain IDs:

  • P0DTC1-PRO_0000449637 (from the shorter pp: 1a) [NO]
  • P0DTD1-PRO_0000449621 (from the longer pp: 1ab) [YES]

(Remember, these are sequence-identical and, as far as we know, functionally identical.)

In this case, we take PRO_0000449621 as the canonical/reference entry. This is also the entry IntAct use to curate interactions. We pretend that PRO_0000449637 does not exist.
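In code, this convention amounts to a simple lookup table (a sketch: the two chain IDs are the real nsp3 twins quoted above, but the dict shape and function name are mine):

```python
# Sketch of the KG-COVID-19-style convention: collapse each pair of
# sequence-identical chain IDs to the one from the longer polyprotein (1ab).
reference_of = {
    "P0DTC1-PRO_0000449637": "P0DTD1-PRO_0000449621",  # 1a twin -> 1ab ref
    "P0DTD1-PRO_0000449621": "P0DTD1-PRO_0000449621",  # already the reference
}

def canonicalize(chain_id):
    """Map an nsp chain ID to its reference twin (identity if not a twin)."""
    return reference_of.get(chain_id, chain_id)

canonicalize("P0DTC1-PRO_0000449637")  # 'P0DTD1-PRO_0000449621'
```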

This is very far from perfect. Biologically speaking, it’s actually the shorter pp that is more commonly expressed, so the choice of the longer one is potentially confusing. There is also the question of how UniProt should propagate annotations. It is valid to propagate from one chain ID to its ‘identical twin’. But what about when these annotations reference other cleavage products (e.g. pairwise functional annotations from a GO-CAM, or an ortholog)? Do we populate the cross-product? This could get confusing (my interest in this was both from the point of view of our COVID KG, and wearing my GO hat).

Nevertheless this was the best compromise we could find, and we decided to follow this convention.

Some of the decisions are recorded in this presentation

Working with the UniProt and IntAct teams we also came up with a standard way to write IDs and PURLs for the chain IDs (CURIEs are of the form UniProtKB:ACCESSION-PRO_NNNNNNN). While this is not the most thrilling or groundbreaking development in the battle against coronaviruses, it excites me as it means we have to do far less time consuming and error prone identifier post-processing just to make data link up.
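Constructing a CURIE of this agreed form is then trivial; a sketch (the helper is mine, and the 10-digit zero padding is inferred from the example IDs above):

```python
# Sketch: build a CURIE of the form UniProtKB:ACCESSION-PRO_NNNNNNN
def chain_curie(accession, chain_number):
    """Construct a chain-ID CURIE, e.g. for the nsp3 reference chain."""
    return f"UniProtKB:{accession}-PRO_{chain_number:010d}"

chain_curie("P0DTD1", 449621)  # 'UniProtKB:P0DTD1-PRO_0000449621'
```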

As part of the KG-COVID-19 project, Marcin Joachimiak coordinated the curation of a canonical UniProt-centric protein file (available in our GitHub repository), leveraging work that had been done by UniProt, the Protein Ontology curators, and the SciBite ontology team. We use UniProt IDs (either standard accessions, for polyproteins, structural proteins, and accessory ORFs; or chain IDs for NSPs). This file differs from the files obtained directly from UniProt, as we include only the reference members of nsp ‘twins’, and we exclude less meaningful cleavage products.

This file lives in GitHub (we accept Pull Requests) and serves as one source for building our KG. The information is also available in KGX property graph format, or as RDF, or can be queried from our SPARQL endpoint.

We are also coordinating with different groups such as COVIDScholar to use this as a canonical vocabulary for text mining. Previously groups performing concept recognition on the CORD-19 corpus using protein databases as dictionaries missed the non-structural proteins, which is a major drawback.

Imagine a world

In an ideal world posts like this would never need to be written. There is no panacea; however, systems such as the Protein Ontology and Wikidata which employ an ontologically grounded flexible graph make it easier to work around legacy assumptions about relationships between types of molecular parts (see also the feature graph concept from Chado). The ontology-oriented basis makes it easier to render implicit assumptions explicit, and to encode things such as the relationship between molecular parts in a computable way. Also embracing OBO principles and reusing identifiers/PURLs rather than minting new ones for each database could go some way towards avoiding potential confusion and duplication of effort.

I know this is difficult to conceive of in the current landscape of bioinformatics databases, but paraphrasing John Lennon, I invite you to:

Imagine no proliferating identifiers
I wonder if you can
No need for mappings or normalization

A federation of ann(otations)
Imagine all the people (and machines) sharing all the data, you

You may say I’m a dreamer
But I’m not the only one
I hope some day you’ll join us
And the knowledge will be as one

Using Wikidata for crowdsourced language translations of ontologies

In the OBO world, most of our ontologies are developed by and for English-speaking audiences. We would love to have translations of labels, synonyms, definitions, and so on in other languages. However, we lack the resources to do this (an exception is the HPO, which includes high quality curated translations for many languages).

Wikipedia/Wikidata is an excellent source of crowdsourced language translations. Wikidata provides language-neutral concept IDs that link multiple language-specific Wikipedia pages. Wikidata also includes mappings to ontology class IDs, and provides a SPARQL endpoint. All this can be leveraged for a first pass at language translations.

For example, the Wikidata entity for badlands is mapped to the equivalent ENVO class PURL. This entity in Wikidata also has multiple rdfs:label annotations (maximum one per language).

We can query Wikidata for all rdfs:label translations for all classes in ENVO. I will use the sparqlprog_wikidata framework to demonstrate this:

pq-wikidata 'envo_id(C,P),label(C,N),Lang is lang(N)'

This compiles down to the following SPARQL which is then executed against the Wikidata endpoint:

SELECT ?c ?p ?n ?lang WHERE {
  ?c wdt:P3859 ?p .    # P3859 is the Wikidata "Environment Ontology ID" property
  ?c rdfs:label ?n .
  BIND( LANG(?n) AS ?lang )
}

The results look like this:

wd:Q272921,00000127,Tierras baldías,es

Somewhat disappointingly, there are relatively few translations for ENVO. But this is because the Wikidata property for mapping to ENVO is relatively new. We actually have a large number of outstanding new Wikidata to ENVO mappings we need to upload. Once this is done the coverage will increase.

Of course, different ontologies will differ in how their coverage maps to Wikidata. In some cases, ontologies will include many more concepts; or the corresponding Wikidata entities will have fewer or no non-English labels. But these gaps will likely decrease over time.

There may be other ways to increase coverage. Many ontology classes are compositional in nature, so a combination of language translations of base classes plus language specific encodings of grammatical patterns could yield many more. The natural place to add these would be in the manually curated .yaml files used to specify ontology design patterns, through frameworks like DOSDP. And of course, there is a lot of research in Deep Learning methods for language translation. A combination of these methods could yield high coverage with hopefully good accuracy.
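A minimal sketch of the compositional idea: combine a per-language grammatical template with crowdsourced translations of the base class. The Spanish template and translation below are illustrative assumptions, not curated data:

```python
# Sketch of compositional translation: a language-specific grammatical
# pattern plus a base-class translation yields a composed label.
# Both the template and the translation here are illustrative.
base = {"es": {"Schwann cell": "célula de Schwann"}}
pattern = {"es": "diferenciación de {x}"}  # word order differs per language

def translate(term_x: str, lang: str) -> str:
    """Compose a translated label for '<term_x> differentiation'."""
    return pattern[lang].format(x=base[lang][term_x])

print(translate("Schwann cell", "es"))
# → diferenciación de célula de Schwann
```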

As far as I am aware, these methods have not been formally evaluated. Doing an evaluation will be challenging as it will require high-quality gold standards. Ontology developers spend a lot of time coming up with the best primary label for classes, balancing ontological correctness, elimination of ambiguity, understanding of usage of terms by domain specialists, and (for many ontologies, but not all) avoiding overly abstruse labels. Manually curated efforts such as the HPO translations would be an excellent start.

OntoTip: Don’t over-specify OWL definitions

This is one post in a series of tips on ontology development, see the parent post for more details.

A common mistake is to over-specify an OWL definition (another post will be on under-specification). While not technically wrong, over-specification loses you reasoning power, limiting your ability to auto-classify your ontology. Formally, what I mean by over-specifying here is: stating more conditions than are required for correct entailments.

One manifestation of this anti-pattern is the over-specified genus (this is where I disagree with Seppala et al on S3.1.1, use the genus proximum; see previous post). I will use a contrived example here, although there are many real ones. GO contains a class ‘Schwann cell differentiation’, with an OWL definition referencing ‘Schwann cell’ from the cell ontology (CL). I consider the logical definition to be neither over- nor under-specified:

‘Schwann cell differentiation’ EquivalentTo ‘cell differentiation’ and results-in-acquisition-of-features-of some ‘Schwann cell’

We also have a corresponding logical definition for the parent:

‘glial cell differentiation’ EquivalentTo ‘cell differentiation’ and results-in-acquisition-of-features-of some ‘glial cell’

The Cell Ontology (CL) contains the knowledge that Schwann cells are subtypes of glial cells, which allows us to infer that ‘Schwann cell differentiation’ is a subtype of ‘glial cell differentiation’. So far, so good (if you read the post on Normalization you should be nodding along). This definition does real work for us in the ontology: we infer the GO hierarchy based on the definition and classification of cells in CL. 
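To make this notion of “work” concrete, here is a toy sketch in plain Python (not a real OWL reasoner; the dictionaries stand in for CL and GO axioms) showing how the GO parentage falls out of the definitions plus the CL hierarchy:

```python
# Toy illustration of how logical definitions let us *infer* the GO
# hierarchy from CL, rather than assert it. Not a real DL reasoner.
cl_is_a = {"Schwann cell": "glial cell"}  # knowledge owned by CL

# GO logical definitions: (genus, filler of
# results-in-acquisition-of-features-of)
go_defs = {
    "Schwann cell differentiation": ("cell differentiation", "Schwann cell"),
    "glial cell differentiation": ("cell differentiation", "glial cell"),
}

def inferred_parents(term):
    """Infer GO parents by walking the CL hierarchy of the filler."""
    genus, filler = go_defs[term]
    parents = []
    while filler in cl_is_a:
        filler = cl_is_a[filler]
        for other, (g, f) in go_defs.items():
            if g == genus and f == filler:
                parents.append(other)
    return parents

print(inferred_parents("Schwann cell differentiation"))
# → ['glial cell differentiation']
```

If CL later revises the classification of Schwann cells, the GO hierarchy follows automatically, which is the DRY point made below.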

Now, imagine that in fact GO had an alternate OWL definition:

‘Schwann cell differentiation’ EquivalentTo ‘glial cell differentiation’ and results-in-acquisition-of-features-of some ‘Schwann cell’

This is not wrong, but is far less useful. We want to be able to infer the glial cell parentage, rather than assert it. Asserting it violates DRY (the Don’t Repeat Yourself principle), as we implicitly repeat the assertion about Schwann cells being glial cells in GO (when in fact the primary assertion belongs in CL). If one day the community decides that Schwann cells are not glial but in fact neurons (OK, so this example is not so realistic…), then we have to change this in two places. Having to change things in two places is definitely a bad thing.

I have seen this kind of genus-overspecification in a number of different ontologies; this can be a side-effect of the harmful misapplication of the single-inheritance principle (see ‘Single inheritance considered harmful’, a previous post). This can also arise from tooling limitations: the NCIT neoplasm hierarchy has a number of examples of this due to the tool they originally used for authoring definitions.

Another related over-specification is too many differentiae, which drastically limits the work a reasoner and your logical axioms can do for you. As a hypothetical example, imagine that we have a named cell type ‘hippocampal interneuron’, conventionally defined and used in the (trivial) sense of any interneuron whose soma is located in a hippocampus. Now let’s imagine that single-cell transcriptomics has shown that these cells always express genes A, B and C (OK, there are many nuances with integrating ontologies with single-cell data, but let’s make some simplifying assumptions for now).

It may be tempting to write a logical definition:

‘hippocampal interneuron’ EquivalentTo

  • interneuron AND
  • has-soma-location SOME hippocampus AND
  • expresses some A AND
  • expresses some B AND
  • expresses some C

This is not wrong per se (at least in our hypothetical world where hippocampal neurons always express these), but the definition does less work for us. In particular, if we later include a cell type ‘hippocampus CA1 interneuron’, defined as any interneuron in the CA1 region of the hippocampus, we would like this to be classified under ‘hippocampal interneuron’. However, this will not happen unless we redundantly state the gene expression criteria for every such class, violating DRY.

The correct thing to do here is to use what is sometimes called a ‘hidden General Class Inclusion (GCI) axiom’ which is just a fancy way of saying that SubClassOf (necessary conditions) can be mixed in with an equivalence axiom / logical definition:

‘hippocampal interneuron’ EquivalentTo interneuron AND has-soma-location SOME hippocampus

‘hippocampal interneuron’ SubClassOf expresses some A

‘hippocampal interneuron’ SubClassOf expresses some B

‘hippocampal interneuron’ SubClassOf expresses some C
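A toy sketch of why this refactoring pays off (plain Python, a hypothetical structural check rather than a real DL reasoner): with the gene-expression axioms moved out of the equivalence axiom, the CA1 class classifies correctly without restating any expression criteria:

```python
# Toy structural subsumption check (not a real reasoner). With gene
# expression as SubClassOf-only axioms, the equivalence axioms alone
# are enough to classify the CA1 class.
part_of_hippocampus = {"CA1"}  # CA1 is part of the hippocampus

equiv = {  # equivalence axioms: (genus, soma location)
    "hippocampal interneuron": ("interneuron", "hippocampus"),
    "hippocampus CA1 interneuron": ("interneuron", "CA1"),
}
necessary = {  # SubClassOf-only axioms; play no role in classification
    "hippocampal interneuron": ["expresses A", "expresses B", "expresses C"],
}

def is_subclass(sub, sup):
    g1, loc1 = equiv[sub]
    g2, loc2 = equiv[sup]
    return g1 == g2 and (
        loc1 == loc2
        or (loc1 in part_of_hippocampus and loc2 == "hippocampus")
    )

print(is_subclass("hippocampus CA1 interneuron", "hippocampal interneuron"))
# → True
```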

In a later post, I will return to the concept of an axiom doing ‘work’, and provide a more formal definition that can be used to evaluate logical definitions. However, even without a formal metric, the concept of ‘work’ is intuitive to people who have experience using OWL logical definitions to derive hierarchies. These people usually intuitively test things in the reasoner as they go along, rather than simply writing an OWL definition and hoping it will work.

Another sign that you may be overstating logical definitions is if they are for groups of similar classes, yet they do not fit into any design pattern template.

For example, in the above examples, the cell differentiation branch of GO fits into a standard pattern

cell differentiation and results-in-acquisition-of-features-of some C

where C is any cell type. The over-specified definition does not fit this pattern.
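The pattern itself can be captured as a trivial template, a minimal sketch of the DOSDP idea (the template string follows the GO definitions above):

```python
# Minimal sketch of a design-pattern template for the cell
# differentiation branch: one template covers the whole branch, so a
# definition that doesn't fit the template stands out as over- (or
# under-) specified.
TEMPLATE = ("'{cell} differentiation' EquivalentTo "
            "'cell differentiation' and "
            "results-in-acquisition-of-features-of some '{cell}'")

def differentiation_def(cell: str) -> str:
    return TEMPLATE.format(cell=cell)

print(differentiation_def("Schwann cell"))
```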




Proposed strategy for semantics in RDF* and Property Graphs

Update 2020-09-12: I created a GitHub repo that concretizes part of the proposal here

Graph databases such as Neo4J are gaining in popularity. These are in many ways comparable to RDF databases (triplestores), but I will highlight three differences:

  1. The underlying data model in most graph databases is a Property Graph (PG). This means that information can be directly attached to edges. In RDF this can only be done indirectly, via reification, reification-like models, or named graphs.
  2. RDF is based on open standards, and comes with a standard query language (SPARQL), whereas a unified set of standards have yet to arrive for PGs.
  3. RDF has a formal semantics, and languages such as OWL can be layered on providing more expressive semantics.

RDF* (and its accompanying query language SPARQL*) is an attempt to bring PGs into RDF, thus providing an answer for points 1-2. More info can be found in this post by Olaf Hartig.

You can find more info in that post and in related docs, but briefly: RDF* adds syntax for attaching properties directly onto edges, e.g.

<<:bob foaf:friendOf :alice>> ex:certainty 0.9 .

This has a natural visual cognate:

[Figure: the friend-of edge from :bob to :alice, with the certainty of 0.9 attached directly to the edge]

We can easily imagine building this out into a large graph of friend-of connections, or connecting other kinds of nodes, and keeping additional useful information on the edges.
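A minimal sketch of this edge-property model in plain Python, where a dict keyed by edge stands in for a property graph store such as Neo4j:

```python
# Sketch: the property-graph view of the friend-of example. A dict of
# edge -> properties stands in for a PG store or an RDF* triplestore.
edges = {}

def add_edge(s, p, o, **props):
    edges[(s, p, o)] = props

add_edge(":bob", "foaf:friendOf", ":alice", certainty=0.9)
add_edge(":alice", "foaf:friendOf", ":carol", certainty=0.5)

# Filtering edges by a property *on the edge itself* is awkward in
# plain RDF (it needs reification) but natural in a PG or RDF*:
confident = [e for e, p in edges.items() if p["certainty"] > 0.8]
print(confident)
# → [(':bob', 'foaf:friendOf', ':alice')]
```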

But what about the 3rd item, semantics?

What about semantics?

For many in both linked data/RDF and in graph database/PG camps, this is perceived as a minor concern. In fact you can often find RDF people whinging about OWL being too complex or some such. The “semantic web” has even been rebranded as “linked data”. But in fact, in the life sciences many of us have found OWL to be incredibly useful, and being able to clearly state what your graphs mean has clear advantages.

OK, but then why not just use what we have already? OWL-DL already has a mapping to RDF, and any document in RDF is automatically an RDF* document, so problem solved?

Not quite. There are two issues with continuing the status quo in the world of RDF* and PGs:

  1. The mapping of OWL to RDF can be incredibly verbose and leads to unintuitive graphs that inhibit effective computation.
  2. OWL is not the only fruit. It is great for the use cases it was designed for, but there are other modes of inference and other frameworks beyond first-order logic that people care about.

Issues with existing OWL to RDF mapping

Let’s face it, the existing mapping is pretty ugly. This is especially true for life-science ontologies that are typically conceived of as relational graphs, where edges are formally SubClassOf-SomeValuesFrom axioms. See the post on obo json for more discussion of this. The basic idea here is that in OWL, object properties connect individuals (e.g. my left thumb is connected to my left hand via part-of). In contrast, classes are not connected directly via object properties; rather, they are related via subClassOf and class expressions. It is not meaningful in OWL to say “finger (class) part_of hand (class)”. Instead we seek to say “all instances of finger are part_of some x, where x is an instance of a hand”. In Manchester Syntax this has the compact form

Finger SubClassOf Part_of some Hand

This is translated to RDF as

Finger rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty :part_of ;
    owl:someValuesFrom :Hand
] .


As an example, consider 3 classes in an anatomy ontology: finger, hand, and forelimb, all connected via part-of (i.e. every finger is part of some hand, and every hand is part of some forelimb). This looks sensible when we use a native OWL syntax, but when we encode it as RDF we get a monstrosity:


Fig 2. (A) Two axioms written in Manchester Syntax describing the anatomical relationships between three structures. (B) The corresponding RDF following the official OWL to RDF mapping, with 4 triples per existential axiom and the introduction of two blank nodes. (C) How the axioms are conceptualized by ontology developers and domain experts, and how most browsers render them. The disconnect between B and C is an enduring source of confusion.

This ugliness was not the result of some kind of perverse decision by the designers of the OWL specs, it’s a necessary consequence of the existing stack which bottoms out at triples as the atomic semantic unit.

In fact, in practice many people employ some kind of simplification, bypassing the official mapping and storing the edges as simple triples, even though this is semantically invalid. We can see this for example in how Wikidata loads OBOs into its triplestore. This can cause confusion: for example, WD stores reciprocal inverse axioms (e.g. part-of, has-part) even though inverses are meaningless when collapsed to simple triples.

I would argue that when we say we are using a graph-based formalism, there is an implicit contract that the structures in our model correspond to the kinds of graphs we draw on whiteboards when representing an ontology or knowledge graph, and to the kinds of graphs that are useful for computation; the current mapping violates that implicit contract, usually causing a lot of confusion.

It also has pragmatic implications. Writing a SPARQL query that traverses a graph like the one in (B), following certain edge types but not others (one of the most common uses of ontologies in bioinformatics), is a horrendous task!
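To make this concrete, here are illustrative SPARQL strings (held in Python only for comparison) for the same one-hop part-of traversal under each encoding:

```python
# Illustrative SPARQL contrasting the two encodings. Under the official
# OWL-to-RDF mapping, each part-of edge requires hopping through a
# blank-node owl:Restriction:
official = """SELECT ?x ?y WHERE {
  ?x rdfs:subClassOf ?r .
  ?r a owl:Restriction ;
     owl:onProperty :part_of ;
     owl:someValuesFrom ?y .
}"""

# Over direct triples, the same traversal is a single pattern:
direct = "SELECT ?x ?y WHERE { ?x :part_of ?y }"

# And this is the easy case: following a *chain* of such edges needs a
# property path that steps through the blank nodes at every hop.
```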

OWL is not the only knowledge representation language

The other reason not to stick with the status quo for semantics for RDF* and PGs is that we may want to go beyond OWL.

OWL is fantastic for the things it was designed for. In the life sciences, it is vital for automatic classification and semantic validation of large ontologies (see half of the posts in this blog site). It is incredibly useful for checking the biological validity of complex instance graphs against our encoded knowledge of the world.

However, not everything we want to say in a Knowledge Graph (KG) can be stated directly in OWL. OWL-DL is based on a fragment of first order logic (FOL); there are certainly things not in that fragment that are useful, but often we have to go outside strict FOL altogether. Much of biological knowledge is contextual and probabilistic. A lot of what we want to say is quantitative rather than qualitative.

For example, when relating a disease to a phenotype (both of which are conventionally modeled as classes, and thus not directly linkable via a property in OWL), it is usually false to say “every person with this disease has this phenotype“. We can invent all kinds of fudges for this – BFO has the concept of a disposition, but this is just a hack for not being able to state probabilistic or quantitative knowledge more directly.

A proposed path forward for semantics in Property Graphs and RDF*

RDF* provides us with an astoundingly obvious way to encode at least some fragment of OWL in a more intuitive way that preserves the graph-like natural encoding of knowledge. Rather than introduce additional blank nodes as in the current OWL to RDF mapping, we simply push the semantics onto the edge label!

Here is an example of how this might look for the axioms in the figure above, in RDF*:

<<:finger :part-of :hand>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
<<:hand :part-of :forelimb>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .

I am assuming the existence of a vocabulary called owlstar here – more on that in a moment.

In any native visualization of RDF* this will end up looking like Fig1C, with the semantics adorning the edges where they belong. For example:


Proposed owlstar mapping of an OWL subclass restriction. This is clearly simpler than the corresponding graph fragment in 2B. While the edge properties (in square brackets) may be too abstract to show to an end user (or even a bioinformatician performing graph-theoretic operations), the core edge is meaningful and corresponds to how an anatomist or ordinary person might think of the relationship.

Maybe this is all pretty obvious, and many people loading bio-ontologies into either Neo4j or RDF end up treating edges as edges anyway. You can see the mapping we use in our SciGraph Neo4J OWL Loader, which is used by both Monarch Initiative and NIF Standard projects. The OLS Neo4J representation is similar. Pretty much anyone who has loaded the GO into a graph database has done the same thing, ignoring the OWL to RDF mapping. The same goes for the current wave of Knowledge Graph embedding based machine learning approaches, which typically embed a simpler graphical representation.

So problem solved? Unfortunately, everyone is doing this differently, essentially throwing out OWL altogether. We lack a standard way to map OWL into Property Graphs, so everyone invents their own. The same is true for people using RDF stores: people often have their own custom OWL mapping that is less verbose. In some cases this is semantically dubious, as is the case for the Wikidata mapping.

The simple thing is for everyone to rally around a common standard mapping, and RDF* seems a good foundation. Even if you are using plain RDF, you could follow this standard and choose to map edge properties to reified nodes, or to named graphs, or to the Wikidata model. And if you are using a graph database like Neo4J, there is a straightforward mapping to edge properties.

I will call this mapping OWL*, and it may look something like this:

RDF* statement ⇒ OWL interpretation:

  • <<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
    ⇒ ?c SubClassOf ?p some ?d
  • <<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfQCR ; owlstar:cardinality ?n .
    ⇒ ?c SubClassOf ?p exactly ?n ?d
  • <<?c ?p ?d>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom ; owlstar:subjectContextProperty ?cp ; owlstar:subjectContextFiller ?cf .
    ⇒ (?c and ?cp some ?cf) SubClassOf ?p some ?d

Note that the core of each of these mappings is a single edge/triple between class ?c, class ?d, and an edge label ?p. The first row is a standard existential restriction common to many ontologies. The second row is for statements such as ‘hand has-part 5 fingers’, which is still essentially a link between a hand concept and a finger concept. The third is for a GCI, an advanced OWL construct which turns out to be quite intuitive and useful at the graph level, where we are essentially contextualizing the statement, e.g. in developmentally normal adult humans (the context), hand has-part 5 fingers.
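As a sketch of how a consumer might use such a mapping (the owlstar terms are from the proposal above, not an existing standard), recovering the Manchester-syntax axiom from an adorned edge is a simple dispatch on the interpretation tag:

```python
# Hypothetical sketch of consuming OWL* edges: each edge is a single
# triple plus an interpretation tag, from which the full OWL axiom can
# be recovered. The owlstar vocabulary is this post's proposal, not an
# existing standard.
def to_manchester(s, p, o, interpretation, **props):
    if interpretation == "owlstar:SubClassOfSomeValuesFrom":
        return f"{s} SubClassOf {p} some {o}"
    if interpretation == "owlstar:SubClassOfQCR":
        return f"{s} SubClassOf {p} exactly {props['cardinality']} {o}"
    raise ValueError(f"unmapped interpretation: {interpretation}")

print(to_manchester(":finger", ":part-of", ":hand",
                    "owlstar:SubClassOfSomeValuesFrom"))
# → :finger SubClassOf :part-of some :hand
print(to_manchester(":hand", ":has-part", ":finger",
                    "owlstar:SubClassOfQCR", cardinality=5))
# → :hand SubClassOf :has-part exactly 5 :finger
```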

When it comes to a complete encoding of all of OWL there may be decisions to be made as to when to introduce blank nodes vs cramming as much as possible into edge properties (e.g. for logical definitions), but even having a standard way of encoding subclass-plus-quantified-restriction axioms would be a huge boon.

Bonus: Explicit deferral of semantics where required

Many biological relationships expressed in natural language in forms such as “Lmo-2 binds to Elf-2” or “crocodiles eat wildebeest” can cause formal logical modelers a great deal of trouble. See for example “Lmo-2 interacts with Elf-2”: On the Meaning of Common Statements in Biomedical Literature (also slides), which lays out the different ways these seemingly straightforward statements about classes can be modeled. This is a very impressive and rigorous work (I will have more to say on how this aligns with GO-CAM in a future post), and it ends with an impressive Wall of Logic:


Dense logical axioms proposed by Schulz & Jansen for representing biological interactions

This is all well and good, but when it comes to storing the biological knowledge in a database, the majority of developers are going to expect to see this:


Protein interaction represented as a single edge connecting two nodes, as stored in every protein interaction database

And this is not due to some kind of semantic laziness on their part: representing biological interactions using this graphical formalism (whether we are representing molecular interactions or ecological interactions) allows us to take advantage of powerful graph-theoretic algorithms to analyze data that are frankly much more useful than what we can do with a dense FOL representation.

I am sure this fact is not lost on the authors of the paper who might even regard this as somewhat trivial, but the point is that right now we don’t have a standard way of serializing more complex semantic expressions into the right graphs. Instead we have two siloed groups, one from a formal perspective producing complex graphs with precise semantics, and the other producing useful graphs with no encoding of semantics.

RDF* gives us the perfect foundation for being able to directly represent the intuitive biological statement in a way that is immediately computationally useful, and to adorn the edges with additional triples that more precisely state the desired semantics, whether it is using the Schulz FOL or something simpler (for example, a simple some-some statement is logically valid, if inferentially weak here).

Beyond FOL

There is no reason to have a single standard for specifying semantics for RDF* and PGs. As hinted in the initial example, there could be a vocabulary or series of vocabularies for making probabilistic assertions, either as simple assignments of probabilities or frequencies, e.g.

<<:RhinovirusInfection :has-symptom :RunnyNose>> probstar:hasFrequency 0.75 .

or more complex statements involving conditional probabilities between multiple nodes (e.g. probability of symptom given disease and age of patient), allowing encoding of ontological Bayesian networks and Markov networks.

We could also represent contextual knowledge, using a ‘that’ construct borrowed from IKL:

<<:clark_kent owl:sameAs :superman>> a ikl:that ; :believed-by :lois_lane .

which could be visually represented as:


Lois Lane believes Clark Kent is Superman. Here an edge has a link to another node rather than simply literals. Note that while possible in RDF*, in some graph databases such as Neo4j, edge properties cannot point directly to nodes, only indirectly through key properties. In other hypergraph-based graph DBs a direct link is possible.

Proposed Approach

What I propose is a series of lightweight vocabularies such as my proposed OWL*, accompanied by mapping tables such as the one above. I am not sure if W3C is the correct approach, or something more bottom-up. These would work directly in concert with RDF*, and extensions could easily be provided to work with various ways to PG-ify RDF, e.g. reification, Wikidata model, NGs.

The same standard could work for any PG database such as Neo4J. Of course, here we have the challenge of how best to encode IRIs in a framework that does not natively support them, but this is an orthogonal problem.

All of this would be non-invasive and unobtrusive to people already working with these technologies, as the underlying structures used to encode knowledge would likely not change, beyond additional adornment of edges. A perfect stealth standard!

It would help to have some basic tooling around this. I think the following would be straightforward and potentially very useful:

  • Implementation of the OWL* mapping of existing OWL documents to RDF* in tooling – maybe the OWLAPI, although we are increasingly looking to Python for our tooling (stay tuned to hear more on funowl).
  • This could also directly bypass RDF* and go directly to some PG representation, e.g. networkx in Python, or stored directly into Neo4J
  • Some kind of RDF* to Neo4J and SPARQL* to OpenCypher converter [which I assume will happen independently of anything proposed here]
  • An OWL-RL* reasoner that could demonstrate simple yet powerful and useful queries, e.g. property chaining in Wikidata

A rough sketch of this approach was posted on public-owl-dev to not much fanfare, but, umm, this may not be the right forum for this.

Glossing over the details

For a post about semantics, I am glossing over the semantics a bit, at least from a formal computer science perspective. Yes, of course, there are some difficult details to be worked out regarding the extent to which existing RDF semantics can be layered on, and how to make these proposed layers compatible. I’m omitting details here to try and give as simple an overview as possible. And it also has to be said, one has to be pragmatic here. People are already making property graphs and RDF graphs conforming to the simple structures I’m describing here. Just look at Wikidata and how it handles (or rather, ignores) OWL. I’m just the messenger here, not some semantic anarchist trying to blow things up. Rather than worrying about whether such and such a fragment of FOL is decidable (which, let’s face it, is not that useful a property in practice), let’s instead focus on coming up with pragmatic standards that are compatible with the way people are already using technology!

obographviz 0.2.2 released – now with visualization of equivalence cliques

Version 0.2.2 of our obographviz javascript library is up on npm:

obographviz converts ontologies (in OBO Graph JSON) to dot/graphviz with powerful JSON stylesheet configuration. The JSON style sheets allow for configurable color and sizing of nodes and edges. While graphviz can look a bit static and clunky compared to these newfangled javascript UI libraries, I still find it the most useful way to visualize complex dense ontology graphs.
One of the most challenging things to visualize is the parallel structure of multiple ontologies covering a similar area (e.g. multiple anatomy ontologies, or phenotype ontologies, each covering a different species).
This new release allows for the nesting of equivalence cliques derived from either xrefs or equivalence axioms. This can be used to visualize the output of algorithms such as our kBOOM (Bayesian OWL Ontology Merging) tool. Starting with a defined set of predicates/properties (for example, the obo “xref” property, or an owl equivalentClasses declaration between two classes), it will transitively expand these until a maximal clique is found for each node. All such mutually connected nodes will then be visualized inside one of these cliques.
An example is shown below (Uberon (yellow) aligned with ZFA (black) and two Allen Brain Atlas ontologies (grey and pink); each ontology has its own color; any set of species-equivalent nodes are clustered together with a box drawn around them. isa=black, part_of=blue). Classes that are unique to a single ontology are shown without a bounding box.
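The clique-expansion step amounts to a connected-components computation over the chosen equivalence predicates; a minimal Python sketch (with made-up IDs) looks like this:

```python
# Sketch of the equivalence-clique step: treat xref/equivalence edges
# as undirected, expand transitively, and group mutually connected
# nodes. The IDs below are illustrative, not real ontology terms.
from collections import defaultdict

def equivalence_cliques(pairs):
    """Group nodes into cliques via transitive closure of equivalence edges."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, cliques = set(), []
    for node in neighbors:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(neighbors[n] - comp)
        seen |= comp
        cliques.append(comp)
    return cliques

pairs = [("UBERON:123", "ZFA:456"), ("ZFA:456", "ABA:789")]
print([sorted(c) for c in equivalence_cliques(pairs)])
# → [['ABA:789', 'UBERON:123', 'ZFA:456']]
```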

A developer-friendly JSON exchange format for ontologies

OWL2 ontologies can be rendered using a number of alternate concrete forms / syntaxes:

  • Manchester Syntax
  • Functional Syntax
  • RDF/Turtle

All of the above are official W3 recommendations. If you aren’t that familiar with these formats and the differences between them, the W3 OWL Primer is an excellent starting point. While all of the above are semantically equivalent ways to serialize OWL (with the exception of Manchester, which cannot represent some axiom types), there are big pragmatic differences in the choice of serialization. For most developers, the most important differentiating factor is support for their language of choice.

Currently, the only language I am aware of with complete support for all serializations is java, in the form of the OWLAPI. This means that most heavy-duty ontology applications use java or a JVM language (see previous posts for some examples of JVM frameworks that wrap the OWLAPI).

Almost all programming languages have support for RDF parsers, which is one reason why the default serialization for OWL is usually an RDF one. In theory it makes it more accessible. However, RDF can be a very low level way to process ontologies. For certain kinds of operations, such as traversing a simple subClassOf hierarchy, it can be perfectly fine. However, even commonly encountered constructs such as “X SubClassOf part-of some Y” are very awkward to handle, involving blank nodes (see the translation here). When it comes to something like axiom annotations (common in OBO ontologies), things quickly get cumbersome. It must be said though that using an RDF parser is always better than processing an RDF/XML file using an XML parser. This is two levels of abstraction too low, never do this! You will go to OWL hell. At least you will not be in the lowest circle – this is reserved for people who parse RDF/XML using an ad-hoc perl regexp parser.

Even in JVM languages, an OWL-level abstraction can be less than ideal for some of the operations people want to do on a biological ontology. These operations include:

  • construct and traverse a graph constructed from SubClassOf axioms between either pairs of named classes, or named-class to existential restriction pairs
  • create an index of classes based on a subset of lexical properties, such as labels and synonyms
  • Generate a simple term info page for showing in a web application, with common fields like definition prominently shown, with full attribution for all axioms
  • Extract some subset of the ontology

It can be quite involved doing even these simple operations using the OWLAPI. This is not to criticize the OWLAPI – it is an API for OWL, and OWL is in large part a syntax for writing set-theoretic expressions constraining a world of models. This is a bit of a cognitive mismatch for a hierarchy of lexical objects, or a graph-based organization of concepts, which is the standard abstraction for ontologies in Bioinformatics.

There are some libraries that provide useful convenience abstractions – this was one of the goals of OWLTools, as well as The Brain. I usually recommend a library such as one of these for bioinformaticians wishing to process OWL files, but it’s not ideal for everyone. It introduces yet another layer, and still leaves out non-JVM users.

For cases where we want to query over ontologies already loaded in a database or registry, there are some good abstraction layers – SciGraph provides a bioinformatician-friendly graph level / Neo4J view over OWL ontologies. However, sometimes it’s still necessary to have a library to parse an ontology fresh off the filesystem with no need to start up a webservice and load in an ontology.

What about OBO format?

Of course, many bioinformaticians are blissfully unaware of OWL and just go straight to OBO format, a format devised by and originally for the Gene Ontology. And many of these bioinformaticians seem reasonably content to continue using this – or at least lack the activation energy to switch to OWL (despite plenty of encouragement).

One of the original criticisms of Obof was its lack of formalism, but now Obof has a defined mapping to OWL, and that mapping is implemented in the OWLAPI. Protege can load and save Obof just as if it were any other OWL serialization, which it effectively is (without the W3C blessing). It can only represent a subset of OWL, but that subset is actually a superset of what most consumers need. So what’s the problem in just having Obof as the bioinformaticians’ format, and ontologists using OWL for heavy duty ontology lifting?

There are a few:

  • It’s ridiculously easy to create a hacky parser for some subset of Obof, but it’s surprisingly hard to get it right. Many of the parsers I have seen are implemented based on the usual bioinformatics paradigm of ignoring the specs and inferring a format based on a few examples. These have a tendency to proliferate, as it’s easier to write your own than to figure out whether someone else’s fits your needs. Even with the better ones, there are always edge cases that don’t conform to expectations. We often end up having to normalize Obof output in certain ways to avoid breaking crappy parsers.
  • The requirement to support Obof leads to cases of the tail wagging the dog, whereby ontology producers will make some compromise to avoid alienating a certain subset of users.
  • Obof will only ever support the same fixed subset of OWL. This is probably more than what most people need, but there are frequently situations where it would be useful to have support for one extra feature – perhaps blank nodes to support one level of nesting in an expression.
  • The spec is surprisingly complicated for what was intended to be a simple format. This can lead to traps.
  • The mapping between CURIE-like IDs and semantic web URIs is awkwardly specified and leads to no end of confusion when the semantic web world wants to talk to the bio-database world. Really we should have reused something like JSON-LD contexts up front. We live and learn.
  • Really, there should be no need to write a syntax-level parser. Developers expect something layered on XML or JSON these days (more so the latter).
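To make the first bullet concrete, here is the kind of naive stanza parser that gets written over and over – a deliberately minimal sketch, with the example stanza adapted from the mouse anatomy snippet discussed later in this post. It looks plausible, and the comments flag just a few of the spec details it silently gets wrong:

```python
# A deliberately naive Obof stanza parser, to illustrate how easy a
# "works on my examples" parser is to write -- and what it silently gets wrong.
def parse_obo(text):
    stanzas = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('[') and line.endswith(']'):
            current = {'type': line[1:-1], 'tags': []}
            stanzas.append(current)
        elif current is not None and ':' in line:
            tag, val = line.split(':', 1)
            current['tags'].append((tag.strip(), val.strip()))
        # Things this gets wrong: "!" comments are kept as part of the value,
        # escape sequences are not decoded, trailing dbxref lists and
        # qualifier blocks are not separated out, and header clauses before
        # the first stanza are dropped entirely.
    return stanzas

example = """[Term]
id: MA:0000043
name: ankle
relationship: part_of MA:0000026 ! hindlimb
"""
stanzas = parse_obo(example)
```

Note that the relationship value above comes back with the `! hindlimb` comment still glued on – exactly the kind of silent divergence that proliferating hand-rolled parsers introduce.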

What about JSON-LD?

A few years ago I asked on the public-owl-dev list if there were a standard JSON serialization for OWL. This generated some interesting discussion, including a suggestion to use JSON-LD.

I still think that this is the wrong level of abstraction for many OWL ontologies. JSON-LD is great and we use it for many instance-level representations, but it suffers from the same issues that all RDF layerings of OWL face: they are too low level for certain kinds of OWL axioms. Also, JSON-LD is a bit too open-ended for some developers, as graph labels are mapped directly to JSON keys, making it hard to program against predictably.

Another suggestion on the list was to use a relatively straightforward mapping of something like functional/abstract syntax to JSON. This is a great idea and works well if you want to implement something akin to the OWL API for non-JVM languages. I still think that such a format is important for increasing uptake of OWL, and hope to see this standardized.

However, we’re still back at the basic bioinformatics use case, where an OWL-level abstraction doesn’t make so much sense. Even if we get an OWL-JSON, I think there is still a need for an “OBO-JSON”, a JSON that can represent OWL constructs, but with a mapping to structures that correspond more closely to the kinds of operations like traversing a TBox-graph that are common in life sciences applications.

A JSON graph-oriented model for ontologies

After kicking this back and forth for a while we have a proposal for a graph-oriented JSON model for OWL, tentatively called obographs. It’s available at

The repository contains the start of documentation on the structural model (which can be serialized as JSON or YAML), plus java code to translate an OWL ontology to obograph JSON or YAML.

Comments are more than welcome, here or in the tracker. But first some words concerning the motivation here.

The overall goal was to make it easy to do the 99% of things that bioinformatics developers usually do, but without throwing the 1% under the bus. Although it is not yet a complete representation of OWL, the initial design allows for extension in this direction.

One consequence of this is that the central object is an existential graph (I’ll get to that term in a second). We call this subset Basic OBO Graphs, or BOGs, roughly corresponding to the OBO-Basic subset of OBO Format. The edge model is pretty much identical to every directed graph model out there: a set of nodes and a set of directed labeled edges (more on what can be attached to the edges later). Here is an example of a subset of two connected classes from Uberon:

{ "nodes" : [ {
      "id" : "UBERON:0002102",
      "lbl" : "forelimb"
    }, {
      "id" : "UBERON:0002101",
      "lbl" : "limb"
    } ],
  "edges" : [ {
      "subj" : "UBERON:0002102",
      "pred" : "is_a",
      "obj" : "UBERON:0002101"
    } ]
}

So what do I mean by existential graph? This is the graph formed by SubClassOf axioms that connect named classes to either other named classes or simple existential restrictions. Here is the mapping (shown using the YAML serialization – if we exclude certain fields like dates then JSON is a straightforward subset, so we can use YAML for illustrative purposes):

Class: C
  SubClassOf: D


 - subj: C
   pred: is_a
   obj: D
Class: C
  SubClassOf: P some D


 - subj: C
   pred: P
   obj: D

These two constructs correspond to is_a and relationship tags in Obof. This is generally sufficient as far as logical axioms go for many applications. The assumption here is that these axioms are complete to form a non-redundant existential graph.
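A sketch of why this graph abstraction is convenient: the typical bioinformatics operation over the BOG is a plain graph traversal, with no OWL machinery involved. Here is a minimal (illustrative, not optimized) ancestor computation over edges shaped like the ones above:

```python
from collections import deque

# Illustrative traversal over a basic OBO graph (BOG): collect all ancestors
# of a node by walking subj -> obj edges, optionally filtered by predicate.
def ancestors(node_id, edges, preds=None):
    index = {}
    for e in edges:
        if preds is None or e['pred'] in preds:
            index.setdefault(e['subj'], []).append(e['obj'])
    seen = set()
    queue = deque([node_id])
    while queue:
        for parent in index.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

edges = [
    {'subj': 'UBERON:0002102', 'pred': 'is_a', 'obj': 'UBERON:0002101'},  # forelimb is_a limb
]
```

Here `ancestors('UBERON:0002102', edges)` returns `{'UBERON:0002101'}`; restricting `preds` to `('is_a',)` gives the pure subsumption hierarchy.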

What about the other logical axiom and construct types in OWL? Crucially, rather than following the path of a direct RDF mapping and trying to cram all axiom types into a very abstract graph, we introduce new objects for increasingly exotic axiom types – supporting the 1% without making life difficult for the 99%. For example, AllValuesFrom expressions are allowed, but these don’t get placed in the main graph, as typically these do not get operated on in the same way in most applications.

What about non-logical axioms? We use an object called Meta to represent any set of OWL annotations associated with an edge, node or graph. Here is an example (again in YAML):

  - id: ""
    meta:
      definition:
        val: "Any constituent part of a cell, the basic structural and functional\
          \ unit of all organisms."
        xrefs:
        - "GOC:jl"
      subsets:
      - ""
      - ""
      - ""
      - ""
      xrefs:
      - val: "NIF_Subcellular:sao628508602"
      synonyms:
      - pred: "hasExactSynonym"
        val: "cellular subcomponent"
        xrefs:
        - "NIF_Subcellular:sao628508602"
      - pred: "hasRelatedSynonym"
        val: "protoplast"
        xrefs:
        - "GOC:mah"
    type: "CLASS"
    lbl: "cell part"


Meta objects can also be attached to edges (corresponding to OWL axiom annotations), or at the level of a graph (corresponding to ontology annotations). Oh, but we avoid the term annotation, as that always trips up people not coming from a deep semweb/OWL background.

As can be seen, commonly used OBO annotation properties get their own top level tag within a meta object, but other annotations go into a generic object.
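For example, a developer wanting synonyms can treat the meta object as plain JSON. A minimal sketch, assuming the synonym shape shown above (the node below is abbreviated, with the id and most meta fields omitted):

```python
# Sketch: pull synonyms out of a node's meta object, treating it as plain
# JSON. The node here is abbreviated from the "cell part" example above.
def synonyms(node, preds=('hasExactSynonym', 'hasRelatedSynonym')):
    meta = node.get('meta', {})
    return [s['val'] for s in meta.get('synonyms', []) if s.get('pred') in preds]

node = {
    'lbl': 'cell part',
    'type': 'CLASS',
    'meta': {
        'synonyms': [
            {'pred': 'hasExactSynonym', 'val': 'cellular subcomponent'},
            {'pred': 'hasRelatedSynonym', 'val': 'protoplast'},
        ],
    },
}
```

Passing a narrower `preds` tuple (say, only `hasExactSynonym`) filters the list accordingly.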

BOGs and ExOGs

What about the 1%? Additional fields can be used, turning the BOG into an ExOG (Expressive OBO graph).

Here is an example of a construct that is commonly used in OBOs, primarily used for the purposes of maintaining an ontology, but increasingly used for doing more advanced discovery-based inference:

Class: C
EquivalentTo: G1 and ... and Gn and (P1 some D1) and ... and (Pm some Dm)

Where all variables refer to named entities (C, Gi and Di are classes, Pi are Object Properties)

We translate to:

 nodes: ...
 edges: ...
 logicalDefinitionAxioms:
  - definedClassId: C
    genusIds: [G1, ..., Gn]
    restrictions:
    - propertyId: P1
      fillerId: D1
    - ...
    - propertyId: Pm
      fillerId: Dm

Note that the above transform is not expressive enough to capture all equivalence axioms. Again the idea is to have a simple construct for the common case, and fall-through to more generic constructs.
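As an illustration of what consuming the simple construct looks like, here is a sketch that renders such an object back into a Manchester-like string. The field names (definedClassId, genusIds, propertyId, fillerId) follow the translation sketched above, plus an assumed restrictions key grouping the property/filler pairs – check these against the structural model in the repo:

```python
# Sketch: render a genus-differentia definition object back into a
# Manchester-like string. Field names follow the translation above; the
# "restrictions" key is an assumption to be checked against the repo.
def render_logical_def(ax):
    parts = list(ax['genusIds'])
    parts += ['{} some {}'.format(r['propertyId'], r['fillerId'])
              for r in ax['restrictions']]
    return '{} EquivalentTo {}'.format(ax['definedClassId'], ' and '.join(parts))

ax = {
    'definedClassId': 'C',
    'genusIds': ['G1'],
    'restrictions': [{'propertyId': 'P1', 'fillerId': 'D1'}],
}
```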

Identifiers and URIs

Currently all the examples in the repo use complete URIs, but this is in progress. The idea is that the IDs commonly used in bioinformatics databases (e.g. GO:0008150) can be supported, but the mapping to URIs can be made formal and unambiguous through the use of an explicit JSON-LD context, and one or more default contexts. See the prefixcommons project for more on this. See also the prefixes section of the ROBOT docs.
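The intended mapping can be sketched in a few lines: a JSON-LD-style context maps each ID prefix to a URI base, and expansion is just concatenation. The prefix entries below are illustrative; the real mappings would come from a shared context such as prefixcommons:

```python
# Sketch of CURIE -> URI expansion via a JSON-LD-style context.
# The prefix-to-base mappings here are illustrative examples.
context = {
    'GO': 'http://purl.obolibrary.org/obo/GO_',
    'UBERON': 'http://purl.obolibrary.org/obo/UBERON_',
}

def expand(curie, context):
    prefix, sep, local = curie.partition(':')
    base = context.get(prefix)
    # Fall through unchanged if there is no colon or the prefix is unknown
    return base + local if (sep and base) else curie
```

The unambiguity the post asks for amounts to this function being deterministic given one agreed context.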

Documentation and formal specification

There is as yet no formal specification. We are still exploring possible shapes for the serialization. However, the documentation and examples provided should be sufficient for developers to grok things fairly quickly, and for OWL folks to get a sense of where we are going. Here are some things potentially useful for now:


The GitHub repo also houses a reference implementation in Java, plus an OWL to JSON converter script (reverse is not yet implemented). The java implementation can be used as an object model in its own right, but the main goal here is to make a serialization that is easy to use from any language.

Even without a dedicated API, operations are easy with most languages. For example, in python to create a mapping of ids to labels:

import json

with open('foo.json') as f:
    gdoc = json.load(f)

lmap = {}
for g in gdoc['graphs']:
    for n in g['nodes']:
        lmap[n['id']] = n['lbl']

Admittedly this particular operation is relatively easy with rdflib, but other operations become more awkward (not to mention the disappointingly slow performance of rdflib).

There are a number of applications that already accept obographs. The central graph representation (the BOG) corresponds to a bbop-graph. This is the existential graph representation we have been using internally in GO and Monarch. The SciGraph API sends back bbop-graph objects by default.

Some additional new pieces of software supporting obographs:

  • noctua-reasoner – a javascript reasoner supporting a subset of OWL-RL, intended for client-side reasoning in browsers
  • obographviz – generation of dot files (and pngs etc) from obographs, allowing many of the same customizations as blipkit


At this stage I am interested in comments from a wider community, both in the bioinformatics world, and in the semweb world.

Hopefully the former will find it useful, and will help wean people off of oboformat (to help this, ontology release tools like ROBOT and OWLTools already or will soon support obograph output, and we can include a json file for every OBO Library ontology as part of the central OBO build).

And hopefully the latter will not be offended too much by the need to add yet another format into the mix. It may even be useful to some parts of the OWL/semweb community outside bioinformatics.


Creating an ontology project, an update

  • In a previous post, I recommended some standard ways of managing the various portions of an ontology project using a version control system like GitHub.

Since writing that post, I’ve written a new utility that makes this task even easier. With the ontology-starter-kit you can generate all your project files and get set up for creating your first release in minutes. This script takes into account some changes since the original post two years ago:

  • Travis-CI has become the de-facto standard continuous integration system for performing unit tests on any project managed in GitHub (for more on CI see this post). The starter-kit will give you a default travis setup.
  • Managing your metadata and PURLs on the OBO Library has changed to a GitHub-based system:
  • ROBOT has emerged as a simpler way of managing many aspects of a release process, particularly managing your external imports

Getting started

To get started, clone or download cmungall/ontology-starter-kit

Currently, you will need:

  • perl
  • make
  • git (command line client)

For best results, you should also download owltools, oort and robot (in the future we’ll have a more unified system)

You can obtain all these by running the install script:


This should be run from within the ontology-starter-kit directory

Then, from within that directory, you can seed your ontology:

./  -d ro -d uberon -u obophenotype -t cnidaria-ontology cnido


This assumes that you are building some kind of extension to uberon, using the relation ontology (OBO Library ontology IDs must be used here), that you will be placing this in the obophenotype organization, that the repo name is obophenotype/cnidaria-ontology, and that IDs will be of the form CNIDA:nnnnnnn

After running, the repository will be created in the target/cnidaria-ontology folder, relative to where you are. You can move this out to somewhere more convenient.

The script is chatty, and it informs you of how it is copying the template files from the template directory into the target directory. It will create your initial source setup, including a makefile, and then it will use that makefile to create an initial release, going so far as to init the git repo, and add and commit files (unless overridden). It will not go as far as to create a repo for you on GitHub, but it provides explicit instructions on what you should do next:

EXECUTING: git status
# On branch master
nothing to commit, working directory clean
0. Examine target/cnidaria-ontology and check it meets your expectations. If not blow it away and start again
1. Go to:
2. The owner MUST be obophenotype. The Repository name MUST be cnidaria-ontology
3. Do not initialize with a README (you already have one)
4. Click Create
5. See the section under '…or push an existing repository from the command line'
cd target/cnidaria-ontology
git remote add origin
git push -u origin master

Note also that it also generates a metadata directory for you, with .md and .yml files you can use for your project on obolibrary (of course, you need to request your ontology ID space first, but you can go ahead and make a pull request with these files).

Future development

The overall system may no longer be necessary in the future, if we get a complete turnkey ontology release system with capabilities similar to analogous tools in software development such as maven.

For now, the Makefile approach is most flexible, and is widely understood by many software developers, but a long standing obstacle has been the difficulty in setting up the Makefile for a new project. The starter kit provides a band-aid here.

If required, it should be possible to set up alternate templates for different styles of project layouts. Pull requests on the starter-kit repository are welcome!



A lightweight ontology registry system

For a number of years, I have been one of the maintainers of the registry that underpins the list of ontologies at the Open Biological Ontologies Foundry/Library ( I also built some of the infrastructure that creates nightly builds of each ontology, verifying it and providing versions in both obo format and owl.

The original system grew organically and was driven by an ultra-simple file called “ontologies.txt“, stored on google code. This grew to be supplemented by a collection of other files for maintaining the list of issue trackers, together with additional metadata to maintain the central OBO builds. The imminent demise of google code and the general creakiness and inflexibility of the old system has prompted the search for a new solution. I wanted something that would make it much easier for ontology providers to update their information, but at the same time allow the central OBO group the ability to vet and correct entries. We needed something more sophisticated than a flat key-value list, yet not overly complex. We also wanted something compatible with semantic web standards (i.e. to have an RDF file with a description of every ontology in it, using standard vocabularies and ontologies for the properties and classes). We also wanted it to look a bit nicer than the old site, which was looking decidedly 2000-and-late.

Screen Shot 2015-08-26 at 7.57.26 PM

The legacy OBOFoundry site, looking dated and missing key information

What are some of the options here?

  • A centralized wiki, with a page for each ontology, and groups updating their entry on the wiki
  • Each group embeds the metadata about the ontology in a website they maintain. This is then periodically harvested by the central registry. Options for embedding the metadata include microdata and RDFa
  • Each group maintains their own metadata in or alongside their ontology in rdf/owl, and this is periodically harvested
  • Piggy back off of an existing registry, e.g. BioPortal
  • A bespoke registry system, designed from the ground up, with its own relational database underpinning it, etc

These are all good solutions in the appropriate context, but none fitted our requirements precisely. Wikis are best for unstructured or loosely structured narrative text, but attempts to embed structured information inside wikis have been less than satisfactory. The microdata/RDFa approach is interesting, but not practical for us. Microdata is inherently limited in terms of extensibility, and RDFa is complex for many users. Additionally it requires both that groups produce their own web sites (many rely on the OBO Foundry to do this for them), and that we both harvest the metadata and relinquish control. As mentioned previously, it is useful for the OBO repository administrators to have certain fields be filled in centrally (sometimes for policy reasons, sometimes technical). The same concerns underpin the fully decentralized approach, in which every group maintains the metadata directly as part of the ontology, and we harvest this.

Existing registries are built for their own requirements. A bespoke registry system is attractive in many ways, as this can be highly customized, but this can be expensive and we lacked the resources for this.

Solution: GitHub pages and “YAML-LD”

I initially prototyped a solution making use of the GitHub pages framework, driven by YAML files. This can be considered a kind of bespoke system, contradicting what I said above. But rather than rolling an entire framework from scratch, the system is really just some templates gluing together existing systems. GitHub support for social coding and YAML helped a lot. The system was very quick to develop and it soon morphed into the actual system to replace the old OBO site.


YAML is a markup language that superficially resembles the tag-value stanza format we were previously using, but crucially allows for nesting. Here is an example of a snippet of YAML for a cephalopod ontology:

id: ceph
title: Cephalopod Ontology
contact:
  label: Chris Mungall
description: An anatomical and developmental ontology for cephalopods
taxon:
  id: NCBITaxon:6605
  label: Cephalopod

Note that certain tags have ‘objects’ as their values, e.g. contact and taxon.

We stick to the subset of YAML that can be represented in JSON, and we can thus define a JSON-LD context, allowing for a direct translation to RDF, which is nice. This part is still being finalized, but the basic idea is that keys like ‘title’ will be mapped to dc:title, and the taxon CURIE will be expanded to the full PURL for cephalopoda.
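A minimal sketch of that translation idea: the context maps registry keys to predicate URIs (the Dublin Core ones for title and description), and each entry yields a set of triples. The base URI and the exact key-to-predicate mappings here are illustrative, since the real context is still being finalized:

```python
# Illustrative sketch of the YAML/JSON-LD -> RDF idea: a context maps
# registry keys to predicate URIs (Dublin Core here), and each entry
# becomes a set of triples. The base URI is an assumption for illustration.
context = {
    'title': 'http://purl.org/dc/elements/1.1/title',
    'description': 'http://purl.org/dc/elements/1.1/description',
}

def to_triples(entry, context, base='http://purl.obolibrary.org/obo/'):
    subject = base + entry['id']
    return [(subject, context[k], v) for k, v in entry.items() if k in context]

entry = {'id': 'ceph', 'title': 'Cephalopod Ontology',
         'description': 'An anatomical and developmental ontology for cephalopods'}
triples = to_triples(entry, context)
```

Keys absent from the context (like id itself) are simply not emitted as triples, which is one of the things that makes the direct YAML-to-RDF translation nice.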

The basic idea is to manage each ontologies metadata as a separate YAML file in a GitHub repository. GitHub features nice builtin YAML rendering, and files can be edited via the GitHub web-interface, which is YAML-aware.

The list of metadata files is here. Note that these are markdown files (the .md stands for markdown, not metadata). YAML can actually be embedded in Markdown, so each file is a mini-webpage for the ontology with the metadata embedded right in there. This is in some ways similar to the microdata/RDFa approach but IMHO much more elegant.

GitHub Pages

Each markdown file is rendered attractively through the GitHub interface – for example, here is the md file for the environment ontology, rendered using the builtin GitHub md renderer. Note the yaml block contains structured data and the rest of the file can contain any mixture of markdown and HTML which is rendered on the page. We can do better than this using GitHub pages. Using a simple static site generator and templating system (Jekyll/liquid) we can render each page using our own CSS with our own format. For example here is ENVO again, but rendered using Jekyll. Note that we aren’t even running our own webserver here, this is all a service provided for us, in keeping with our desire to keep things lightweight and resource-light.

Screen Shot 2015-08-26 at 10.43.30 PM

The entire system consists of a few HTML templates plus a single python script that derives an uber-metadata file that powers the central table (visible on the front page).
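The aggregation idea can be sketched in a few lines of python. This is not the actual script, just an illustration of the approach: extract the YAML front matter from each per-ontology markdown page, then splice the entries into one uber-metadata YAML list:

```python
import re

# Sketch of the idea behind the central aggregation script: pull the YAML
# front matter out of each per-ontology markdown page and merge the entries
# into one uber-metadata file. (Illustrative only; the real script does more.)
FRONT_MATTER = re.compile(r'\A---\n(.*?)\n---', re.DOTALL)

def extract_entry(markdown_text):
    """Return the YAML front-matter block of one registry page, or None."""
    m = FRONT_MATTER.match(markdown_text)
    return m.group(1) if m else None

def merge_entries(blocks):
    """Concatenate per-ontology YAML blocks into a single YAML list."""
    items = []
    for b in blocks:
        lines = b.splitlines()
        items.append('\n'.join(['- ' + lines[0]] + ['  ' + l for l in lines[1:]]))
    return 'ontologies:\n' + '\n'.join(items)

# In the real system the blocks would come from files, e.g. via glob.glob('ontology/*.md')
page = '---\nid: ceph\ntitle: Cephalopod Ontology\n---\nFree-text description here.'
uber = merge_entries([extract_entry(page)])
```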

Distributed editing

Where the system really shines is the distributed and social editing model. All of this comes for free when hosted on GitHub (in theory GitLab or some other sites should work). Anyone can come along and fork the github repository into their own userspace and make edits – they can even do this without leaving their web browser (see the Edit button on the bottom left of every OBO ontology page).

What’s to stop some vandal trashing the registry? Crucially, any edits made by a non-owner remain in their own fork until they issue a Pull Request. After that, someone from OBO will come along and either merge in the pull request, or close it (giving a reason why they did not merge, of course). The version control system maintains a full audit trail of this; premature merges can be rolled back, etc.

The task of the OBO team is made easier thanks to Travis-CI, a Continuous Integration system integrated into GitHub. I configured the OBOFoundry github site with a Travis configuration file that instructs Travis to check every pushed commit using an automated test suite – this ensures that people editing their yaml files don’t make syntax errors, or omit crucial metadata.
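The checks themselves can be very simple. Here is an illustrative sketch of the kind of test the suite runs against each parsed entry; the required-field names are assumptions for the purposes of the example:

```python
# Illustrative sketch of a registry check: every parsed entry must carry a
# minimal set of fields. The required-field names here are assumptions.
REQUIRED = ('id', 'title', 'contact', 'description')

def validate(entry):
    return ['missing required field: ' + k for k in REQUIRED if k not in entry]

entry = {'id': 'ceph', 'title': 'Cephalopod Ontology'}
```

A CI run would parse every YAML file, call something like `validate` on each, and fail the build if any list of errors is non-empty – which is what catches syntax errors and omitted metadata before a pull request is merged.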

github merge page with travis check

Screenshot of GitHub pull request, showing a passed Travis check

I have previously written about the use of Continuous Integration in ontology development – although CI was developed primarily for software engineering products, it works surprisingly well for ontologies and for metadata. This is perhaps not surprising if we consider that these are engineered artefacts, much in the way software is.

The whole end-to-end process is documented in this FAQ entry on the site.

The system has been working extremely well and is popular among the ontology groups that contribute their expertise to OBO – before official launch of the new site, we had 31 closed pull requests. Whereas previously a member of the OBO team would have to coordinate with the ontology provider to enter the metadata (a time consuming process prone to errors and backlogs), now the provider has the ability to enter information themselves, with the benefit of validation from Travis and the OBO team.

Other features

The new site has many other improvements over the last one. It’s now possible to distinguish between the ontology sensu the umbrella entity vs individual ontology products or editions. For example, the various editions of Uberon (basic, core, composite metazoan) can each be individually registered and validated. There are also a growing number of properties that can be associated with the ontology, from a twitter handle to logos to custom browsers. Hopefully some of these features will be useful to the OBO community. Of course, the overall look could still easily be massively improved by someone with some web design chops (it’s very bland generic bootstrap at the moment). But this isn’t really the point of this post, which is more about the application of a certain set of technologies to allow a balance between centralization and distributed editing that suits the needs of the OBO Foundry. Leveraging existing services like GitHub pages, Travis and the GitHub fork-and-pull-request model allows us to get more mileage for less effort.

The future of metadata

The new OBO site was inspired in many ways by the system developed by my colleague Jorrit Poelen for the Global Biotic Interactions database (GloBI), in which simple JSON metadata files describing each interaction dataset are provided in individual GitHub repositories. A central system periodically harvests these into a large searchable index, where different datasets are integrated. This is not so different from common practice among software developers, who provide metadata for their project in the form of pom.xml files and package.json files (not out of their love of metadata, but more because this provides a useful service or is necessary for working in a wider ecosystem, and integrating with other software components). As James Malone points out, it makes far more sense to simply pull this existing metadata rather than force developers to register in a monolithic rigid centralized registry. If there are incentives for providers of any kind of information artefacts (software, ontologies, datasets) to provide richer metadata at source in large already-existing open repositories such as GitHub then it does away with the need to build separately funded large monolithic registries. The new OBO system and the GloBI approach are demonstrating some of these incentives for ontologies and datasets. The current OBO system still has a large centralized aspect, due in part to the nature of the OBO Foundry, but in future may become more distributed.

Chris Mungall

A response to “Unintended consequences of existential quantifications in biomedical ontologies”

In Unintended consequences of existential quantifications in biomedical ontologies, Boeker et al attempt to

…scrutinize the OWL-DL releases of OBO ontologies to assess whether their logical axioms correspond to the meaning intended by their authors

The authors examine existential restriction axioms in a number of ontologies (whose source is in obo-format) and rate them according to the correspondence between the semantics and the presumed author intent. They claim:

  • usability issues with OBO ontologies
  • lack of ontological commitment for several common terms
  • proliferation of domain-specific relations
  • numerous assertions which do not properly describe the underlying biological reality, or are ambiguous and difficult to interpret.

The proposed solution:

The solution is a better anchoring in upper ontologies and a restriction to relatively few, well defined relation types with given domain and range constraints

I think this is an interesting paper, and have great respect for all the authors involved. However, I find some of the claims to be suspect and in need of being countered. I do think the paper shows that we need much better ontology and ontology-technology documentation from the OBO Foundry effort (of which I am a part); however, I think the authors have read far too much into the lack of documentation and consequently muddy the issues on a number of matters.

The initial misunderstanding is presented at the start of the paper:

This extract asserts the relationship part_of between the terms ankle and hindlimb in OBO format.

id: MA:0000043
name: ankle
relationship: part of MA:0000026 ! hindlimb

This assertion does not commit to a semantics in terms of the real world entities which are denoted by the terms. It does not allow us to infer that, e.g., all hindlimbs have ankles, or all ankles are part of a hindlimb. Descriptions at this level require some kind of ontological interpretation for the OBO syntax in terms of OWL axioms, as OWL axioms are explicitly quantified

In fact this is incorrect. There is an ontological interpretation for the OBO syntax in terms of OWL axioms (which the authors provide, falsely stating that it is “one such interpretation”):

Ankle subClassOf part_of some Hindlimb

The authors provide links to official documentation confirming that this is the correct interpretation. They then go on to say:

Our mouse limb example could therefore be alternatively translated into at least the following three OWL expressions:

(i) Ankle subClassOf part_of some Hindlimb

(ii) Ankle subClassOf part_of exactly 1 Hindlimb

(iii) Ankle subClassOf part_of only Hindlimb

In fact there is some legitimate confusion over the interpretation of relations due to the impedance mismatch between the treatment of time in the 2005 Relations Ontology paper and what is possible in OWL. But positing additional unwarranted interpretations just muddies the waters. In fact, regardless of the time issue, the RO 2005 paper is quite clear that the relations used should be read in an all-some fashion (i.e. interpretation (i)). This is consistent with the Golbreich/Horrocks translation and its current successor, the obof1.4 specification, both of which are cited by the authors.
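The all-some reading is easy to make concrete with a toy model: interpretation (i) is violated exactly when some instance of the subclass lacks any part_of link to an instance of the filler class. A sketch, with entirely hypothetical instance data:

```python
# Toy check of the all-some reading of
#   Ankle SubClassOf part_of some Hindlimb
# The axiom is violated by a model exactly when some instance of Ankle has
# no part_of link to any instance of Hindlimb. All data here is hypothetical.
def violates_all_some(instances, links, sub, prop, filler):
    fillers = set(instances.get(filler, []))
    for x in instances.get(sub, []):
        if not any(p == prop and y in fillers for (p, y) in links.get(x, [])):
            return True
    return False

instances = {'Ankle': ['ankle1', 'ankle2'], 'Hindlimb': ['hindlimb1']}
links = {'ankle1': [('part_of', 'hindlimb1')]}  # ankle2 has no part_of link
```

Note what the check does not require: it says nothing about every hindlimb having an ankle, nor about ankles being part of only hindlimbs – which is exactly why interpretations (ii) and (iii) are distinct, stronger claims.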

This claimed lack of a standard interpretation informs the main thesis advanced by the authors: the translation of obo-format relationships to existential restrictions is not always what ontology authors intend. In fact they are testing for something stronger, specifically the claim that every such translated existential restriction implies existential dependence, where this is defined:

x dependsG for its existence upon Fs = df

Necessarily, x exists only if some F exists

It is worth noting that the dependence claim they are testing is a strong one: it is stronger than anything in the OWL semantics, and would be violated by a number of other ontologies, many in OWL, such as the NCIt, due to the prevalence of “may_do_X” type relations. This is a subtle point that may escape the casual reader of the paper.

The authors examined axioms in a number of ontologies and evaluated them to see whether there were uses of existential restrictions where this strong dependence claim is not justified. Their test set included ontologies from the OBO library as well as a number of external support ontologies (aka “cross product” ontologies). Most of these ontologies currently use obo-format as their source. They did not invite external domain experts, and they did not check their results with the authors of the ontologies.

The authors provide examples where they believe there are unintended consequences of existential restrictions, based on this strong interpretation. Many of the examples they provide are problematic, as I will illustrate.

They provide this example from the GO:

“Interkinetic nuclear migration SubClassOf

part_of some Cell proliferation in forebrain

The ontological dependence expressed by this assertion is that there are no interkinetic nuclear migration processes without a corresponding cell proliferation in forebrain process. This is obviously false, since interkinetic nuclear migration is a very fundamental cell process, which is not limited to forebrains. An easy fix to this error is the inversion of the expression by using the inverse relationship:

Cell proliferation in forebrain subclassOf

has_part some Interkinetic nuclear migration”

In fact, the GO editors are well aware of the all-some interpretation, they did intend to say that all instances of IKNM are in a forebrain, this is clear from the textual definition (I have highlighted the relevant part):

id: GO:0022027
name: interkinetic nuclear migration
def: “The movement of the nucleus of the ventricular zone cell between the apical and the basal zone surfaces. Mitosis occurs when the nucleus is near the apical surface, that is, the lumen of the ventricle.” [GO_REF:0000021, GOC:cls, GOC:dgh, GOC:dph, GOC:jid, GOC:mtg_15jun06]
is_a: GO:0051647 ! nucleus localization
relationship: part_of GO:0021846 ! cell proliferation in forebrain

The mistake GO has made is giving the class a misleadingly generic label. This kind of thing is not unheard of in the GO – a class is given a label that has a specific meaning to one community when in fact the label is used more generally by a wider community. This is not to understate this kind of mistake – it’s actually quite serious (annotators are meant to always read the definition but unfortunately this rule isn’t always followed). However, the problem is entirely terminological and not in any way related to interpretations of the relationship tag or existential quantification. The creators of this class really did intend to restrict the location to the forebrain (this was confirmed by one of the GO editors listed in the provenance for the definition).

The authors are on safer ground with their analysis of structural relations such as has_parent_hydride in CHEBI. I don’t have such a problem here, but it would have been useful to see the claims tested. Can a reasoner detect an inconsistency in the ontology (supplemented with additional axioms)? It seems that the problem lies less in the computational properties of the existential restriction and more in the existential dependence claim (which, remember, is stronger than what is claimed by the OWL semantics).

They also cover what they perceive to be a problem with the use of existential restrictions in conjunction with what BFO calls “realizables”:

A statement such as

Anisotropine methylbromide subclassOf has_role some Anti-ulcer drug

in ChEBI asserts that each and every anisotropine methylbromide molecule has the role of an anti-ulcer drug. However, this role may never be realized for a particular molecule instance, since that molecule may play a different role in the treatment of a different disease, or play no role at all. It is thus problematic to assert an existential dependence between the molecule and the realization of the role (in the treatment of an ulcer)

This is a reasonable philosophical analysis. But are there actually any negative consequences for a user of the ontology or for reasoning? Does it lead to any incorrect inferences? I’m not convinced that an existential restriction is so wrong here. The problems uncovered with this example really stem from some obscure conditions on BFO roles (i.e. all roles are realized – if roles were like dispositions this would not be a problem), and to be fair to the CHEBI people, they might not have been aware of that when they made the axiom (BFO needs better, more user-friendly documentation).

The same “problem” is uncovered with some of the GO MF cross products, but this time the mistake lies with the authors. They say:

This is particularly apparent in the Gene Ontology molecular function ontology. For example, the statement

tRNA sulfurtransferase subClassOf

has_input some Transfer RNA

asserts a dependency of every instance of tRNA sulfurtransferase on some instance of Transfer RNA. Functions include the possibility that the bearer of a function is never involved in any process that realizes the function, thus may never have input molecules. This kind of error predominates in the Cross Product sample, especially in the cross product ‘GO Molecular Function X ChEBI’. Interrater agreement was low here because of two conflicting positions: (1) the assertion is false, because functions can remain unrealized, or (2) the assertion is true, but the categorization as a function is false, as implied by the suffix “activity”.

In fact this latter interpretation (2) is the correct one. The term in the GO is “tRNA sulfurtransferase activity“. Now, the authors do have a good point here – the ontological commitment of GO towards BFO was unclear (this is now made more explicit with an ontology of bridging axioms that make “catalytic activity” a subclass of bfo:process – but note this is still controversial with the BFO people). With the correct interpretation, the authors’ statement that “This kind of error predominates in the Cross Product sample” is not supported. The authors have simply jumped to the conclusion that everything in GO MF must be a BFO function based on the name of the ontology (which was named by biologists, not philosophers) and extrapolated from this that the ontology is full of errors, in particular unintended consequences of existential restrictions.

Interestingly, whilst focusing on the inconsequential problem of violation of existential dependence, they missed the real unintended consequences. In fact there is a lurking closed world assumption in all of these GO MF logical definitions, and OWL is open world! Each of the reactions that is defined in terms of inputs and outputs should explicitly state the cardinality of all participants, and in addition there needs to be a closure axiom to say there are no additional inputs or outputs. Unlike the philosophical problem of positing existential dependence between a function and a (potential) continuant (which is not even a problem here, as the reactions are intended to be interpreted as bfo:processes), this gives results that are empirically wrong! So there was a real, serious example of unintended consequences, and it was missed.
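The missing closure can be illustrated with a toy sketch in plain Python. The reaction name and inputs below are hypothetical stand-ins, not actual GO or ChEBI axioms: under the open world assumption, the absence of an input assertion never licenses a “no”, so a query over the definition as written can only answer “at least these inputs”. An explicit closure axiom turns the same query into a definite answer.

```python
# Toy illustration of why GO MF reaction definitions need closure axioms.
# The reaction and its inputs are hypothetical, not real GO/ChEBI data.

asserted_inputs = {
    "tRNA sulfurtransferase reaction": {"tRNA", "ATP"},
}

# Reactions for which a closure axiom has been asserted
# ("the inputs are exactly these, and nothing else").
closed_reactions = {"tRNA sulfurtransferase reaction"}

def has_input_owa(reaction, chemical):
    """Open world: absence of an assertion entails nothing."""
    if chemical in asserted_inputs.get(reaction, set()):
        return "yes"
    return "unknown"  # cannot conclude "no" from a missing triple

def has_input_closed(reaction, chemical):
    """With a closure axiom, absence does entail 'no'."""
    if chemical in asserted_inputs.get(reaction, set()):
        return "yes"
    return "no" if reaction in closed_reactions else "unknown"

print(has_input_owa("tRNA sulfurtransferase reaction", "GTP"))     # unknown
print(has_input_closed("tRNA sulfurtransferase reaction", "GTP"))  # no
```

Without the closure (and cardinality) axioms, a query for the exact participants of a reaction silently returns the open-world “unknown” branch, which is exactly the empirically wrong behaviour described above.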

It’s important to bear in mind that some of these cross-product files are separate from the main ontology, not fully official, and not of as high quality as the main ontology. The GO BP ChEBI ones are maintained by the GO editors and are high quality, but the definitions for GO MF reactions were created by me, semi-automatically, and are not of as high quality. This draws attention to the need for better documentation here – if the paper had simply criticized the lack of documentation and of a clear commitment to BFO, it would have been spot on, but instead the authors use this to make spurious claims.

The authors are on shaky ground with anatomy ontologies again:

Time dependencies

These are commonly expressed in ontologies encoding development or other time-dependent processes. Kinds of participation in such time dependent processes can be difficult to pin down as can the exact ontological dependence between the process and the material entities. The start and end relations are intending to express just such time dependencies to do with the development of anatomical structures.

Pharyngeal endoderm subClassOf

end some Pharyngula:Prim-15

Roof plate rhombomere 5 subClassOf

start some Segmentation:10-13 somites

However, the stages of development mentioned may not be complete before the material entity comes fully into existence. They also may not be complete when the material entity stops existing. It is difficult to claim a processual entity (which extends over time) is ontologically necessary for a material entity to exist (the claim of existential dependence) unless the material entity was a clear output of this process. The solution here is, again, to substitute existential restriction by value restriction, such as

Pharyngeal endoderm subClassOf

end only Pharyngula:Prim-15

It’s difficult to see a real problem here. So what if the stages are not completed? The authors’ problem is not with the generated OWL but with the strong existential dependence claim: “It is difficult to claim a processual entity (which extends over time) is ontologically necessary for a material entity to exist (the claim of existential dependence) unless the material entity was a clear output of this process“. To which I would reply: so what? The authors are testing a claim that is too strong. The OWL is correct, and the OWL does not make any claims about existential dependence; that claim is in the authors’ minds. It’s difficult to see any practical problems with the OWL representation of the ZFA relations here. If the problem is purely philosophical, it should be published in a philosophical journal.

Furthermore, the supposed correct solution to this non-problem is terrible: using a universal restriction means fixing a single, artificial level of granularity for the stages (i.e. we can’t have a property chain end o part_of -> end, which would then lead to incorrect inferences).
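To make the granularity point concrete, here is a minimal sketch in plain Python of the inference the chain end o part_of -> end licenses: an ending asserted against a fine-grained stage propagates to any coarser stage it is part of. The instance names are hypothetical, not real ZFA data. With existential restrictions this propagation is exactly what you want; with an `end only Prim-15` restriction, the same propagation would produce violations at every other granularity.

```python
# Minimal sketch of the property chain  end o part_of -> end,
# applied to hypothetical instance-level triples (not real ZFA data).

triples = {
    ("endoderm1", "end", "prim15_early"),     # fine-grained stage
    ("prim15_early", "part_of", "Prim-15"),   # coarser enclosing stage
}

def apply_end_chain(triples):
    """Saturate the triple set under  end o part_of -> end."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        for (x, p, y) in list(triples):
            if p != "end":
                continue
            for (y2, q, z) in list(triples):
                if q == "part_of" and y2 == y and (x, "end", z) not in triples:
                    triples.add((x, "end", z))
                    changed = True
    return triples

inferred = apply_end_chain(triples)
print(("endoderm1", "end", "Prim-15") in inferred)  # True
```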

Again, there are some real problems with some ontologies hidden here:

  • the start and end relations are undefined. A number of people are working on this, but admittedly it’s lame that there is no standard solution in place yet. We should at least have some better documentation for start/end.
  • there are a lot of hidden assumptions in OBO ontologies regarding how applicable each ontology is to the full range of variation found in nature. The FMA has a whole story about “canonical anatomy” here. For many biological ontologies there’s a shared understanding between authors and users that the ontology represents a simplified reference picture, and in fact there may be the occasional zebrafish pharyngeal endoderm that ends a bit after or a bit before the prim-15 stage. We in the OBO Foundry could be doing more to ensure this is explicit.

If the authors had highlighted this I would have agreed wholeheartedly and apologized for the current state of affairs. However, this doesn’t really have anything to do with “unintended consequences of existential quantifications”. It’s just a plain case of lack of documentation (not that this is excusable, but the point is that the paper is not titled “lack of documentation in biomedical ontologies”).

Finally, the authors also include a discussion of the use of existential restrictions in conjunction with relations such as lacks_part. This part is fairly reasonable but most of it has been said in other publications. There are actually some subtleties here when it comes to phenotype ontologies, but this is best addressed in a separate post. There is a solution partly in place now involving shortcut relations, but this wasn’t mature when the authors wrote the paper, so fair enough.

Overall I wasn’t convinced by these results. The results were not externally validated (this would have been easy to do – for example, by contacting the ontology authors or pointing out the errors on a tracker) and relied on the subjective opinions of the authors (and even then they largely did not agree). In addition, the relationships were being tested for existential dependence, and it’s no surprise that continuant-stage relationships don’t conform to this, nor is this a problem.

Based on these results, the authors go on to conclude:

Our scrutiny of the OBO Foundry candidate ontologies and cross products yielded a relatively high proportion of inappropriate usages of simple logical constructors. Only focusing on the proper use of existential restriction in class definitions, we found up to 23% of unintended consequences in these constructions. Many Foundry ontologies are widely used throughout the biomedical domain, and therefore such a high error rate seems surprising.

[my emphasis]. To a casual reader the “23%” sounds terrible. But remember:

  • the authors made mistakes in their evaluation – e.g. with GO
  • the authors over-interpreted in many cases, leading to inflation of numbers
  • in the case of roles, the problem is really in adhering to the BFO definition
  • the unintended consequences are largely philosophical consequences regarding existential dependence rather than consequences that would manifest computationally in reasoning.

The last sentence is telling:

Many Foundry ontologies are widely used throughout the biomedical domain, and therefore such a high error rate seems surprising

Indeed, it would be surprising if this were the case. The ZFA is used every day to power gene expression queries on the ZFIN website. Why haven’t any of their users cottoned on to these errors? It’s a mystery. Or perhaps not. Perhaps the authors are seeing errors where there are in fact none.

Most of the existential restrictions are in fact, contrary to what the authors claim, intended to be existential restrictions. In some cases, such as the “smoking may cause cancer” type examples, the problems exist only on a philosophical level, and even then only if you make certain philosophical assumptions. Saying smoking “SubClassOf may_cause some cancer” would be an example of an unintended consequence according to the authors, because it implies that every instance of smoking is existentially dependent on some instance of cancer, which is philosophically problematic (to some). Nevertheless, it’s well known that many people working with OWL ontologies use this idiom because there’s no modal operator in OWL; it’s practical and gives the desired inferences. See What causes pneumonia for more discussion.

The authors go on:

We hypothesize that the main and only reason why this has little affected the usefulness of these ontologies up to now is due to their predominant use as controlled vocabularies rather than as computable ontologies. Misinterpretations of this sort can cause unforeseeable side effects once these ontologies are used for machine reasoning, and the use of logic-based reasoning based on biomedical ontologies is increasing with the advent of intelligent tools surrounding the adoption of the OWL language.

In fact it would be easy to test this hypothesis. If it were true, then it should be possible to add biologically justifiable disjointness axioms to the ontology and then use a reasoner to find the unsatisfiable classes that arise from the purported incorrect use of existential restrictions. It is a shame the authors did not take this empirical approach and instead opted for a more subjective ranking approach.
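The shape of that empirical test can be sketched in a few lines of plain Python. This is a toy satisfiability check, not a real DL reasoner (a real test would use something like HermiT), and the class names and the deliberately contradictory axiom below are illustrative only: if an existential restriction really were wrong, a biologically justifiable disjointness axiom would surface it as an unsatisfiable class.

```python
# Toy sketch of the suggested empirical test: add disjointness axioms and
# check whether existential restrictions make classes unsatisfiable.
# Hand-rolled check with hypothetical axioms; not a real DL reasoner.

subclass_of = {
    "IKNM": {"NucleusLocalization"},
    # Deliberately contradictory, to simulate a wrong axiom:
    "CellProliferationInForebrain": {"ForebrainProcess", "NonForebrainProcess"},
}
exists = {  # C SubClassOf (property some Filler)
    "IKNM": [("part_of", "CellProliferationInForebrain")],
}
disjoint = {("ForebrainProcess", "NonForebrainProcess")}

def superclasses(c):
    """Reflexive-transitive closure over asserted named superclasses."""
    seen, todo = {c}, [c]
    while todo:
        for s in subclass_of.get(todo.pop(), set()):
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return seen

def unsatisfiable(c, visiting=frozenset()):
    if c in visiting:          # guard against cycles
        return False
    sup = superclasses(c)
    # Direct clash: two disjoint classes among the superclasses.
    if any((a, b) in disjoint or (b, a) in disjoint
           for a in sup for b in sup if a != b):
        return True
    # C SubClassOf (p some D): C is unsatisfiable if its filler D is.
    return any(unsatisfiable(d, visiting | {c})
               for s in sup for _, d in exists.get(s, []))

print(unsatisfiable("CellProliferationInForebrain"))  # True
print(unsatisfiable("IKNM"))                          # True, via the filler
print(unsatisfiable("NucleusLocalization"))           # False
```

Note how the unsatisfiability propagates through the existential restriction: IKNM clashes with nothing directly, but becomes unsatisfiable because its part_of filler does. That propagation is precisely what would make the authors’ purported errors computationally detectable, if they were errors.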

In fact the transition from weakly axiomatized ontologies to strongly axiomatized ones is happening, and this is uncovering a lot of problems through reasoning. But the problems being uncovered are generally not due to unintended consequences of existential quantifications. The authors widely miss the mark on their evaluation of the problem.

But the authors do end with an excellent point:

Another problem that hindered our experiments is the unclear ontological commitment of many classes and relations in OBO ontologies, which makes it nearly impossible to reach consensus about the truth-value of many of their axioms. This involves not only ambiguities in ontological interpretation of the classes included in the ontologies but also the proliferation of relations which were poorly defined. To address this shortcoming, ontologies can rely on more expressive languages and axiom systems in which the intended semantics of the relations used are constrained, as is done for the OBO relation ontology

The only objection I have there is to point out that most OBO ontologies don’t use a proliferation of relations – the authors are referring to some of the cross-product extensions here. But point taken – some relations need better definitions, the cross-product files are of variable quality, and known issues should be documented.

If this were the thesis of the paper I would have less of a problem. However, the paper makes a stronger claim, namely that 23% of the existential restrictions are wrong and should be changed to some other logical constructor (with the implication that this is due to ambiguities in obo-format). Two (hopefully unintended) consequences of this paper are muddying the waters on the semantics of obo-format and spreading misinformation about the quality of relational statements in the OBO library.

This needs to be countered:

  • The official semantics of OBO-Format are such that every relationship tag is by default interpreted as a subclass of an existential restriction. This can be overridden in some circumstances, but in practice this is rarely done.
  • If something is not clear, ask on the obo-format list rather than basing a paper on a misunderstanding
  • Most obo-format ontology authors have sufficient understanding of the all-some semantics such that you should trust the OWL that comes out the other end.
  • If you don’t trust it, then report the problem to the ontology via their tracker.
  • If you think you’ve uncovered systematic errors in either the underlying ontology or in the translation to OWL, verify there really is an error using the appropriate mechanism (e.g. trackers) before jumping to conclusions and writing a paper falsely claiming a 23% error rate. There are in fact many problems with many ontologies, but unintended consequences of existential quantification are not among them, except in a small number of cases (e.g. CHEBI), which have yet to be shown to cause any harm but nevertheless need better documentation.
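The default translation in the first point above is mechanical, which is worth seeing concretely. Here is a toy parser for a single [Term] stanza (not the official oboformat implementation), emitting Manchester-style axiom strings, using the IKNM stanza quoted earlier in this post:

```python
# Sketch of the default obo-format translation: every "relationship:" tag
# becomes SubClassOf (rel some filler). Toy parser, not the official
# oboformat library.

def translate_stanza(stanza: str):
    """Return Manchester-style axiom strings for one [Term] stanza."""
    term_id, axioms = None, []
    for line in stanza.splitlines():
        line = line.split("!")[0].strip()   # drop trailing ! comments
        if line.startswith("id:"):
            term_id = line[3:].strip()
        elif line.startswith("is_a:"):
            axioms.append(f"{term_id} SubClassOf {line[5:].strip()}")
        elif line.startswith("relationship:"):
            rel, filler = line[len("relationship:"):].split()
            axioms.append(f"{term_id} SubClassOf {rel} some {filler}")
    return axioms

stanza = """\
id: GO:0022027
name: interkinetic nuclear migration
is_a: GO:0051647 ! nucleus localization
relationship: part_of GO:0021846 ! cell proliferation in forebrain
"""
print(translate_stanza(stanza))
# ['GO:0022027 SubClassOf GO:0051647',
#  'GO:0022027 SubClassOf part_of some GO:0021846']
```

The all-some reading falls straight out of this translation: the relationship tag on the class yields an existential restriction over every instance, which is exactly what the GO editors intended.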