OntoTip

Shadow Concepts Considered Harmful

In this post I will describe “Shadow Concepts”, and why they are problematic. In a forthcoming post I will describe strategies for mitigating these problems, and for avoiding them altogether.

Ontologists love to tease apart the world in fractal ways, considering the many different aspects of concepts and terms we take for granted. This kind of ontological analysis can be extremely helpful in extracting precise statements from ambiguous natural language or fuzzy thinking. For example, when we make statements about a drug like aspirin, do we mean the chemical entity, as represented in CHEBI, or a drug formulation consisting of both active ingredient and filler, as found in DRON? Is aspirin a singular entity, existing in chemical structure space, or is aspirin a set or class whose instances are the quadrillions of actual molecule instances existing in physical-temporal space located inside leaves of willow trees and drug packets?

Aspirin and aspirin

It can be both illuminating and fun to tease apart both everyday concepts and specialized scientific terms in this way; ‘carving ontology at the joints’ as the ontologists say. Clear thinking and unambiguous communication is undoubtedly important. However, like a lot of ontological analysis, a little moderation goes a long way. This is especially true when it comes to translating this analysis into concrete ontological products intended to be used by non-ontologists. There is a dangerous path that leads to a kind of ontological Balkanization, with concepts fragmented across different disconnected branches of ontologies, or across different ontologies.

Ontological Balkanization: The same family of concepts can find themselves spread over the ontological globe, with poor connections between them

This is prevalent in, and as far as I am aware, restricted to the life science / biomedical ontology space (and related environmental/ecological domains). I am not sure if this is due to something about our domain of knowledge, or more a matter of the practices of the groups operating in this space. If this blog post doesn’t describe your experience of ontologies, then it may be less useful for you, although I hope that it will help inoculate you against any future ontologization epidemics.

In this post I will give examples of problems caused by these shadow concepts. But first, I will start by telling an ancient parable.

The Parable of the Intrepid Climber of Mount Ontolympus

One day an intrepid climber sought to ascend to the peak of the forbidding Mount Ontolympus. However, the climber was ill-prepared, and when the temperature dropped precipitously, he found himself feeling sick, shivering and confused. On performing a temperature assay with a thermometer, he confirmed he was suffering from hypothermia. He cried out for help, comforted in the knowledge that Mount Ontolympus was home to many wandering ontologist-physicians.

On hearing his plaintive cries, an ontologist immediately came to assist. “I have hypothermia, help me” said the climber, knowing it was good to be clear and direct with ontologists. The ontologist smiled and said “you are in luck my friend, for hypothermia is a class in my ontology, where it is clearly a type of measurement. Specifically, a measurement of the temperature of a multicellular organism”. He grabbed the thermometer from the climber and crushed it beneath his feet, exclaiming “without a measurement device, there is no measurement, and thus no hypothermia instance”. The ontologist then triumphantly strode down the mountain, with a shout of “You’re welcome!”.

Seeing the obvious increasing distress of the climber, a second ontologist approached. “I hear you have hypothermia, good sir. You are in luck, for hypothermia is a class in my ontology, where it is clearly a type of information entity, specifically a ‘datum’ that is about a measurement of a temperature of a bodily core. To dispel this instance, you must destroy all hard drives on which the concretizations of this generically dependent continuant are borne, thus vanquishing the datum”. A troubled look crossed the face of the second ontologist, before he continued: “hmm, neurophysical substrates can also be the bearer of such concretizations… including the pattern of synaptic wiring in my own brain”. The second ontologist then proceeded to run towards the nearest cliff, which he threw himself from, with the final cry of “you’re welcome!” echoing throughout the valley below.

The climber was now in peril. A third ontologist approached. She informed the climber: “Sir, you have hypothermia, which is clearly a quality of decreased temperature that inheres in your bodily core. To eliminate this instance, we must increase your body temperature. Here, take this blanket and wrap yourself in it, and sup warm liquids from my flask, …”. But it was too late, the climber’s heart had stopped beating, he had succumbed to his hypothermia, and now lay dead at the third ontologist’s feet.

Upper ontologies lead to shadow concept proliferation

Like all parables, this isn’t intended to be taken literally, and to my knowledge over-ontologization has never led to physical harm beyond moderately increased blood pressure. But there is a core of truth here in how we ontologists are liable to confuse people by insisting on narrowing in on our favorite aspect of a concept, and to waste time chasing different ‘ontological shadows’ rather than focusing on the ‘core’ aspect that matters most to people. This is not to say that it’s not important to distinguish these entities – but there is a time, place, and methodology for doing this.

The Basic Formal Ontology (BFO) is a great tool for analyzing these different aspects of common concepts. It provides a set of upper level categories that subdivide the world in a way roughly similar to Aristotle’s categories. BFO allows us to take something like the class of human hearts and examine the different ontological aspects entailed by the existence of a heart. This includes the site and spatial region at which the heart resides, the disposition of a heart (to pump blood, to make a thumping noise), the realization of those dispositions as processes, the life history of a heart, the representation of a heart such as in a drawing by Vesalius, and so on.

This is illustrated in the following diagram:

The different aspects of a heart, and how those aspects are related. The heart disposition inheres in the heart, which is realized in a heart process. The heart “occupies” the heart site, the heart history “occupies” the heart spatiotemporal region, which itself spatially projects onto the heart spatial region. Other aspects such as heart information are not shown. The concept of a heart could be considered core here, with the other aspects shadowing it

BFO is a great tool for analyzing what we mean when we talk about hearts or genes or any other scientific term. Those that are philosophically inclined like to discuss the pros or cons of different upper ontology representations, and there is ongoing vigorous debate about some parts of the above diagram. But that’s not what I am interested in discussing today, and in some ways these debates are irrelevant.

The problem arises when the upper ontology goes beyond an analytic tool, and ontologists start building ontologies that follow this model too literally, creating a vast cross-product of terms that intersect scientific terms such as parts of hearts with shadow ontological aspects. The end result is a ragged lattice, sometimes Balkanized across ontologies, with conceptual drift between the components, confusing users and leading to inconsistency and maintenance tar pits.

Some specific areas where I have seen this problem manifest:

  • Geological entities, such as islands (which can be conceived of as material entities in BFO terms), and various shadows such as the area occupied by the island (an immaterial entity).
  • Ontologies which follow the OGMS model end up creating many shadow concepts for a disease D, including the disease process, the disease qua disposition, the disease diagnosis (a generically dependent continuant), a concretization of that diagnosis, and many more. For an example of what happens here, try searching CIDO for the different ‘aspects’ of COVID-19
  • Biological information macromolecules or macromolecule regions (e.g. DNA regions) can have many aspects, from the material molecular entity itself, through to ‘abstract’ sequences (generically dependent continuants), representation of those sequences in computers (also generically dependent continuants, but concretized by bits and bytes on a hard disk rather than in the mind-independent molecule itself)
  • Organismal traits or physical characteristics such as axillary temperature and their information shadows such as axillary temperature measurement datum.

Apologies for the ontological jargon in these descriptions – follow the links to get elucidation of these, but the specific jargon and its meaning isn’t really the point here (although the obfuscated nature of the terminology is a part of the problem).

Shadow concepts lead to conceptual drift and inconsistent hierarchies

Let’s say we have two branches in our ontology, one for material environmental entities such as tropical forests, and another for their immaterial shadows, e.g. tropical forest areas. Ostensibly this is to serve different use cases. An ecologist may be interested in the material entity and its causal influence. A geographer may be interested in the abstracted area as found on a map (I am skeptical about the perceived need for materializing this distinction as separate ontology terms, but I will return to this later).

Problems arise when these branches evolve separately. Let’s say that in response to a request from an ecologist we introduce a subclass ‘tropical broadleaf forest’ in the material entity hierarchy, but as this didn’t come from a geographer we leave the parallel area hierarchy untouched. Immediately we have a point of confusion – is the omission of a shadow ‘tropical broadleaf forest area’ class in the area hierarchy intentional, reflecting different use cases between ecologists and geographers, or is it unintentional, reflecting the fact that synchronizing ontology branches is usually poorly handled in ontologies? See The Open World Assumption Considered Harmful for more on this.

The problems worsen if a geographer requests a class ‘broadleaf forest area’. This seems like it’s broader than our previously introduced ‘tropical broadleaf forest’ class but in fact is broader on a different axis.

Shadow concept hierarchies have a tendency to drift and become inconsistent. Note these two hierarchies are different ontological categories, so they cannot be related by is-a links

This is all very confusing for the typical user who does not care about fine-grained philosophical upper-ontology distinctions and is just looking for terms encompassing ‘tropical forests’. In one hierarchy they get tropical broadleaf forests, in another they don’t get anything relating to broadleaf forests.

This is a small, simplified example; the actual confusion multiplies as hierarchies grow. Furthermore, other aspects of the hierarchies drift, such as textual definitions, and again it is hard to tell if these differences are intentional, reflecting the different ontological nature of the two hierarchies, or if it is just drift.

Needless to say this can cause huge maintenance headaches for ontology developers. This is something we cannot afford, given how few resources and expertise we have for building these ontologies and keeping them up to date.

There are techniques for syncing hierarchies that I will return to later, but these are not cost-free, and have downsides.

Shadow concepts that span ontologies are especially problematic

Syncing shadow concepts within an ontology maintained by a single group is hard enough; problems are vastly exacerbated when the shadows are Balkanized across different ontologies, especially when these ontologies are maintained by different groups.

One example of this is the fragmentation of concepts between OBA (an ontology of biological traits or attributes) and OBI (an ontology of biomedical investigations). OBA contains classes that are mostly compositions of some kind of attribute (temperature, height, mass, etc) and some organismal entity or process (e.g. armpit, eye, cell nucleus, apoptosis). OBA is built following the Rector Normalization Pattern using DOSDPs following the EQ pattern.

OBI contains classes representing core entities such as assays and investigations, as well as some datum shadow classes such as axillary temperature measurement datum, defined as A temperature measurement datum that is about an axilla.

Immediately we see here there is a problem from the perspective of ontology modularization. Modularization is already hard, especially when coordinated among many different ontology building groups with different objectives. But ostensibly in OBO we should have ‘orthogonal’ ontologies with clear scope. And on the one hand, we do have modularization and scope here: datums (as in datum classes, not actual data) arising from experimental investigations go in OBI, the core traits which those data are about go in OBA.

But from a maintenance and usability perspective this is really problematic. It is especially hard to synchronize the hierarchies and leverage work done by one group for another due to the fact that the terms are in different ontologies with different ontology building processes and design perspectives. Users are frequently confused. Many users take a “pick and mix” approach to finding terms, ignoring ontological fine details and collecting terms that are ‘close enough’ (see How to select and request terms from ontologies). Rather than request a term ‘axillary temperature’ from OBA they may take the OBI term, and end up with a mixture of datum shadows and core concepts. This has negative ramifications from multiple perspectives: reasoning, text mining, search, and usability. It also goes against plain common sense – things that are alike should go in the same ontology, things that are unalike go in different ones.

Case Study: The Molecular Sequence Ontology

A classic shadow concept use case is distinguishing between the molecular aspect of genomic entities (genes, gene products, regulatory regions, introns, etc) and their sequence/information aspect. This is not idle philosophizing – it lies at the heart of eternal questions of ‘what is a gene‘. It is important for giving clear answers to questions like ‘how many genes in the human genome’.

I won’t go into the nuances here, interesting as they are. The important thing to know is that the Sequence Ontology (SO) was originally formally ontologically uncommitted with regards to BFO. What this means is that neither the genomics use-case-driven developers of SO nor its users particularly cared about the distinction between molecules and information. One of the main uses for SO is typing features in GFF files, querying genomics databases, and viewing sequences in genome browsers (this probably makes it the most widely instantiated biological ontology, with trillions of SO instances, compared with the billions in GO). Users are typically OK with a gene simultaneously being an abstract piece of information carried by DNA molecules, an actual molecular region, and a representation in a computer. But this led to a few complications when working with the ontology.

To rectify this, in 2011, I proposed the creation of a ‘molecular sequence ontology‘. I am quite proud of the paper and the work my colleagues and I did on this, but I take full responsibility for the part of this paper that proposed the ‘SOM’ (Sequence Ontology of Molecules, later MSO), which I now regard as a mistake. The basic idea is captured in Table 2 of the paper.

Table 2 of the paper introducing SOM.

The basic idea was to ‘split’ SO into two ontologies that would largely shadow each other, one branch encompassing information entities (the SO would be ‘recast’ as this branch), and another branch for the molecular entities (MSO).

A number of good papers came out of this effort:

  • Intelligently Designed Ontology Alignment: A Case Study from the Sequence Ontology. Michael Sinclair, Michael Bada, and Karen Eilbeck. ISMB Bio-Ontologies track conference proceedings, 2018.
  • Efforts toward a More Consistent and Interoperable Sequence Ontology. Michael Bada and Karen Eilbeck. ICBO 2012 conference proceedings.
  • Toward a Richer Representation of Sequence Variation in the Sequence Ontology. Michael Bada and Karen Eilbeck. Annotation, Interpretation and Management of Mutations, a workshop at ECCB10.

These are definitely worth reading for understanding the motivation for MSO, the challenges for developing and synchronizing it, and improvements that were made. However, ultimately the challenges in syncing the ontologies proved greater than expected and did not stack up well against the benefits, and resources ran out before a number of the benefits could be realized.

At the Biocuration workshop in 2017 there was a birds-of-a-feather session featuring the MSO developers plus a large cross-section of the curators that use SO for annotating genomic data. The outcome was that not everyone saw the need for splitting into two ontologies, and many found the ontological motivation confusing. However, everyone saw the value of the ontology editing effort that had gone into MSO (which included many improvements orthogonal to the ontological recasting), and so the consensus was to fold back into SO any changes made in MSO that did not pertain to the fine-grained MSO/SO distinction. Unfortunately this was quite challenging as the ontologies had diverged in different ways, and not all the changes have been folded back in.

I think there is a lesson there for us all — we should make sure we have buy-in from actual users and not just ontologists before making decisions about how to shape and recast our ontologies. “Ontological perfection” can seem a lofty goal, especially for newcomers to ontology development, but this has to be balanced against practical use cases. And crucially we should be careful not to underestimate development, maintenance and synchronization costs. It’s easy to ignore these especially when building something from scratch, but just like in software engineering, maintenance eats up the lion’s share of the cost of developing an ontology.

Strategies for managing and avoiding shadow concepts

In the next post I will provide some strategies for mitigating and avoiding these problems.

Where shadow concepts are unavoidable, templating systems like DOSDPs should be used to sync the hierarchies (ideally all aspects of the shadow, including lexical information, should mirror the core term). This is necessary but not sufficient. There needs to be extra rigor and various workflow and social coordination processes in place to ensure synchrony, especially for inter-ontology shadows. There are a lot of reasoning traps the unwary can fall into.
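To make the synchronization idea concrete, here is a minimal sketch in Python (purely for illustration; it mimics the spirit of a DOSDP-style template rather than using an actual DOSDP engine, and the class labels are hypothetical) of how a shadow hierarchy can be generated mechanically from the core one, so the two cannot drift:

```python
# Illustrative sketch: derive a shadow "X area" hierarchy from a core
# hierarchy of material entities, template-style. Hypothetical labels only.

core_hierarchy = {
    # child -> parent (is_a) links among core (material entity) classes
    "tropical broadleaf forest": "tropical forest",
    "tropical forest": "forest",
}

def shadow_class(core_label: str) -> str:
    """Template for the shadow term's label, mirroring the core label."""
    return f"{core_label} area"

def generate_shadow_hierarchy(core: dict) -> dict:
    """Every core is_a link is mirrored by a shadow link, so the shadow
    hierarchy is always exactly in sync with the core hierarchy."""
    return {shadow_class(c): shadow_class(p) for c, p in core.items()}

shadow = generate_shadow_hierarchy(core_hierarchy)
assert shadow["tropical broadleaf forest area"] == "tropical forest area"
```

The key design point is that the shadow branch is a derived artifact, never hand-edited, so a request like ‘tropical broadleaf forest’ propagates to the shadow branch automatically.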

I will also review various conflation techniques including what I call the Schulzian transform. These attempt to “stitch together” fragmented classes back into intuitive concepts.

I will also review cases where I think materialization of certain classes of shadow are genuinely useful. These include things such as genes and their ‘reference’ proteins, genes and their functions, or infectious diseases like COVID-19 and the taxonomic classification of their agents like SARS-CoV-2. Nevertheless it can still be useful to have strategies to selectively conflate these.

But really these mitigation and conflation techniques should not be necessary most of the time: in most cases there is no need to create shadow hierarchies. I can’t emphasize this strongly enough. If you want to talk about a COVID diagnosis (which is undoubtedly a different ontological entity than the instance of the disease itself) then you don’t need to materialize a shadow of the whole disease hierarchy. Just use the core class! You can represent the aspect separately as ‘post-composition’. The same goes if you feel yourself wanting classes with the string “datum” appended at the end. Instead of making a semi-competing shadow ontology, work on the core ontology, and if you really really need to talk about the “X datum” rather than “X” itself (and I doubt you do), just do this via post-coordination. I will show examples in the next post, but in many ways this should just be common sense.
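To sketch what post-composition might look like in practice (a minimal illustration; the record structure and the ‘aspect’ qualifier values here are hypothetical, not drawn from any particular standard), the data record points at the core disease class and qualifies it separately, rather than minting a pre-composed shadow class:

```python
# Hypothetical post-composed record: the core ontology class plus a
# separate "aspect" qualifier, instead of a pre-composed shadow class.

postcomposed = {
    "disease": "MONDO:0100096",  # core class: COVID-19 in MONDO
    "aspect": "diagnosis",       # qualifier, e.g. diagnosis / process / datum
}

def core_disease(record: dict) -> str:
    """Queries over core disease classes work unchanged, whatever the aspect."""
    return record["disease"]

assert core_disease(postcomposed) == "MONDO:0100096"
```

Because the aspect lives in the data rather than the ontology, the disease hierarchy never needs a parallel ‘diagnosis’ branch.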


Using ontologies within data models and standards

Ontologies are ways of organizing collections of concepts that can be used to annotate data. Ontologies are complemented by schemas, data models, reporting formats and standards, which describe how data should be structured.

Some examples:

As an example of how ontologies are used, the following figure illustrates the schema used for the Data Coordination Center for the ENCODE project (taken from Malladi et al https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4360730/)

Figure 2 from Ontology application and use at the ENCODE DCC (PMC4360730)

You can get a sense for which ontologies are used for which standards on the FAIRsharing site – for example for Uberon:

FAIRsharing usage graph for Uberon (link)

Despite this ubiquity, most ontologies are bound to standards and schemas in a very loose way. Usually there is accompanying free text instructing humans to do something like “use branch B of ontology O for filling in this field”, but these instructions are open to interpretation and are not amenable to automated computational checking, or to driving data entry forms.

Furthermore, many of these ontologies are huge, and selecting the right term can be a challenge for whoever is providing the metadata – a scientist submitting the data, a data wrangler providing first-pass curation, or a biocurator at a knowledge base. There is a missed opportunity to provide selected subsets of the ontology that are relevant to a particular use case. For example, when providing environmental context for a terrestrial soil sample it’s not necessary to use all three thousand terms in ENVO; terms for marine biomes, or terms that describe the insides of buildings or humans, are not relevant.

And in fact many schema languages like JSON-Schema lack the ability to bind ontologies to field values. There is a demonstrated need for this, as seen in this discussion on the JSON-Schema discussion group. Many groups (Human Cell Atlas, AIRR) have developed their own bespoke extensions to JSON schema or related formalisms like Open API that allow the schema designer to specify a query for ontology terms that is then executed against a service like the Ontology Lookup Service. Here is an example from AIRR:

        species:
            $ref: '#/Ontology'
            nullable: false
            description: Binomial designation of subject's species
            title: Organism
            example:
                id: NCBITAXON:9606
                label: Homo sapiens
            x-airr:
                miairr: essential
                adc-query-support: true
                set: 1
                subset: subject
                name: Organism
                format: ontology
                ontology:
                    draft: false
                    top_node:
                        id: NCBITAXON:7776
                        label: Gnathostomata

However, none of these JSON-Schema extensions are official, and from the discussion it seems this is unlikely to happen soon. JSON-Schema does offer enums, a common construct in data modeling, which allows field values to be constrained to a simple flat dropdown of strings. Standard enums are not FAIR, because they offer no way to map these strings to standard terms from ontologies.

The clinical informatics community has dealt with the problem of combining ontologies and terminologies with data models for some time, and after various iterations of HL7, the FHIR (Fast Healthcare Interoperability Resources) provides a way to do this via Value Sets. However, the FHIR solution is tightly bound with the FHIR data model which is not always appropriate for modeling non-healthcare data.

Using Ontologies with LinkML

LinkML (Linked Data Modeling Language) is a polymorphic pluralistic data modeling language designed to work in concert with and yet extend and add semantics to existing data modeling solutions such as JSON-Schema and SQL DDL. LinkML provides a simple standard way to describe schemas, data models, reporting standards, data dictionaries, and to optionally adorn these with semantics via IRIs from standard vocabularies and ontologies, ranging from schema.org through to rich OBO ontologies.

From the outset, LinkML has supported the ability to provide “annotated enums” (also known as Value Sets), extending the semantics of enums of JSON-Schema, SQL, and object oriented languages.

For example, the following enum illustrates a set of hardcoded Permissible Values mapped to terms from the GA4GH pedigree standard kinship ontology:

enums:
  FamilialRelationshipType:
    permissible_values:
      SIBLING_OF:
        description: A family relationship where the two members have a parent in common
        meaning: kin:KIN_007
      PARENT_OF:
        description: A family relationship between offspring and their parent
        meaning: kin:KIN_003
      CHILD_OF:
        description: inverse of the PARENT_OF relationship
        meaning: kin:KIN_002

Each permissible value is optionally annotated with a “meaning” tag, which has a CURIE denoting a term from a vocabulary or external resource. Each permissible value can also be adorned with a description, as well as other metadata.

The use of ontology mappings here gives us an interoperability hook – two different schemas can interoperate via the use of shared standard terms, even if they want to present strings that are familiar to a local community.
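A rough sketch of what that interoperability hook buys you (the local value strings below are hypothetical; the KIN CURIEs are those used in the kinship enum): two schemas with different local labels can be joined on the shared ‘meaning’ CURIE.

```python
# Two hypothetical schemas use different local strings for the same
# relationship, but both map them to shared kinship-ontology CURIEs.

schema_a_meanings = {"SIBLING_OF": "kin:KIN_007", "PARENT_OF": "kin:KIN_003"}
schema_b_meanings = {"sib": "kin:KIN_007", "parent": "kin:KIN_003"}

def translate(value: str, source: dict, target: dict) -> str:
    """Map a permissible value from one schema to the equivalent value
    in another schema via the shared 'meaning' CURIE."""
    curie = source[value]
    for local, meaning in target.items():
        if meaning == curie:
            return local
    raise KeyError(f"no target value with meaning {curie}")

assert translate("SIBLING_OF", schema_a_meanings, schema_b_meanings) == "sib"
```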

This all works well for relatively small value sets. But what happens if I want to have a field in my schema whose value can be populated by any term from the Eukaryote branch of NCBI Taxonomy? It’s not really practical to include a ginormous list of IDs and terms in a schema – especially if it’s liable to change.

Dynamic Enums

Starting with the new LinkML 1.3 model release today, it is possible to describe dynamic value sets in your schema. Rather than hardcoding a list of terms, you instead provide a declarative specification of how to construct the value set, allowing the binding to concrete terms to happen later.

Let’s say you want a field in your standard to be populated with any of the subtypes of “neuron” in the cell ontology, you could do it like this:

enums:
  NeuronTypeEnum:
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000540 ## neuron
      include_self: false
      relationship_types:
        - rdfs:subClassOf

This value set is defined in terms of an ontology graph query: any term that is reachable from the node representing the base neuron class, walking down subClassOf edges, is included.
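The expansion a tool would perform can be sketched in a few lines of Python (a toy edge list with made-up IDs; real tooling would query an ontology service or local ontology file instead):

```python
# Toy expansion of a reachable_from query: collect everything reachable
# from the source nodes by walking subClassOf edges downwards.
# The edge list is a tiny made-up fragment, not real CL identifiers.

edges = {
    # child -> parent (rdfs:subClassOf)
    "CL:motor_neuron": "CL:neuron",
    "CL:interneuron": "CL:neuron",
    "CL:spinal_interneuron": "CL:interneuron",
}

def reachable_from(source_nodes, edges, include_self=False):
    """All descendants of the source nodes under the given edges."""
    # invert to parent -> children for downward traversal
    children = {}
    for child, parent in edges.items():
        children.setdefault(parent, set()).add(child)
    result, stack = set(), list(source_nodes)
    while stack:
        node = stack.pop()
        for c in children.get(node, ()):
            if c not in result:
                result.add(c)
                stack.append(c)
    if include_self:
        result.update(source_nodes)
    return result

assert reachable_from(["CL:neuron"], edges) == {
    "CL:motor_neuron", "CL:interneuron", "CL:spinal_interneuron"}
```

With `include_self: false`, as in the schema above, the base neuron class itself is excluded from the value set.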

Dynamic enums can also be composed together, either by nesting boolean expressions to arbitrary depth, or by naming sub-patterns and reusing them.

A more complex example involves the use of two enums: the first a general one for any kind of disease, and a second that both extends and restricts it, extending it to include cancer terms from a different vocabulary, and restricting it to exclude non-human animal diseases:

enums:
  Disease:
    reachable_from:
      source_ontology: obo:mondo
      source_nodes:
        - MONDO:0000001 ## disease or disorder
      is_direct: false
      relationship_types:
        - rdfs:subClassOf

  HumanDisease:
    description: Extends the Disease value set, including NCIT neoplasms, excluding non-human diseases
    inherits:
      - Disease
    include:
      - reachable_from:
          source_ontology: obo:ncit
          source_nodes:
            - NCIT:C3262
    minus:
      - reachable_from:
          source_ontology: obo:mondo
          source_nodes:
            - MONDO:0005583 ## non-human animal disease
          relationship_types:
            - rdfs:subClassOf

Tooling support

There is already tooling support for LinkML’s static enums. When LinkML is used to generate JSON-Schema or SQL DDL these are mapped to the corresponding constructs (but with loss of metadata).

LinkML-native tools such as DataHarmonizer support enums as hierarchical drop-downs in data entry.

For the newer dynamic enums, the current focus is on standardizing how to specify them, with tooling support expected to follow soon.

This is an area where it is difficult to have a one-size-fits-all solution, due to the variety of use cases. Consider the task of validating data against a schema with dynamic enums. In some cases, you may want to do this at run time, with a dynamic query against a remote ontology server. In other cases you might prefer to avoid a network dependency at validation time, instead opting for either local ontology lookup, or a pre-materialization of the value set. One way to do the latter is to perform the ontology lookups at the time of compiling to JSON-Schema.
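The pre-materialization route might look something like this sketch (the expanded value set here is illustrative; in a real implementation it would be the result of an ontology lookup performed at compile time):

```python
import json

# Illustrative pre-expanded value set (in reality, the result of an
# ontology lookup performed while compiling the LinkML schema).
neuron_terms = {
    "CL:0000100": "motor neuron",
    "CL:0000099": "interneuron",
}

def compile_to_json_schema_enum(terms: dict) -> dict:
    """Flatten a dynamic enum into a static JSON-Schema enum of CURIEs,
    trading freshness for validation with no network dependency."""
    return {"type": "string", "enum": sorted(terms)}

schema_fragment = compile_to_json_schema_enum(neuron_terms)
print(json.dumps(schema_fragment))
```

The trade-off is that the compiled schema must be regenerated when the source ontology changes.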

For the ontology lookups there are a variety of options. Bioportal includes almost a thousand ontologies covering most terminological uses in the life sciences, with ontologies in the OntoPortal alliance covering other domains like agriculture, ecology, and materials science. OLS also hosts a large number of ontologies in the life sciences. But these ontology browsers don’t cover all entities, such as all proteins in UniProt. And for some use cases it may be necessary to use bespoke vocabularies. Tooling support for dynamic enums should cover all these scenarios.

This is something that could be supplied using the new Ontology Access Kit (OAK) library, which provides a unifying layer over multiple different ontology sources, including hosted ontology portals like Bioportal, OLS, and Ontobee, local files in a variety of formats, and local and remote SPARQL endpoints, including biological databases as well as Wikidata and Linked Open Vocabularies.

Trying it out

Head over to the LinkML documentation for information and tutorials on how to use LinkML and LinkML enums. Also check out schemasheets, an easy way to author schemas/standards, including the use of enums, all using Google Sheets (or Excel, if you must).

Whether you opt to use LinkML or not, if you are involved in the creation of standards or data models, or if you are involved in the provision of ontologies that are used in these standards, I hope this post is useful for you in thinking about how to structure things in a way that makes these standards more rigorous, easier to use by scientists, and easier to systematically validate and automatically drive interfaces!


Ontotip: Avoid the single child anti-pattern

This article is part of the OntoTips series.

A common structure found in many ontologies is the single child pattern. I consider this an anti-pattern, to be avoided.

The most common form is with is_a children (i.e subClassOf between two named classes), but the anti-pattern also applies to other relationship types. We can formalize the single child subclass pattern as:

  • C1 direct SubClassOf P
  • NOT exists some C2, such that C2 is a direct SubClassOf P, and C2 != C1

Depicted as:

C1 is the only child of P

One reason this is an anti-pattern is that it is inherently incomplete: there must be instances of P that are not instances of C1 (otherwise why have two classes – see the reflexive subclass anti-pattern). Following a principle of reasonable completeness (see open world post) we should include sibling terms where appropriate.
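The pattern is easy to check for mechanically; here is a minimal sketch (toy hierarchy with hypothetical labels) that flags every parent with exactly one direct is_a child:

```python
# Detect the single-child anti-pattern: parents with exactly one
# direct is_a child. Toy hierarchy, hypothetical labels.

is_a = {
    # child -> parent
    "mild flu": "flu",
    "flu": "respiratory disease",
    "common cold": "respiratory disease",
}

def single_child_parents(is_a: dict) -> set:
    """Return parents that have exactly one direct subclass."""
    children = {}
    for child, parent in is_a.items():
        children.setdefault(parent, []).append(child)
    return {p for p, cs in children.items() if len(cs) == 1}

assert single_child_parents(is_a) == {"flu"}
```

A check like this could run as part of an ontology release pipeline, reporting candidates for either adding siblings or merging the lone child into its parent.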

Here is a concrete example from a fictional ontology:

A subset of a fictional ontology, where only one subtype of flu (mild flu) is populated

Here there is a single specialization of a disease term, based on severity.

Another example (adapted from an existing ontology):

A fictional example, where a chemical assay is subtyped by a property of that chemical pool (its concentration)

Here there is a specialization of the assay term based on a property of the pool of iron (its concentration).

A different example (adapted from an existing ontology):

An adapted example, where a specific assay has a single subtype, differentiated by location of sampling

This kind of structure is not uncommon in OBO ontologies. And there is a reasonable defense: we have limited ontology editing resources, and many terms are added on request. Curators are free to request a more specific term if they feel it is necessary for annotating (e.g. a disease that has as phenotype mild flu), but they may have no need for the implicit sibling terms. And ontology developers see no need to do additional work they are not asked to do.

However, this leads to lopsided ontologies that are often confusing for people not deeply immersed in the development of these ontologies. It is hard to tell if omissions are intentional or unintentional. And the practice of instantiating single children has bad downstream effects on annotation, something we have frequently observed across multiple ontologies.

Consider the flu example above. A new annotator may want to annotate a disease that has a severe flu phenotype. They may make an implicit assumption that choosing the parent term ‘flu’ would communicate ‘severe flu’; if it were mild, they would have selected ‘mild flu’. But this is not the explicit assertion they are making – they are making a closed world assumption that doesn’t hold for the logic of the ontology. Some of this can be obviated with training, and by ensuring curators request specific sibling terms rather than trying to let the parent do the work. But many single-child cases are in fact more nuanced than the flu example.

Instead, it is better to take a more prospective approach to ontology development: try to anticipate in advance terms that may be required, and populate them in a balanced fashion – this will result in more balanced annotations. It is much easier to do this if you follow OWL axiomatization and have a formal design pattern system such as DOSDPs. In fact you can use such a system to automate detection of single-child patterns and imbalances.

While it is trivial to detect single is-a children using a SPARQL query encoding the pattern above, it won’t capture the more nuanced cases of single children by a given axis of classification.
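The gross structural check is simple enough to sketch in a few lines of plain Python over a toy child→parent edge list (made-up terms; a SPARQL query encoding the same pattern would be the real-world equivalent):

```python
from collections import defaultdict

def single_children(subclass_edges):
    """Given (child, parent) is-a edges, return {parent: child} for
    every parent with exactly one asserted direct child."""
    children = defaultdict(set)
    for child, parent in subclass_edges:
        children[parent].add(child)
    return {p: next(iter(cs)) for p, cs in children.items() if len(cs) == 1}

# Toy hierarchy mirroring the flu example above (made-up terms)
edges = [
    ("mild flu", "flu"),
    ("flu", "respiratory disease"),
    ("pneumonia", "respiratory disease"),
]
print(single_children(edges))  # {'flu': 'mild flu'}
```

As the rest of this section argues, a purely structural check like this misses single-children-by-axis, which requires knowledge of the axis of classification (e.g. from design pattern bindings).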

Consider this made-up ontology structure, where we have a parent class with only two subclasses explicitly populated:


made-up example, the class ‘animal’ has only two child classes populated, one by a taxonomic axis (vertebrate) another by a property of the animal (its edibility)

In this particular example, both are also single children via an axis of classification. While on a gross structural level each of the lower terms has a sibling, each sibling is clearly classified differently: the first along classical taxonomic/evolutionary descent lines, the second by a different property of the animal.

The above example is made-up and would strike most people as bad design (even if strictly logically coherent). Where is the concept of inedible animals, where are the invertebrates (and indeed edible and inedible vertebrates and invertebrates)?

But in fact this antipattern plagues most OBO ontologies. These are harder to spot, especially if the ontology is unaxiomatized.

For example:

iron assay with two child terms along different axes

Structurally this doesn’t look like a single-child anti-pattern, but it is in fact an example of a single-child-by-axis pattern. And if there are no subclasses, this is an instance of the ragged lattice pattern, which I will cover in a future post.

While these can’t be detected by straightforward SPARQL queries, if you use a system such as DOSDPs you can use this to analyze your ontology for these structures, and proactively guard against them. 

While the above examples all focus on subClassOf/is-a relations, the same guidelines apply to other edge labels. For example, if an anatomy ontology only listed a single part of the head (such as the mouth):

An incomplete ontology with only one part child for head

Most people would consider this poor design. While it is of course unreasonable to expect ontologies to be complete, the reasonable completeness principle should apply, and if for some reason this is unattainable, at the very least the incompleteness needs to be clearly documented.

In closing, as ontology developers it can be tempting to ignore these single child cases – we have limited resources, and must balance this with being able to provide users with terms they request, which may lead to spottiness and incompleteness. But ignoring these just leads to more work downstream, and in some cases it can lead to incomplete annotation. So avoid single is-a children!


Debugging Ontologies using OWL Reasoning, Part 3: robot explain

This is the 3rd part in a series. See part two.

In the first part of this series, I covered the use of disjointness axioms to make it easier to detect logical errors in your ontologies, and how you could use robot reason as part of an ontology release workflow to avoid accidentally releasing incoherent ontologies.

In the second part I covered unintentional inference of equivalence axioms – something that is not inherently incoherent, yet is usually unintended, and how to configure robot to catch these.

In both cases, the standard operating procedure was:

  • Detect the incoherency using robot
  • Diagnose the incoherency using the “explain” feature in Protege
  • Repair the problem by removing or changing the offending axioms, either in your ontology or, if you are unlucky, upstream, in which case you need to coordinate with the developers of the upstream ontology

In practice, repairing these issues can be very hard. This is compounded if the ontology uses complex hard-to-mentally-reason-over OWL axioms involving deep nesting and unusual features, or if the ontology has ad-hoc axioms not conforming to design patterns. Sometimes even experienced ontology developers can be confounded by long complex chains of axioms in explanations.

But never fear! Help is at hand, there are many in the OBO community who can help! I always recommend making an issue in GitHub as soon as you detect an incoherency. However, you want to avoid having other people do the duplicative work of diagnosing the incoherency. They may need to clone your repo, fire up Protege, wait for the reasoner to sync, etc. You can help people help you by providing as much information up-front as possible.

Previously my recommendation was to paste a screenshot of the Protege explanation in the ticket. This helps a lot as often I can look at one of these and immediately tell what the problem is and how to fix it.

But this was highly imperfect. Screenshots are not searchable by the GitHub search interface, they are not accessible, and the individual classes in the screenshot are not hyperlinked.

A relatively new feature of robot is the explain command, which allows you to generate explanations without firing up Protege. Furthermore, you can generate explanations in markdown format, and if you paste this markdown directly into a ticket it will render beautifully, with all terms clickable!

A recent example was debugging an issue related to fog in ENVO. As someone who lives in the Bay Area, I have a lot of familiarity with fog.

The explanation is rendered as nested lists:

Both the relations (object properties) and classes are hyperlinked, so if you want to find out more about rime just click on it.

In this case the issue is caused by the use of ‘results in formation of’ where the subject is a material entity, whereas it is intended for processes. This was an example of a “cryptic incoherency”. It went undetected because the complete set of RO axioms was not imported into ENVO (I will cover imports and their challenges in a future post).

The robot explain command is quite flexible, as can be seen from the online help. I usually set it to report all incoherencies (unsatisfiable classes plus inconsistencies). Sometimes if you have an unsatisfiable class high up in the hierarchy (or high up in the existential dependency graph) then all subclasses/dependent classes will be unsatisfiable. In these cases it can help to home in on the root cause, and the “mode” option can help here.


Edge properties, part 2: singleton property pattern (and why it doesn’t work)

In the first post in this series on edge properties, I outlined the common reification pattern for placing information on edges in RDF and knowledge graphs, and how this plays nicely with RDFStar (formerly RDF*).

There are alternatives to reification/RDF* for placing information on edges within the context of RDF. In this post I will deal with one of them, the Singleton Property Pattern (SPP), described in “Don’t Like RDF Reification? Making Statements about Statements Using Singleton Property” (Nguyen et al, 2015; PMC4350149).

The idea here is to mint a new property URI for every statement we wish to talk about. Given a triple S P O, we would rewrite it as S new(P) O, and add a triple new(P) singletonPropertyOf P.
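The rewrite can be sketched over triples-as-tuples in Python (toy code; URIs abbreviated as CURIE-like strings, and in real use the caller would need to supply a globally unique counter):

```python
def sppize(s, p, o, meta, n=1):
    """Rewrite triple (s, p, o) using the singleton property pattern:
    mint a fresh property, link it back to p with singletonPropertyOf,
    and hang the edge metadata off the fresh property."""
    sp = f"{p}_{n:03d}"  # e.g. :interacts_with_001
    triples = [(s, sp, o), (sp, "singletonPropertyOf", p)]
    triples += [(sp, key, value) for key, value in meta.items()]
    return triples

triples = sppize(":P1", ":interacts_with", ":P2",
                 {":supported_by": ":pmid123"})
```

This yields exactly the three triples listed below.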

So given the example in the previous post:

Using SPP we would have the following 3 triples:

  • :P1 :interacts_with_001 :P2
  • :interacts_with_001 singletonPropertyOf :interacts_with
  • :interacts_with_001 :supported_by :pmid123

To find all interacts-with associations, we could write a SPARQL query such as:

SELECT ?x ?y WHERE { ?x ?p ?y . ?p singletonPropertyOf :interacts_with }

Some of the proposed advantages of this scheme are discussed in the paper.

A variant of the SPP makes the new property a subPropertyOf the original property. I will call this SPP(sub). This has the advantage that the original statement is still entailed.

  • :P1 :interacts_with_001 :P2
  • :interacts_with_001 rdfs:subPropertyOf :interacts_with
  • :interacts_with_001 :supported_by :pmid123

Then if we assume RDFS entailment, the query becomes simpler:

SELECT ?x ?y WHERE { ?x :interacts_with ?y }

This is because the direct interacts_with triple is entailed due to the subPropertyOf axiom.
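That entailment is just the standard RDFS subproperty rule (rdfs7): if p rdfs:subPropertyOf q and (s, p, o) holds, then (s, q, o) holds. A naive single-pass sketch over toy tuple-triples (a real store would also compute the transitive closure over subPropertyOf):

```python
def materialize_subproperties(triples, subprop):
    """Apply RDFS rule rdfs7 once: for each (s, p, o) where
    p rdfs:subPropertyOf q, also entail (s, q, o)."""
    entailed = set(triples)
    for s, p, o in triples:
        for q in subprop.get(p, ()):
            entailed.add((s, q, o))
    return entailed

triples = {(":P1", ":interacts_with_001", ":P2")}
subprop = {":interacts_with_001": [":interacts_with"]}
graph = materialize_subproperties(triples, subprop)
assert (":P1", ":interacts_with", ":P2") in graph  # direct triple entailed
```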

In the discussion section of the original paper the authors state that the pattern is compatible with OWL, but don’t provide examples or elucidate what is meant by compatible. There is a website that includes an OWL example.

Unfortunately, problems arise if we want to use this same pattern with other axiom types, beyond OWL Object Property Assertions; for example, SubClassOf, EquivalentClasses, SameAs. This is true for both the original SPP and the subPropertyOf variant form.

These problems arise when working in OWL2-DL. Use of OWL2-Full may eliminate these problems, but most OWL tooling and stacks are built on OWL2-DL, so it is important to be aware of the consequences.

OWL-DL is incompatible with the SPP

Unfortunately the SPP pattern only works if the property used in the statement is an actual OWL property.

In the official OWL2 mapping to RDF, predicates such as owl:equivalentClass, owl:sameAs, and rdfs:subClassOf are syntax for constructing an OWL axiom of the corresponding type (respectively EquivalentClasses, SameIndividual, and SubClassOf). These axiom types have no corresponding properties at the OWL-DL level.

For example, consider the case where we want to make an equivalence statement between two classes, and then talk about that statement using the SPP pattern. 

We do this by making a fresh property (here, arbitrarily called equivalentClass-1) that is the SPP form of owl:equivalentClass. For comparison purposes, we make a normal equivalence axiom between P1 and P2, and we also try an equivalence axiom between P3 and P4 using the SPP(sub) pattern: 

PREFIX : <http://example.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

:P1 a owl:Class .
:P2 a owl:Class .
:P3 a owl:Class .
:P4 a owl:Class .

:P1 owl:equivalentClass :P2 .
:P3 :equivalentClass-1 :P4 .
:equivalentClass-1 rdfs:subPropertyOf owl:equivalentClass .

Note that this does not have the intended entailments. We can see this by doing a DL query in Protege. Here we use HermiT but other DL reasoners exhibit the same features.

For the normal equivalence axiom, we get expected entailments (namely that P2 is equivalent to P1):

However, we do not get the intended entailment that P3 is equivalent to P4

We can get more of a clue as to what is going on by converting the above turtle to OWL functional notation, which makes the DL assertions more explicit:

Prefix(:=<http://example.org/>)
Prefix(owl:=<http://www.w3.org/2002/07/owl#>)
Prefix(rdf:=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
Prefix(xml:=<http://www.w3.org/XML/1998/namespace>)
Prefix(xsd:=<http://www.w3.org/2001/XMLSchema#>)
Prefix(rdfs:=<http://www.w3.org/2000/01/rdf-schema#>)


Ontology(
Declaration(Class(:P1))
Declaration(Class(:P2))
Declaration(Class(:P3))
Declaration(Class(:P4))
Declaration(AnnotationProperty(:equivalentClass-1))
Declaration(AnnotationProperty(owl:equivalentClass))

SubAnnotationPropertyOf(:equivalentClass-1 owl:equivalentClass)
EquivalentClasses(:P1 :P2)
AnnotationAssertion(:equivalentClass-1 :P3 :P4)
)

Note that the SPP-ized equivalence axiom has been translated into an OWL annotation assertion, which has no logical semantics. However, even if we force the SPP property to be an object property, we still lack the required entailments. In both cases, we induce owl:equivalentClass to be an annotation property, when in fact it is not a property at all in OWL-DL.

The SPP(sub) pattern may provide the intended entailments when using OWL-Full, but this remains to be tested.

Even if this is the case, for pragmatic purposes most of the reasoners in use in the life science ontologies realm are OWL2-DL or a sub-profile like EL++ or RL (e.g. Elk, HermiT).

Use of SPP with plain RDF

If you are only concerned with plain RDF and not OWL, then the above may not be a concern for you. If SPPs work for your use case, go ahead. But bear in mind that this pattern is not widely used, and in addition to the OWL issues, there may be other reasons to avoid it – for example, proliferating the number of properties in your graph may confuse humans and break certain built-in assumptions made by some tools. Overall my recommendation would be to avoid the SPP regardless of your use case.

It’s unfortunate that something as basic as placing information on an edge leads to such a proliferation of de-facto standards in the RDF world. I think this is one reason why graph databases such as Neo4J have the edge – they make this much easier, and they don’t force poor users to mentally reason over confusing combinations of W3C specifications and understand esoteric things such as the difference between OWL-DL and OWL-Full.

Hopefully RDFStar will solve some of these issues.


How to select and request terms from ontologies

Background

Ontologies, knowledge models, and other kinds of standards are generally not static artefacts. They are created to serve a community, which likely includes you, and they should respond dynamically to serve that community, where resources allow.

The content of many ontologies in OBO, such as the Gene Ontology, is driven by their respective curator communities. Ontology editors make terms and whole ontology branches prospectively in anticipation of needs, but they also make terms and changes in response to curator needs. New terms can also be requested by data modelers, data engineers, and data scientists, for example, to map categorical data in a dataset in order to allow harmonization and cleaning of data. In many cases the terms may already exist, but they may be hard to find, or spread across a range of ontologies, which can be confusing, especially if you are not familiar with the area.

I wrote this guide primarily with the data engineer audience in mind, as this community may be less familiar with norms and tacit knowledge around the ontology development lifecycle. However, much of what I say is applicable to curators as well. I also wrote this document originally for members of my group, who are expected to contribute back to OBO and the work of ontology developers. Some of the recommendations may therefore seem a little onerous in places, but they should still hopefully be useful and adaptable to a broader audience. And in all cases, for open community ontologies, it is worth bearing in mind the maxim that ‘you get out what you put in’.

User stories

  • As a curator (e.g BgeeDb), I want a new anatomy term, to curate expression for a gene
  • As a data scientist, I need terms describing oxygen-requirement traits, so I can combine tabular microbial trait data to predict traits from other features
    • Alternate scenario: building a microbial Knowledge Graph (see this issue)
  • As a database developer, I need terms for sequence variants, so I can map categorical values in my database making it FAIR
  • As a knowledge graph builder, I need relations from the Relation Ontology or Biolink, so I can standardize edge labels in my graph
  • As a GO ontology developer, I need terms from the cell ontology, so I can provide logical defining axioms for terms in the cell differentiation branch
  • As a microbiome scientist, I need terms from ENVO/PO, so I can fill in MIxS-compliant environmental fields when I submit my sample data, making it FAIR
  • As an environmental genomics standards provider, I need terms from ENVO, so I can map enumerated values/dropdowns to an ontology when developing the MIxS standard
  • As a data modeler / standards provider, I need SO terms for genomic feature types, to define a value set for a genomics exchange format (e.g. GFF)
  • As a schema developer, I need terms for describing properties of sequence assemblies, e.g. number of aligned reads, N50, in order to make my sequencing schema FAIR

These scenarios encompass a range of different kinds of person with varying levels of expertise and commitment. The primary audience for this document is members of my group and the projects we are involved with (GO, Monarch, NMDC, Translator), but many of the recommendations will apply more broadly. However, we would not expect the average scientist who is submitting a dataset to engage at the same level through GitHub etc. (see the end of the document for discussion of approaches for making the overall process easier).

20ish simple rules for selecting and requesting ontology terms

Be a good open science citizen

The work you are doing is part of a larger open science project, and you should have a community-minded attitude.

When you request terms from ontologies and provide information to help, you should be microcredited, e.g. your ORCID will be associated with the term. Remember that many efforts are voluntary or unfunded, and people are not necessarily paid to help you. Provide help where you can, provide context when making requests, and share any background explicit or tacit knowledge that may help.

Always be respectful and appreciative when interacting with providers of terms or other curators. Follow codes of conduct. 

Use the appropriate ontology or standard: avoid pick-and-mix

Depending on the context of your project, there may be mandated or recommended standards or ontologies. These may be explicit or implicit.

If you are performing curation, it is likely that the ontology you use is fixed by your curation best practices and even your tools. For example, GO curators (obviously) use GO. But for other purposes it may not be obvious which ontology to use (and even with GO, curators have a choice of ontologies for providing additional context as extensions or in GO-CAMs). There are a large number of them, with confusing overlaps, and lots of tacit community knowledge that may not be immediately available to you.

Some general guidelines:

  • Look in the appropriate place, depending on what kind of term you need
    • If you need a classification term or a descriptor, then use an ontology
    • If you need something like a gene or a variant “term” then an ontology may not be appropriate, use the appropriate database instead, with caveats
    • If you need a property to describe a piece of data, then you may need to look in existing semantics schemas, e.g. schemas encoded in RDFS, a shape language such as ShEx, or LinkML
  • When looking for an ontology term, favor OBO over non-OBO resources
    • Sometimes better coverage is only available outside OBO – e.g. EDAM has a lot more terms for describing bioinformatics software artefacts, and EDAM is not in OBO. But it is still good to engage owners of the appropriate OBO ontology
    • When a non-OBO ontology is selected, use the OBO principles and guidelines to help evaluation – e.g is the ontology open? Does it follow good identifier lifecycle management?
    • The OLS uses a selection of ontologies that is broader than OBO but narrower than what is in Bioportal. In my experience the non-OBO ontologies they include are quite pragmatic choices in many situations, e.g. EDAM.
  • Even within OBO, there may still be confusion as to which ontologies to use, especially when many seem to have overlapping concepts, and scope may be poorly defined.
    • An example is the term used to describe an organism’s core metabolism for our microbial knowledge graph KG-Microbe, with multiple OBO contenders.
    • Always consult http://obofoundry.org to glean information about the ontology. This is always the canonical unbiased source, and includes curated up-to-date metadata
    • A crucial piece of metadata that is in OBO is the ontology status – you must avoid using ontologies that are obsoleted, and you should avoid using ontologies that are marked inactive
    • Look at the ‘usages’ field in OBO. Has the ontology been used for similar purposes as what you intend? If the ontology has no usages, this is a worrying sign the ontology was made for ontologists rather than practical data annotators such as yourself (but note that some ontologies may be behind in curating their usages into OBO)
    • Look at the scope of the ontology, as defined on the OBO page. Is it well defined and clear? If not, consider avoiding. Is your term in scope for this ontology? If not then don’t use terms from the ontology just because the labels match.
    • Is the ontology an application ontology, i.e. an ontology that is not intended to be a reference for terms within a domain? If so it may not be fit for your intended use.
    • Consult others if in doubt. Many people in the group or in our funded projects are involved with specific ontologies.
    • You should be on the OBO slack, this is a good place to get advice.
    • Favor ontologies we are actively involved with or that follow similar data models and principles
    • Favor more active ontologies. OBO marks inactive ontologies with metadata tags that are clearly displayed, but you should still check
      • Is the github project active?
      • Are there many tickets that are never answered?
      • If you suspect an ontology is not active but it is not marked as such, be a good citizen and raise this on the OBO tracker
    • Use precedence – see what has been done previously in similar projects
    • We are actively working on projects like OBO Dashboard and on improving OBO metadata to help ontology selection
  • For any candidate term
    • Is it obsoleted? If so avoid, but look at the metadata for replacements
    • Does it have a definition? A core OBO principle is that reasonable attempts must be made to define terms in an ontology
    • Is there a taxonomic scope? Always use the appropriate taxonomic scope. If an Uberon term is restricted to vertebrates, it is valid to use for humans. But if an ontology or term is designed for use with mouse, it may not be valid to apply its terms to humans
    • Have others used the term?
    • If you have formal ontology training, avoid over-ontologizing in your thought processes for selection. See for example the section below on shadow terms.
    • Avoid terms that seem over-ontologized; e.g. that have strange labels a domain scientist would not understand
  • If you are looking for terms to categorize nodes or edges in a Knowledge Graph:
    • For most of our projects, KGs should conform to the biolink-model, so this is the appropriate place to search
    • Note that biolink still leverages OBO and standard bioinformatics databases for the nodes themselves; biolink classes and predicates are used for the node categories and edge labels
  • For environmental samples use GSC MIxS terms for column headers
    • Use ENVO for describing the environment

Figure: The OBO site provides up to date metadata for its ontologies. An example of an ontology marked deprecated, with the suggested replacement. Note in this case this ontology was not deprecated due to quality issues, instead the developers worked with a different ontology to incorporate their work, and provided new IDs for all their existing classes.

The biolink model will serve as a canonical guide for what kinds of IDs should be used for any kind of entity. The SOP is to find the category of relevance in biolink, and then examine the id_prefixes field. This indicates the resources that provide identifiers that are valid to use for that entity type, in priority order.

For example, for BiologicalProcess you will see on the page and in the yaml

    id_prefixes:
      - GO
      - REACT
      - MetaCyc
      - KEGG.MODULE ## M number

Figure: portion of Biolink page for the data class BiologicalProcess. The favored ID prefixes are shown

This means that GO is always our favored ontology / ID space for representing a biological process, followed by Reactome, then MetaCyc, then KEGG. Of course, GO and Reactome serve different purposes, with Reactome pathway IDs classified using GO IDs. If you disagree with this ordering you can make a PR on biolink (or you can make a project-specific extension/contraction of biolink).
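The prefix-priority SOP can be sketched as a small Python helper (a hypothetical function, not part of biolink tooling, using illustrative CURIEs; it simply ranks CURIEs by the position of their prefix in id_prefixes):

```python
def preferred_curie(curies, id_prefixes):
    """Pick the CURIE whose prefix appears earliest in the biolink
    id_prefixes list; unknown prefixes rank last."""
    def rank(curie):
        prefix = curie.split(":", 1)[0]
        return id_prefixes.index(prefix) if prefix in id_prefixes else len(id_prefixes)
    return min(curies, key=rank)

# id_prefixes for BiologicalProcess, as shown above
id_prefixes = ["GO", "REACT", "MetaCyc", "KEGG.MODULE"]
print(preferred_curie(["REACT:R-HSA-1640170", "GO:0007049"], id_prefixes))
# GO:0007049 -- GO outranks REACT
```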

Avoid a pick-and-mix approach. It is better to draw like terms from the same ontology, this ensures overall coherence, and allows reasoning to be better leveraged.

If you are creating a LinkML enum, a good rule of thumb is that all ‘meaning’ annotations should come from the same ontology. Of course, this may not always be the case.

For example, the enum for sex_chromosome_type in chromo is all drawn from GO:

  SexChromosomeType:
    description: >-
      what type of sex chromosome
    permissible_values:
      X:
        meaning: GO:0000805
      Y:
        meaning: GO:0000806
      W:
        meaning: GO:0000804
      Z:
        meaning: GO:0000807

Similarly for the gp2term relationship field in GPAD, these are all drawn from RO:

(note that part_of, BFO:0000050, is actually in RO, not BFO, despite the ID space)

However, entity type is drawn from SO, GO, and PR. 

It is a bad smell to have a mix of different ontologies for what should be a set of similar entities, e.g.:

  metabolism_enum:
    ## TODO: the mappings below are automated
    permissible_values:
      anaerobic:
        description: anaerobic
        meaning: http://purl.obolibrary.org/obo/NCIT_C103137
      strictly anaerobic:
        description: strictly anaerobic
        meaning: http://identifiers.org/teddy/TEDDY_0000007
      obligate aerobic:
        description: obligate aerobic
        meaning: http://purl.obolibrary.org/obo/NCIT_C28341
      aerobic:
        description: aerobic
        meaning: http://purl.obolibrary.org/obo/EO_0007024
      facultative:
        description: facultative
        meaning: http://purl.obolibrary.org/obo/OMP_0000087
      microaerophilic:
        description: microaerophilic
        meaning: http://purl.obolibrary.org/obo/MICRO_0000515
      obligate anaerobic:
        description: obligate anaerobic
        meaning: http://purl.obolibrary.org/obo/NCIT_C103137
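The ‘all meanings from one ontology’ rule of thumb lends itself to a simple lint check. Here is a minimal sketch (a hypothetical helper, assuming enums as parsed-YAML dicts; it extracts prefixes from OBO PURLs or plain CURIEs):

```python
import re

def meaning_prefixes(enum):
    """Collect the ontology prefixes used in an enum's 'meaning'
    annotations; more than one prefix suggests a pick-and-mix smell."""
    prefixes = set()
    for pv in enum.get("permissible_values", {}).values():
        meaning = pv.get("meaning")
        if not meaning:
            continue
        m = re.search(r"obo/([A-Za-z]+)_", meaning)
        prefixes.add(m.group(1) if m else meaning.split(":", 1)[0])
    return prefixes

good = {"permissible_values": {"X": {"meaning": "GO:0000805"},
                               "Y": {"meaning": "GO:0000806"}}}
bad = {"permissible_values": {
    "anaerobic": {"meaning": "http://purl.obolibrary.org/obo/NCIT_C103137"},
    "aerobic": {"meaning": "http://purl.obolibrary.org/obo/EO_0007024"}}}
assert meaning_prefixes(good) == {"GO"}
assert len(meaning_prefixes(bad)) > 1  # mixed ontologies: bad smell
```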

Some ontologies are themselves ad-hoc in their scoping, which can make it harder to determine which ontologies to go to find terms or request terms. Always favor ontologies with clear scope. We are actively working to fix scope problems in OBO:

https://github.com/orgs/OBOFoundry/projects/7

Avoid shadow terms

Many ontologies mint “shadow concepts”. For example OBA may have the core concept of “blood pressure”. Another ontology may have a random mix of “datum” or “measurement” classes, e.g. “blood pressure datum”. Avoid these terms. Even if you want to describe a blood pressure measurement, just use the core concept. The fact that the concept is deployed in the context of a measurement should be communicated externally, e.g. in the data model you are using, not by precoordination.

By using the core concept you increase the overall coherency and connectivity of the information you are describing. Many shadow terms are in ‘application ontologies’ and are not properly linked to the core concepts.

Note that my own recommendations may not be aligned with the broader OBO community – see this ticket for further discussion.

Exercise due diligence in looking for the terms

Make sure the concept you need is definitely not present in the ontology before requesting. 

Learn how to use ontology search tools appropriately. I recommend:

  • OLS for OBO and selected other ontologies
  • Bioportal for searching the broadest set of ontologies

Bioportal has the broadest collection (including all of OBO), but there is less of a filter, and ontologies may not be open. On the other hand, being in OBO is not a guarantee of quality, and there may be good reasons to use a non-OBO ontology.

Expert ontologists may like to use Ontobee, but there are many things to be aware of before using it:

  • The update frequency is lower than that of OLS
  • It does not display the partonomy, which is crucial for understanding many of the ontologies we work on
  • Overall it presents a more ‘close to the base metal’ OWL model. This is fine for ontologists, but it is better not to point biologists here

If you are not experienced with ontologies, and in particular OBO, there are many things that can potentially trip you up. Don’t be afraid to ask about these; many people in the same shoes as you have been confused.

Potential confusion point: Some ontologies import other ontologies, or parts of other ontologies

This means e.g. if you are searching for a chemical element like ‘nitrate’ you may find results “in” ENVO, because ENVO imports a portion of CHEBI.

Bioportal does a good job of separating out the core concept/IRI from imports of it:

In these cases, the ID is the same, but you should be aware what the true parent ontology is.

OLS also does a good job of collapsing these:

Ontobee:

Potential confusion point: Some ontologies replicate parts of other ontologies

This is distinct from the import case above. In this case, one ontology may intentionally or unintentionally duplicate concepts from another. For example, the OMIT ontology copies large amounts of MESH and gives these new IDs. In these cases you should identify the authority and use the ID from there.

For example, a search for cockatoos in Bioportal shows the MESH term, MESH IDs reused elsewhere, and unlinked concept IDs presumably denoting the same concept.

Searching semantic web schemas

For searching for terms in semantic web standards, https://lov.linkeddata.es/ is probably the best option.

Search tips

Note that sometimes you need to do more work than just entering a string. Most ontology search tools won’t do stemming, etc. I recommend searching for similar concepts and exploring the neighborhood. Understanding the structure of the ontology will help you search more effectively.

For example, imagine you are looking for a concept ‘bicycle’ in a product ontology. Just because nothing comes back in a search for ‘bicycle’ doesn’t mean the concept isn’t there. It may be under a synonym like ‘bike’. Explore the ontology. Look for similar concepts like unicycle or car. If you see that the ontology has a class vehicle, with subclasses like 4-wheeled, 2-wheeled, and 1-wheeled, but doesn’t have anything under 2-wheeled, you can be confident the concept is missing.

Don’t get too hung up on this if you are not a domain scientist and don’t understand the concepts in the ontology, but it is usually a good idea to do this kind of initial exploration.
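The bicycle scenario above can be made concrete. The toy product ontology below is entirely hypothetical, but it illustrates why a plain-label search misses a term filed under a synonym, and why checking synonyms before concluding "missing" matters:

```python
# Hypothetical product ontology: labels mapped to synonyms and children.
# Most real ontology search tools match labels and synonyms but do no
# stemming, so an exact-string miss does not prove the concept is absent.
ontology = {
    "vehicle":          {"synonyms": [], "children": ["2-wheeled vehicle"]},
    "2-wheeled vehicle": {"synonyms": [], "children": ["bicycle"]},
    "bicycle":          {"synonyms": ["bike"], "children": []},
}

def search(term: str) -> list[str]:
    """Return labels matching the term exactly or via a synonym."""
    term = term.lower()
    return [label for label, entry in ontology.items()
            if term == label or term in (s.lower() for s in entry["synonyms"])]

assert search("bike") == ["bicycle"]   # found via its synonym
assert search("unicycle") == []        # genuinely absent from this ontology
```

Real search services (OLS, Bioportal) index synonyms for you; the point is to explore the neighborhood before filing a "missing term" request.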

Mapping or searching for sets of terms

If you have a set of terms to map and you want to get a sense of coverage in different ontologies the parallel tools are:

  • Zooma
  • Bioportal annotator

This topic deserves its own post, so I won’t go into more detail here.

Use GitHub for making requests

Make sure to consult the group best practice document on GitHub and the group GitHub overview

If you are sure the ontology doesn’t have the concept you need, you will want to make a request.

In general, always use GitHub for making the request. If you know the email or Slack handle of someone with the ability to make terms it may be tempting to contact them directly, but it’s better to use GitHub. This makes the process transparent, and you will be helping people who come after you.

If you cannot find a GitHub repo for the ontology/standard, this is a bad smell and maybe you should reconsider whether you want to use this ontology. Having a public repo (GitHub/GitLab/Bitbucket/etc) is a requirement of OBO. Note this same advice applies to software.

For OBO ontologies, you can easily find the GitHub issue tracker for any ontology via http://obofoundry.org 

For some ontologies, there may be specialized term request systems (PRO, CHEBI). Go by the norms of the particular ontology, but my own preference is always to use GitHub, for the reasons stated above.

Search existing issues before making a request. Others may have requested the same term before you. Maybe it is out of scope, and the term belongs in a different ontology; searching the issue tracker should reveal this (this is why it is good to always stay within GitHub when making requests rather than using private communication). It may also be the case that the ontology refuses to grant requests for reasons that seem arbitrary. In this case there may be an issue discussing the pros and cons. Read it and add your voice, but in a constructive fashion. Maybe a simple up-vote is sufficient, or a comment like “similar to Mary, I also would find a term ‘Hawaiian pizza’ very valuable”.

Always link the issue that you are working on to the term request. In GitHub, this is simply a matter of putting the URL in. You can either link from the request issue back to the parent issue, or from the parent issue to the request. 

(Note you will always be working to an issue. If you’re not, stop what you are doing and make one!)

You should also search the issue tracker to see if others have made the same request, and avoid making duplicate issues. But you can still comment on existing ones. However, avoid tacking on or extending the scope of an existing issue. If there is a similar issue but your request is different, link to the existing issue (#NUMBER – you should know GitHub conventions), e.g. “My request is similar to Fred’s in #1234, but I need a foo not a bar”.

Read the CONTRIBUTING.md file

An increasing number of ontologies and other modeling artefacts include this in their repo. It should include guidance for people like you that is more specialized than this generalized guide. Read it!

Provide as much help as possible in your request

Remember, many ontologies are under-funded and requests are often fulfilled by our collaborators. Provide as much help as possible to them. If you are not knowledgeable about the domain, that is OK, but you can still provide context about your project.

e.g. “hey, I have been asked to provide a UI selection box with different pizza types. My boss gave me this list of ten pizza types, but I don’t eat pizza, and I’m not sure how to map them to your ontology, and I may have some duplicates. It looks like you don’t have ‘Hawaiian’, but I’m wondering if maybe this ‘pineapple and ham’ is the same thing, or is there some subtle difference? If it’s the same, shouldn’t there be a synonym added?”

If you have been given a spreadsheet, you can provide a link to it. If you are mapping a data table, provide a link, or selected examples, as this can help orient the person fulfilling the request. Remember, people aren’t mind readers.

For example, making a ticket where you say “I need you to add HSap” is not helpful. But if you can say that the ‘HSap’ value appears in a column called species, and that the other values are ‘MMus’ and ‘DMel’, this gives the ontology developer the context they need, avoiding confusing back and forth on the issue tracker.

Analogies are useful: If you can find analogous terms use these as examples.

If you have a domain scientist handy, you may want to engage them before making requests – e.g. if they can provide definitions. 

There is no rule as to whether to make one ticket/issue with multiple terms, or one ticket per term. If you think each term is nuanced and requires individual discussion, make separate tickets. If you are unsure then make an initial exploratory ticket. It’s usually OK to make a ticket for a question (GitHub even has a category/label for this).

Avoid making 100 requests only to discover that all of your requests are out of scope, requiring tedious closing of multiple tickets.

Be proactive and make pull requests

Even ontologies that have dedicated funding are under-resourced. You can help a lot by offering to make pull requests. If the ontology is a well-behaved OBO ontology there should be a clear procedure for doing this (if the ontology was made with ODK or follows ODK conventions, the file you should edit is src/ontology/foo-edit.owl in the repo).

Note that editing the OWL file usually entails using Protege. Basic Protege skills are worth learning. Normally this would not be required of most users, but in my group having basic Protege driving skills is useful and strongly encouraged.

In some cases you don’t need to edit the file – the ontology SOP may dictate editing a TSV in github or google sheet, with this compiled to OWL. Consult the contributor docs for that ontology (and if these are lacking, gently suggest ways to improve this to make it easier for those who come after you to contribute).

You will likely need to be added to an idranges file – again if the ontology follows standard conventions this will be obvious.

It is a good idea to check whether an ontology is welcoming of PRs. This should be obvious from the pulls tab in GitHub. In general most ontologies should be, but some ontology groups may have trouble adapting to the times and may still prefer issues. Also, in many cases the addition of terms is best done by an expert.

In all cases, use your best judgment!

Follow templates where possible

Many repos are set up with GitHub issue templates. If the repo you are requesting in does not have them, you may want to gently suggest they add some (or better yet, make a PR for this, using ODK as a guide). If you are reading this document then you likely have more GitHub-fu than the ontologist/curator fulfilling the request, so you can be helpful!

In some cases, ontologies may have set up a templating system (robot or dosdp). You can be super-helpful and follow the system that has been set up. In some cases this means filling in a predefined google sheet (e.g. with columns for name, definition, parent). In some cases you can make a PR on a TSV in the repo. This is an evolving area, so stay tuned. If the process is not clear there are people in the group with expertise who can help.
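To make the ROBOT template route concrete, here is a minimal sketch of generating such a TSV. The term IDs, labels, and the `EX` prefix are hypothetical; the second row uses ROBOT's template strings (`ID`, `LABEL`, `A IAO:0000115` for a definition annotation, `SC %` for a subclass-of axiom):

```python
import csv, io

# Row 1: human-readable headers. Row 2: ROBOT template strings telling
# robot how to turn each column into OWL axioms. Row 3+: one term per row.
rows = [
    ["ID", "Label", "Definition", "Parent"],
    ["ID", "LABEL", "A IAO:0000115", "SC %"],
    ["EX:0000001", "Hawaiian pizza", "A pizza with ham and pineapple.", "EX:0000000"],
]
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
template_tsv = buf.getvalue()

assert template_tsv.splitlines()[1].split("\t")[3] == "SC %"
```

The resulting file would then be compiled with something like `robot template --template terms.tsv --output terms.owl` (exact invocation depends on the ontology's own Makefile and prefix setup).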

If all else fails, make your own ‘application ontology’

Sometimes there may simply be no ontology fit for purpose. Or existing ontologies may simply be unable to fulfil your request. It may be the case that there is an ontology called ‘pizza ontology’ squatting this conceptual space in OBO, but they may fail to grant your term requests for arbitrary reasons (“we don’t add Hawaiian pizzas, as we object in principle to putting pineapple on pizza”), or have unrealistic timelines (“we have a pizza modeling discussion set for 2 years from now at the annual pizza ontology conference, we may consider putting your request before the committee then, but it is unlikely to be ratified for 4 years“). They may make it impossible to add terms by being ontological perfectionists (“we will add your pizza if you add a perfect OWL logical axiomatization describing the topological and gastronomic properties of the pizza according to our undocumented design practice”). They may also simply model things incorrectly (“thanks for your request. We have added ‘Hawaiian Pizza’ as a subclass of ‘Hawaii’. Aloha!”)

In general this won’t happen, especially with well-behaved OBO ontologies, but there may be some holdouts! Be patient, and offer to make PRs (see above).

In some cases, such as those above, you may be justified in making your own ontology, using tools like ODK and ROBOT. Consult first! And never do this without first making requests.

In some cases you don’t need to make a new ontology, you can just create stubs. E.g. for a KG ingest, you can ‘inject’ something into the biolink-model, e.g. biolink:Pizza. There are various downsides to the injection approach; it may be better to use a different namespace. Depending on the project context it may or may not matter if the injected type resolves. Regardless, when doing this, add a comment to your code with a link to the ticket.

Be bold and be collaborative

Whether you are making or fulfilling a request, you are all part of the same larger community of people working to make data more computable. Be as constructive and as helpful as possible, but also don’t hold back or be shy. Ultimately the ontology is there to serve you. But if it does not serve your need, or is too confusing in some aspect, then it’s likely the same for others.

Discussion

Overall the processes described above may seem overly complex or onerous. In fact they are not so different from analogous processes such as getting features into a piece of open source software.

Over the years there have been various proposals and implementations of ‘term brokers’ which act as both triage and a place to get an identifier for a term instantly. An example implementation is TermGenie.

One reason why term brokers have not displaced the GitHub procedure above as a way of getting terms into ontologies is that they have a strong tendency to accumulate ontological debt (akin to technical debt). It’s easy to stick a bunch of junk terms into an ontology, but maintaining them and dealing with the downstream costs of including them can be very high.

This topic needs a blog post all of its own, stay tuned…

Design Patterns · Ontologies · Reasoning · ROBOT

Avoid mixing parthood with cardinality constraints

We frequently have situations where we want to make an assertion combining parthood and cardinality. Here, cardinality pertains to the number of instances.

Section 5.3 of the OWL2 primer has an example of a cardinality restriction:

Individual: John
   Types: hasChild max 4 Parent

This is saying that John has at most 4 children that are parents.

A more natural biological example of a cardinality restriction is stating something like:

Class: Hand
   SubClassOf: hasPart exactly 5 Finger

i.e. every hand has exactly 5 fingers. Remember we are following the open world assumption here – we are not saying that our ontology has to list all 5 fingers. We are stating that in our model of the world, any instance of a hand h entails the existence of 5 distinct finger instances f1…f5, each of which is related to that hand, i.e. fn part-of h. Furthermore, there are no other finger instances outside the set f1…f5 that are part of h.

The precise set-theoretic semantics are provided in the OWL2 Direct Semantics specification.
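The intended entailment is easy to check directly under a closed-world reading of explicit instance data. A minimal sketch (all instance names are hypothetical) of the kind of counting OWL's open-world semantics does not give you:

```python
# Closed-world check of "every hand has exactly 5 fingers" over an
# explicitly enumerated instance graph. All names are hypothetical.
part_of = [
    ("f1", "hand1"), ("f2", "hand1"), ("f3", "hand1"),
    ("f4", "hand1"), ("f5", "hand1"),
]
finger_instances = {"f1", "f2", "f3", "f4", "f5"}

def finger_count(hand: str) -> int:
    """Count distinct finger instances asserted to be part of the hand."""
    return sum(1 for part, whole in part_of
               if whole == hand and part in finger_instances)

assert finger_count("hand1") == 5   # satisfies "hasPart exactly 5 Finger"
```

Under the OWA, by contrast, an OWL reasoner would treat the cardinality axiom as entailing the existence of 5 such instances whether or not they are listed.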

This cardinality axiom seems like a perfectly natural, reasonable, and useful thing to add to an ontology (avoiding for now discussions about “canonical hands” and “variant hands”, for example in cases like polydactyly; let’s just assume we are describing “canonical anatomy”, whatever that is).

A canonical hand, with 5 fingers

And in fact there are various ontologies in OBO that do this kind of thing.

However, there is a trap here. If you try to run a reasoner such as HermiT you will get a frankly perplexing and unhelpful error such as this:

An error occurred during reasoning: Non-simple property 'BFO_0000051' or its inverse appears in the cardinality restriction 'BFO_0000051 exactly 5 UBERON_1234567'

If you have a background in computer science and some time to spare, you can go and look at section 11 of the OWL specification (global restrictions on axioms in OWL 2 DL) to see what magical special laws you must adhere to when writing OWL ontologies that conform to the DL profile, but it’s not particularly helpful for understanding:

  • why you aren’t allowed to write this kind of thing, and
  • what the solution is.

Why can’t I say that?

A full explanation is outside the scope of this article, but the basic problem arises when combining transitive properties such as part-of with cardinality constraints. Doing so makes the ontology fall outside the “DL” profile, which means that reasoners such as HermiT can’t use it; rather than ignore the offending axiom, HermiT will complain and refuse to reason.

Well I want to say that anyway, what happens if I do?

You may choose to assert the axiom anyway – after all, it may feel useful for documentation purposes, and people can ignore it if they don’t want it, right? That seems OK to me, but I don’t make the rules.

Even if you don’t intend to stray outside DL, an insidious problem arises here: many OBO ontologies use Elk as their reasoner, and Elk will happily ignore these DL violations (as well as anything else it can’t reason over, outside its variant of the EL++ profile). This in itself is totally fine – its inferences are sound/correct, they just might not be complete. However, we have a problem if an ontology includes these DL violations, and the relevant portion of that ontology is extracted and then used as part of another ontology with a DL reasoner such as HermiT that fails fast when presented with these axioms. In most pipelines, if an ontology can’t be reasoned over, it can’t be released, and everything is gummed up until an OWL expert can be summoned to diagnose and fix the problem. Things get worse if an ontology that is an N-th order import has a DL violation, as it may require waiting for all imports in the chain to be re-released. Not good!

Every now and then this happens with an OBO ontology and things get gummed up, and people naturally ask the usual questions: why did this happen? Why can’t I say this perfectly reasonable thing? How do I fix it? Hence this article.

How do we stop people from saying that?

Previously we didn’t have a good system for stopping people from making these assertions in their ontologies, and the assertions would leak via imports and imports of imports, and gum things up.

Now we have the wonderful robot tool and the relatively new validate-profile command, which can be run like this:

robot validate-profile --profile DL \
  --input my-ontology.owl \
  --output validation.txt

This will ensure that the ontology is in the DL profile. If it isn’t, this will fail, so you can add this to your Makefile in order to fail fast and avoid poisoning downstream ontologies.

This check will soon be integrated into the standard ODK setup.

OK, so how do I fix these?

So you have inadvertently strayed outside the DL profile and your ontology is full of has-parts with cardinality constraints – you didn’t mean it! You were only alerted when a poor downstream ontology imported your axioms and tried to use a DL reasoner. So what do you do?

In all honesty, my advice to you is ditch that axiom. Toss it in the bin. Can it. Flush that axiom down the toilet. Go on. It won’t be missed. It was already being ignored. I guarantee it wasn’t doing any real work for you (here work == entailments). And I guarantee your users won’t miss it.

A piece of advice often given to aspiring writers is to kill your darlings, i.e. get rid of your most precious and especially self-indulgent passages for the greater good of your literary work. The same applies here. Most complex OWL axioms are self-indulgences.

Even if you think you have a really good use case for having these axioms, such as representing stoichiometry of protein complexes or reaction participants, the chances are that OWL is actually a bad framework for the kind of inference you need, and you should be using some kind of closed-world reasoning system, e.g. one based on datalog.

OK, so maybe you don’t believe me, and you really want to say something that approximates your parts-with-numbers. Well, you can certainly weaken your cardinality restriction to an existential restriction (provided the minimum cardinality is above zero; for maxCardinality of zero you can use a ComplementOf). So in the anatomical example we could say

Class: Hand
   SubClassOf: hasPart some Finger

This is still perfectly sound – it is not as complete as your previous statement, but does that matter? What entailments were you expecting from the cardinality axiom? If it is intended more for humans, you can simply annotate your axiom with a comment indicating that hands typically have 5 fingers.

OK, so you still find this unsatisfactory. You really want to include a cardinality assertion, dammit! Fine, you can have one, but you won’t like it. We reluctantly added a sub-property of has-part to RO called ‘has component’:

In all honesty the definition for this relation is not so great. Picking holes in it is not so hard. It exists purely to get around this somewhat ridiculous constraint, and for you to be able to express your precious cardinality constraint, like this:

Class: Hand
   SubClassOf: hasComponent exactly 5 Finger

So how does this get around the constraint? Simple: hasComponent is not declared transitive (recall that transitivity is not inherited down a property hierarchy). Also, it is a different relation (a subproperty) from has-part, so you might not get the inferences you expect. For example, this does NOT prevent me from making an instance of hand that has as parts 6 distinct fingers – the constraint only applies to the more specific relation, which we have not asserted and which is not inferred.
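A toy illustration of why the workaround is weak (all instance names are hypothetical): the "exactly 5" axiom ranges over hasComponent edges only, and nothing forces part-of edges to be mirrored as hasComponent edges, so a six-fingered hand raises no alarm:

```python
# A hand with SIX fingers asserted via part_of, and no hasComponent
# assertions at all. Names are hypothetical.
part_of = {("f%d" % i, "hand1") for i in range(1, 7)}
has_component = set()   # nothing asserted on the more specific relation

def component_count(whole: str) -> int:
    return sum(1 for part, w in has_component if w == whole)

# Under OWL's open world, zero asserted hasComponent edges entail
# nothing, so "hasComponent exactly 5 Finger" flags no problem, even
# though part_of plainly relates 6 fingers to the hand:
assert component_count("hand1") == 0
assert len([p for p, w in part_of if w == "hand1"]) == 6
```

In other words, the axiom is satisfiable alongside the six-fingered hand precisely because the two relations are decoupled.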

You can make an argument that this is worse than useless – it gives no useful entailments, and it confuses people to boot. I am responsible for creating this relation, and I have used it in ontologies like Uberon, but I now think this was a mistake.

Other approaches

There are other approaches. For a fixed set of cardinalities you could create subproperties, e.g. has-1-part-of, has-2-parts-of, etc. But these would still be less expressive than you would like, and would confuse people.

A pattern that does work in certain cases such as providing logical definitions for things like cells by number of nuclei is to use the EL-shunt pattern (to be covered in a future article) and make characteristic classes in an ontology like PATO.

While this still isn’t as expressive, it allows you to use proxies for cardinality in logical definitions (which do actual work for you), and shunt off the cardinality reasoning to a smaller ontology — where it’s actually fine to just assert the hierarchy.

But this doesn’t work in all cases. There is no point making characteristics/qualities if they are not reused. It would be silly to do this with the hand example (e.g. making a has5fingers quality).

Isn’t this all a bit annoying?

Yes, it is. In my mind we should be free to state whatever axioms we need for our own use cases, expressivity be damned. I should be able to go outside DL, in fact to use anything from FOL or even beyond. We should not be condemned to live in fear of stepping outside of decidability (which sounds like something you want, but in practice is not useful). There are plenty of good strategies for employing a hybrid reasoning strategy, and in any case, we never use all of DL for most large OBO ontologies anyway.

But we have the technology we do, and we have to live with it for now.

TL;DR

  • don’t mix parthood and cardinality
  • you probably won’t miss that cardinality restriction anyway
  • no really, you won’t
  • use robot to check your profiles

Design Patterns · Ontologies · Ontology Development Kit · Standards

Aligning Design Patterns across Multiple Ontologies in the Life Sciences

I was delighted to give the keynote at the ISWC 2020 Workshop on Ontology Design and Patterns today. You can see the video or my slides; I’m including a brief summary here.

Opening slide: Aligning Design Patterns Across Multiple Ontologies in the Life Sciences

As this was a CS audience that may be unfamiliar with some of the issues we are tackling in OBO, I framed this in terms of the massive number of named entities in the life sciences, all of which have to be categorized if we are to be able to find, integrate, and analyze data:

there are many named things in the life sciences

We created OBO to help organize and integrate the ontologies used to categorize these things

OBO: social and technological framework to categorize all the things

When we started, many ontologies were quite ‘SKOS-like’ in their nature, with simple axiomatization, and a lack of integration:

GO and other ontologies were originally SKOS-like

OWL gives us a framework for more modular development, leveraging other ontologies, and using reasoning to automate classification:

OWL reasoning can be used to maximize leverage in a modular approach: here re-using CHEBI’s chemical classification in GO

This is all great, but when I look at many ontologies I often see two problems, often in the same ontology, under- and over- axiomatization:

Finding the balance between under and over axiomatization

In some ontologies I see what I sometimes call ‘Rococo OWL’: over-axiomatization in an elaborate, dense set of axioms that looks impressive but doesn’t deliver much functional utility (I plan to write about this in a future post).

Rococo: an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe l’oeil frescoes to create surprise and the illusion of motion and drama. The style was highly theatrical, designed to impress and awe at first sight; a movement that extolled frivolity, luxury and dilettantism, patronised by a corrupt and decadent ancien régime. Rococo ended in the revolution of 1789, with the bloody end of a political and economic system.

We developed Dead Simple OWL Design Patterns (DOSDPs) to make it easier to write down and reuse common OWL patterns of writing definitions, primarily for compositional classes following the Rector Normalization pattern.

Example DOSDP yaml file

I gave an example of how we used DOSDPs to align logical definitions across multiple phenotype ontologies (the uPheno reconciliation project). I would like to expand on this in a future post.

multiple phenotype databases for humans and model organisms, and their respective vocabularies/ontologies

I finished with some open-ended questions about where we are going and whether we can try and unify different modeling frameworks that tackle things from different perspectives (closed-world shape-based or object-oriented modeling, template/pattern frameworks, lower-level logical modeling frameworks).

Unifying multiple frameworks – is it possible or advisable?

Unfortunately due to lack of time I didn’t go into either ROBOT templates or OTTR templates.

And in closing, to emphasize that the community and social aspects are as important or more important than the technology:

Take homes

And some useful links:

http://obofoundry.org/resources

https://github.com/INCATools/

  • ontology-development-kit
  • dead_simple_owl_design_patterns (dos-dps)
  • dosdp-tools

Special thanks to everyone in OBO and the uPheno reconciliation effort, especially David Osumi-Sutherland, Jim Balhoff, and Nico Matentzoglu.

And thanks to Pascal Hitzler and Cogan Shimizu for putting together such a great workshop.

Knowledge Graphs

Edge properties, part 1: Reification

This is the first part of what will be a multi-part series. See also part 2 on the singleton property pattern.

One of the ways in which RDF differs from Labeled Property Graph (LPG) models such as the data model in Neo4J is that there is no first-class mechanism for making statements about statements. For example, given a triple :P1 :interacts-with :P2, how do we say that triple is supported by a particular publication?

With an LPG, an edge can have properties associated with it in addition to the main edge label. In Neo4J documentation, this is often depicted as tag-values underneath the edge label. So if the assertion that P1 interacts with P2 is supported by a publication such as PMID:123 we might write this as:

(Note that some data models such as Neo4J don’t directly support hypergraphs, and if we wanted to represent pmid:123 as a distinct node with its own properties, then the association between the edge property and the node would be implicit rather than explicit.)

In RDF, properties cannot be directly associated with edges. How would we represent something like the above in RDF? In fact there are multiple ways of modeling this.

A common approach is reification. Here we would create an extra node representing the statement, associate this with the original triple via three new triples, and then the statement node can be described as any other node. E.g.

This can be depicted visually as follows (note that while the first triple directly connecting P1 and P2 may seem redundant, it is not formally entailed by RDF semantics and should also be stated):
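Spelled out as tuples, the reification looks like this. The vocabulary is standard RDF reification (rdf:subject, rdf:predicate, rdf:object); node names like ex:stmt1 are hypothetical:

```python
# The reification of  :P1 :interacts_with :P2 , plus provenance metadata.
stmt = "ex:stmt1"
triples = [
    ("ex:P1", "ex:interacts_with", "ex:P2"),       # original triple, still asserted
    (stmt, "rdf:type", "rdf:Statement"),           # the statement node itself
    (stmt, "rdf:subject", "ex:P1"),
    (stmt, "rdf:predicate", "ex:interacts_with"),
    (stmt, "rdf:object", "ex:P2"),
    (stmt, "ex:supported_by", "ex:pmid123"),       # metadata about the statement
]
assert len(triples) == 6   # one reified, annotated edge costs six triples
```

One direct edge plus one piece of metadata has ballooned into six triples, which is the verbosity complained about below.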

This is obviously quite verbose, so there are different visual conventions and syntactic shortcuts to reduce bloat.

RDF* provides a more convenient compact syntax for writing edge properties:

  • <<:P1 :interacts_with :P2>>  :supported_by :pmid123 .

Here the <<…>> can be seen as acting as syntactic sugar, with the above single line of RDF* expanding to the 6 triples above.

RDF* is not yet a W3C standard, but a large number of tools support it. It is accompanied by SPARQL* for queries.

There is a lot more to be said about the topic of edge properties in LPGs and RDF, I will try to cover these in future posts. This includes:

  • Alternatives to RDF reification, of which there are many
    • Named Graphs, which capitalize on the fact that triplestores are actually quad stores, and use the graph with which a triple is associated as a site of attachment for edge properties.
    • The Singleton Property Pattern (SPP). This has some adherents, but is not compatible with OWL-DL modeling. This is addressed in part two of this series
    • Alternative Reification Vocabularies. This includes the OWL reification vocabulary. It’s immensely depressing and confusing and under-appreciated that OWL did not adopt the RDF reification vocabulary, and the OWL stack fails horribly when we try and use the two together. Additionally OWL reification comes with annoying limitations (see my answer on stack overflow about RDF vs OWL reification).
    • RDF* can be seen as an alternative or it can be seen as syntactic sugar and/or a layer of abstraction over existing RDF reification
    • various other design patterns such as those in https://www.w3.org/TR/swbp-n-aryRelations/
  • Semantics of reification. RDF has monotonic semantics. This means that adding new triples (including reification triples) cannot retract the meaning of any existing triples (including the reified triples). So broadly speaking, it’s fine to annotate a triple with metadata (e.g. who said it), but not with something that alters its meaning (e.g. a negation qualifier, or probabilistic semantics). This has implications for how we represent knowledge graphs in RDF, and for proposals for simpler OWL layering on RDF. It also has implications for inference with KGs, both classic deductive boolean inference as well as modern KG embedding and associated ML approaches (e.g. node2vec, embiggen).
  • Alternate syntaxes and tooling that is compatible with RDF and employs higher level abstractions above the verbose bloated reification syntax/model above. This includes RDF*/SPARQL* as well as KGX.
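The named-graph alternative listed above can be sketched in a few lines: a triplestore is really a quad store, so the graph slot can serve as the attachment point for edge metadata. Graph and property names here are hypothetical:

```python
# Each "triple" carries a fourth element naming the graph it lives in;
# metadata is then attached to the graph rather than to the triple.
quads = [
    ("ex:P1", "ex:interacts_with", "ex:P2", "ex:g1"),
]
graph_metadata = {
    "ex:g1": {"ex:supported_by": "ex:pmid123"},
}

graph = quads[0][3]
assert graph_metadata[graph]["ex:supported_by"] == "ex:pmid123"
```

The trade-off is that the graph slot is then unavailable for its more usual jobs, such as tracking provenance of whole datasets.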

Next: Edge properties, part 2: singleton property pattern (and why it doesn’t work)

Uncategorized

The Open World Assumption Considered Harmful


A frequent source of confusion with ontologies, and more generally with any kind of information system, is the Open World Assumption. This trips up inexperienced users, but as I will argue in this post, information providers could do much more to help them. But first, an explanation of the terms:

With the Open World Assumption (OWA) we do not make any assumptions based on the absence of statements. In contrast, with the Closed World Assumption (CWA), if something is not explicitly stated to be true, it is assumed to be false. As an example, consider a pet-owner database with the following facts:

Fred type Human .
Farrah type Human .

Foofoo type Cat .
Fido type Dog .

Fred owns Foofoo .
Farrah owns Fido .

Depicted as:

depiction of pet owners RDF graph, with triples indicated by arrows. RDF follows the OWA: the lack of a triple between Fred and Fido does not entail that Fred doesn’t own Fido.

Under the CWA, the answer to the question “how many cats does Fred own” is 1. Similarly, for “how many owners does Fido have” the answer is also 1.

RDF and OWL are built on the OWA, where the answer to both questions is: at least 1. We can’t rule out that Fred also owns Fido, or that he owns animals not known to the database. With the OWA, we can answer the question “does Fred own Foofoo” decisively with a “yes”, but if we ask “does Fred own Fido” the answer is “we don’t know”. It’s not asserted or entailed in the database, and neither is the negation.
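The contrast can be made mechanical. Using the pet-owner facts above, a CWA system simply counts asserted triples, while under the OWA the same count is only a lower bound and an unasserted triple is "unknown" rather than false:

```python
# The pet-owner facts from the example above.
owns = {("Fred", "Foofoo"), ("Farrah", "Fido")}
cats = {"Foofoo"}

# CWA: absence of a triple means false, so counting is decisive.
cwa_cats_owned_by_fred = sum(1 for o, p in owns if o == "Fred" and p in cats)
assert cwa_cats_owned_by_fred == 1          # "exactly 1"

# OWA: the asserted triples only establish a lower bound ("at least 1").
owa_lower_bound = cwa_cats_owned_by_fred
assert owa_lower_bound >= 1

# "Does Fred own Fido?" Not asserted, and neither is the negation --
# CWA answers "no"; OWA answers "unknown".
assert ("Fred", "Fido") not in owns
```

The code itself is necessarily closed-world (Python can only inspect what is in the set); the comments mark where an OWA reasoner would refuse to conclude anything.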

Ontology formalisms such as OWL are explicitly built on the OWA, whereas traditional database systems have features constructed on the CWA.

OWL gives you mechanisms to add closure axioms, which allow you to be precise about what is known not to be true, in addition to what is known to be true. For example, we can state that Fred does not own Fido, which closes the world a little. We can also state that Fred only owns Cats, which closes the world further, but still does not rule out that Fred owns cats other than Foofoo. We can also use an OWL Enumeration construct to exhaustively list the animals Fred does own, which finally allows us to answer the question “how many animals does Fred own” with a specific number.

OWL ontologies and databases (aka ABoxes) often lack sufficient closure axioms to answer questions involving negation or counts. Sometimes this is simply because it’s a lot of work to add these additional axioms, work that doesn’t always have a payoff given typical OWL use cases. But often it is because of a mismatch between what the database/ontology author thinks they are saying, and what they are actually saying under the OWA. This kind of mismatched intent is quite common among OWL ontology developers.

Another common trap is reading OWL axioms such as Domain and Range as Closed World constraints, as they might be applied in a traditional database or a CWA object-oriented formalism such as UML.

Consider the following database plus ontology in OWL, where we attempt to constrain the ‘owns’ property so that owners must be humans:

owns Domain Human
Fido type Dog
Fred type Human
Fido owns Fred

We might expect this to yield some kind of error. Using our own knowledge of the world, something is clearly amiss here (most likely the direction of the final triple has been accidentally inverted). But if we feed this into an OWL reasoner to check for incoherencies (see previous posts on this topic), it will report everything as consistent. However, if we examine the inferences closely, we will see that it has inferred Fido to be both a Dog and a Human. It is only after we have stated explicit axioms that assert or entail that Dog and Human are disjoint that we will see an inconsistency:

OWL reasoner entailing that Fido is both a Dog and a Human, with the latter entailed by the Domain axiom. Note the ontology is still coherent, and only becomes incoherent when we add a disjointness axiom
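The trap can be simulated in a few lines of toy Python: the Domain axiom acts as an inference rule that adds a type, rather than as a constraint that is checked, and an inconsistency only surfaces once a disjointness axiom is added (this is a sketch, not a real OWL reasoner):

```python
# The example database with the accidentally inverted final triple.
triples = {
    ("Fido", "type", "Dog"),
    ("Fred", "type", "Human"),
    ("Fido", "owns", "Fred"),
}
domain = {"owns": "Human"}  # owns Domain Human
disjoint = set()            # pairs of classes asserted disjoint

def infer(triples):
    # Forward-chain the domain axiom: the *subject* gains the domain class.
    inferred = set(triples)
    for (s, p, o) in triples:
        if p in domain:
            inferred.add((s, "type", domain[p]))
    return inferred

def consistent(triples):
    # Inconsistent only if some individual has two classes asserted disjoint.
    types = {}
    for (s, p, o) in triples:
        if p == "type":
            types.setdefault(s, set()).add(o)
    return not any({a, b} <= ts for ts in types.values()
                   for (a, b) in disjoint)

entailed = infer(triples)
print(("Fido", "type", "Human") in entailed)  # True: Fido inferred a Human!
print(consistent(entailed))                   # True: still coherent

disjoint.add(("Dog", "Human"))
print(consistent(entailed))                   # False: now inconsistent
```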

In many cases the OWA is the most appropriate formalism to use, especially in a domain such as the biosciences, where knowledge (and consequently our databases) is frequently incomplete. However, we can’t afford to ignore the fact that the OWA contradicts many user expectations about information systems, and must be pragmatic and take care not to lead users astray.

BioPAX and the Open World Assumption

BioPAX is an RDF-based format for exchanging pathways. It is nominally an RDF/OWL-based standard, with an OWL ontology defining the various classes and properties that can be used in the RDF representation. However, by their own admission the authors of the format were not aware of OWL semantics, and of the OWA specifically, as explained in the appendix of the official Level 2 documentation, and further expanded on in a 2005 paper by Alan Ruttenberg, Jonathan Rees, and Joanne Luciano, Experience Using OWL DL for the Exchange of Biological Pathway Information, in particular the section “Ambushed by the Open World Assumption”. This gives particular examples of where the OWA makes things hard that should be easy, such as enumerating the members of a protein complex (we typically know all the members, but the BioPAX RDF representation doesn’t close the world).

BioPAX ontology together with RDF instances from EcoCyc. Triples for the reaction 2-iminopropanoate + H2O → pyruvate + ammonium are shown. The reaction has ‘left’ and ‘right’ properties for reactants such as H2O. These are intended to be exhaustive, but the lack of closure axioms means that we cannot rule out additional reactants for this reaction.

The Ortholog Conjecture and the Open World Assumption

gene duplication and speciation, taken from http://molecularevolutionforum.blogspot.com/2012/12/ortholog-conjecture-debated.html

In 2011, Nehrt et al. made the controversial claim that they had overturned the ortholog conjecture, i.e. they claimed that orthologs were less functionally similar than paralogs. This was in contrast to the received wisdom: if a gene duplicates within a species lineage (producing paralogs), there is redundancy, and one copy is less constrained and free to evolve a new function. Their analysis was based on semantic similarity of annotations in the Gene Ontology.

The paper stimulated a lot of discussion and follow-up studies and analyses. We in the Gene Ontology Consortium published a short response, “On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report”. In this we pointed out that the analysis assumed the CWA (absence of a function assignment means the gene does not have that function), whereas GO annotations should be interpreted under the OWA (we have an explicit way of asserting that a gene does not have a function, rather than relying on absence). Due to bias in GO annotations, paralogs may artificially have higher functional similarity scores, rendering the original analysis insufficient to reject the ortholog conjecture.
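A toy example (with invented GO-style term names) shows how annotation incompleteness interacts with a CWA reading when computing semantic similarity, here with a simple Jaccard index rather than any of the more sophisticated measures actually used:

```python
# Jaccard similarity over sets of annotated terms.
def jaccard(a, b):
    return len(a & b) / len(a | b)

# Suppose two orthologs have identical "true" functions:
true_fn = {"GO:kinase", "GO:apoptosis", "GO:signaling"}

human = set(true_fn)        # well-studied gene, fully annotated
ortholog = {"GO:kinase"}    # under-studied ortholog, partially curated

# Under the CWA, missing annotations read as "does not have this
# function", so the orthologs appear functionally dissimilar:
print(round(jaccard(human, ortholog), 2))     # 0.33, not 1.0

# Two well-annotated paralogs with a real functional difference can
# score *higher*, purely because their annotations are more complete:
paralog1 = {"GO:kinase", "GO:apoptosis", "GO:signaling"}
paralog2 = {"GO:kinase", "GO:apoptosis", "GO:transport"}
print(jaccard(paralog1, paralog2))            # 0.5
```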

The OWA in GO annotations is also addressed in the GO Handbook in the chapter Pitfalls, Biases, and Remedies by Pascale Gaudet. This chapter also makes the point that OWA can be considered in the context of annotation bias. For example, not all genes are annotated at the same level of specificity. The genes that are annotated reflect biases in experiments and publication, as well as what is selected to be curated.

Open World Assumption Considered (Sometimes) Harmful

The OWA is harmful where it grossly misaligns with user expectations.

While a base assumption of OWA is usually required with any curated information, it is also helpful to think in terms of an overriding implicit contract between any information provider and information consumer: any (good) information provider attempts to provide as complete information as is possible, given resource constraints.

My squid has no tentacles

Let’s take an example. If I provide what I purport to be an anatomical ontology for the squid, then it behooves me to make sure the main body parts found in a squid are present.

Chiroteuthis veranii, the long armed squid, By Ernst Haeckel, with two elongated tentacles.

Let’s say my ontology contains classes for some squid body parts, such as the eye and brain, yet lacks classes for others, such as the tentacle. A user may be surprised and disappointed when they search for tentacle and come back empty-handed (or empty-tentacled, if they are a squid user). If this user were to tell me that my ontology sucked, I would be perfectly within my logical rights to retort: “sir, this ontology is in OWL and thus follows the Open World Assumption; as such, the absence of a tentacle class in my squid ontology does not entail that squids lack tentacles, for such a claim would be ridiculous. Please refer to this dense interlinked set of documents published by the W3C that requires a PhD in logic to understand, and cease making unwarranted assumptions about my ontology”.

Yet really the user is correct here. There should be an assumption of reasonable coverage, and I have violated that assumption. The tentacle is a major body part; it’s not as if I have omitted some obscure neuroanatomical region. Is there a hard and fast dividing line here? No, of course not. But there are basic common-sense principles that should be adhered to, and if they cannot be, the omissions and biases should be clearly documented in the ontology to avoid confusing users.

This example is hypothetical, but I have seen many cases where biases and omissions in ontologies confusingly lead users to infer absence where the inference is unwarranted.

Hydroxychloroquine and COVID-19

The Coronavirus Infectious Disease Ontology (CIDO) integrates a number of different ontologies and includes axioms connecting terms or entities using different object properties. An example is the ‘treatment-for’ edge, which connects diseases to treatments. Initially the ontology contained only a single treatment axiom, between COVID-19 and Hydroxychloroquine (HCQ). Under the OWA, this is perfectly valid: COVID-19 has been treated with HCQ (there is no implication about whether the treatment is successful or not). However, the inclusion of a single edge of this type is at best confusing. A user could be led to believe there was something special about HCQ compared to other treatments, and that the ontology developers had deliberately omitted the others. In fact, initial evidence for HCQ as a successful treatment has not panned out (despite what some prominent adherents may say). There are many other treatments, many in different clinical trial phases, some of which may prove more effective, yet assertions about these are lacking in CIDO. In this particular case, even though the OWA allows us to legitimately omit information, from a common-sense perspective less is more: it is better to include no information about treatments at all than confusingly sparse information. Luckily the CIDO developers have rapidly addressed this situation.

Ragged Lattices, including species-specific classes

An under-appreciated problem is the confusion ragged ontology lattices can cause users. This can be seen as a mismatch between localized CWA expectations on the part of the user and OWA on the part of the ontology provider. But first an explanation of what I mean by ragged lattice:

Many ontologies are compositional in nature. In a previous post we discussed how the Rector Normalization pattern could be used to automate classification. The resulting multi-parent classification forms a lattice. I have also written about how we should embrace multiple inheritance. One caveat to both of these pieces is that we should be aware of the confusion that can be caused by inconsistently populated (‘ragged’) lattices.

Take for example cell types, which can be classified along a number of orthogonal axes, many intrinsic to the cell itself: its morphological properties, its lineage, its function, or the gene products it expresses. The example below shows the leukocyte hierarchy in CL, largely based on intrinsic properties:

Protege screenshot of the cell ontology, leukocyte hierarchy

Another way to classify cells is by anatomical location. In CL we have a class ‘kidney cell’ which is logically defined as ‘any cell that is part of a kidney’. This branch of CL recapitulates anatomy at the upper levels.

kidney cell hierarchy in CL, recapitulating anatomical classification

So far, perfectly coherent. However, the resulting structure can be confusing to someone not used to thinking in OWL and the OWA. I have seen many instances where a user will go to a branch of CL such as ‘kidney cell‘ and start looking for a class such as ‘mast cell‘. It’s perfectly reasonable for them to look here, as mast cells are found in most organs. However, CL does not place ‘mast cell’ as a subclass of ‘kidney cell’, as this would entail that all mast cells are found in the kidney. And CL has not populated the cross-product of all the main immune cell types with the anatomical structures in which they can be found. The fleshing out of the lattice is inconsistent, leading to confusion through violation of an assumed contract (a class “kidney cell” is provided, but the set of cell types underneath it is incomplete).

This is even more apparent if we introduce different axes of classification, such as the organism taxon in which the cell type is found, e.g. “mouse lymphocyte”, “human lymphocyte”:

inferred hierarchy when we add classes following a taxon design pattern, e.g. mouse cell, mouse lymphocyte. Only a small set of classes in the ontology are mouse specific.

Above is a screenshot of what happens when we introduce classes such as ‘mouse cell’ or ‘mouse lymphocyte’. We see very few classes underneath. Many people indoctrinated/experienced with OWL will not have a problem with this; they understand that these groupings are just for mouse-specific classes, that the OWA holds, and that the absence of a particular compositional class, e.g. “mouse neuron”, does not entail that mice do not have neurons.

One ontology in which the taxon pattern does work is the protein ontology, which includes groupings like “mouse protein”. PRO includes all known mouse proteins under this, so the classification is not ragged in the same way as the examples above.

There is no perfect solution here. Enforcing single inheritance does not work. Compositional class groupings are useful. However, ontology developers should try to avoid ragged lattices, and where possible populate lattices consistently. We need better tools here, e.g. ways to quantitatively measure the raggedness of our ontologies.
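As a sketch of what such a tool might measure, here is a crude, hypothetical raggedness/coverage metric: for a pair of classification axes (e.g. cell type × anatomical location), what fraction of the cross-product of compositional classes is actually present in the ontology? All names below are invented for illustration.

```python
from itertools import product

# Two axes of classification (toy examples).
cell_types = ["lymphocyte", "mast cell", "neuron"]
locations = ["kidney", "liver", "brain"]

# Compositional classes actually present in a hypothetical ontology.
present = {("lymphocyte", "kidney"), ("mast cell", "liver")}

def coverage(axis1, axis2, present):
    # Fraction of the cross-product grid that is populated;
    # low values suggest a ragged lattice.
    grid = set(product(axis1, axis2))
    return len(present & grid) / len(grid)

cov = coverage(cell_types, locations, present)
print(f"{cov:.2f}")  # 0.22 -- only 2 of 9 cells of the grid populated
```

A real measure would of course need to account for combinations that are biologically impossible or deliberately out of scope, rather than treating every grid cell as expected.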

Ontologies and databases should better document biases and assumptions

As providers of information, we need to do a better job of making all assumptions explicit and well-documented. This applies particularly to any curated corpus of knowledge, but in particular to ontologies. Even though hiding behind the OWA is logically defensible, we need to make things more intuitive for users.

It’s not uncommon for an ontology to have excellent, very complete coverage of one aspect of the domain, and to be highly incomplete in another (reflecting the biases and interests either of the developers or of the broader field). In fact, I have been guilty of this in ontologies I have built or contributed to. I have seen users become confused when a class they expected to find was not present, when they have been perplexed by the ragged lattice problem, or when an edge they expected to find was missing.

Few knowledge bases can ever be complete, but we can do better at documenting known unknowns or incompletenesses. We can imagine a range of formal, computable ways of doing this, but a good start would be some simple standard annotation properties that can be used as inline documentation in the ontology. Branches of the ontology could be tagged in this way, e.g. to indicate that ‘kidney cell’ doesn’t include all cells found in the kidney, only kidney-specific ones; or that developmental processes in GO are biased towards human and model organisms. This system could also be used for Knowledge Graphs and annotation databases, to indicate that particular genes may be under-studied or under-annotated, an extension of the ND evidence type used in GO.

In addition, we could do a better job of providing consistent levels of coverage of annotations or classes. There are tradeoffs here, as we naturally do not want to omit anything, but we can do a better job of balancing things out. Better tools are required for detecting imbalances and helping populate information in a more balanced, consistent fashion. Some of these may already exist and I’m not aware of them – please respond in the comments if you are aware of any!