Edge properties, part 1: Reification

One of the ways in which RDF differs from Labeled Property Graph (LPG) models such as the data model in Neo4J is that there is no first-class mechanism for making statements about statements. For example, given a triple :P1 :interacts_with :P2, how do we say that the triple is supported by a particular publication?

With an LPG, an edge can have properties associated with it in addition to the main edge label. In Neo4J documentation, this is often depicted as tag-values underneath the edge label. So if the assertion that P1 interacts with P2 is supported by a publication such as PMID:123 we might write this as:

(Note that some data models such as Neo4J don’t directly support hypergraphs, and if we wanted to represent pmid:123 as a distinct node with its own properties, then the association between the edge property and the node would be implicit rather than explicit.)

In RDF, properties cannot be directly associated with edges. How would we represent something like the above in RDF? In fact there are multiple ways of modeling this.

A common approach is reification. Here we would create an extra node representing the statement, associate this with the original triple via three new triples, and then the statement node can be described as any other node. E.g.

This can be depicted visually as follows (note that while the first triple directly connecting P1 and P2 may seem redundant, it is not formally entailed by RDF semantics and should also be stated):

This is obviously quite verbose, so there are different visual conventions and syntactic shortcuts to reduce bloat.

RDF* provides a more convenient compact syntax for writing edge properties:

  • <<:P1 :interacts_with :P2>>  :supported_by :pmid123 .

Here the <<…>> can be seen as acting as syntactic sugar, with the above single line of RDF* expanding to the 6 triples above.
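To make the expansion concrete, here is a minimal Python sketch that builds the six triples from a single edge with one property. Triples are plain tuples rather than objects from an RDF library, and the helper name `reify` and the blank node label `_:s1` are assumptions made for this illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def reify(s: str, p: str, o: str, prop: str, value: str,
          stmt: str = "_:s1") -> List[Triple]:
    """Expand one annotated edge into standard RDF reification triples."""
    return [
        # The direct triple: not entailed by the reification, so stated too.
        (s, p, o),
        # The statement node, typed and linked back to the original triple.
        (stmt, "rdf:type", "rdf:Statement"),
        (stmt, "rdf:subject", s),
        (stmt, "rdf:predicate", p),
        (stmt, "rdf:object", o),
        # The edge property, attached to the statement node.
        (stmt, prop, value),
    ]


triples = reify(":P1", ":interacts_with", ":P2", ":supported_by", ":pmid123")
for t in triples:
    print(" ".join(t), ".")
```

Each additional edge property would add just one more triple on the statement node, which is why the verbosity is mostly a fixed overhead per annotated edge.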

RDF* is not yet a W3C standard, but a large number of tools support it. It is accompanied by SPARQL* for queries.

There is a lot more to be said about the topic of edge properties in LPGs and RDF, and I will try to cover it in future posts. This includes:

  • Alternatives to RDF reification, of which there are many
    • Named Graphs, which capitalize on the fact that triplestores are actually quad stores, and use the graph with which a triple is associated as a site of attachment for edge properties.
    • The Singleton Property Pattern (SPP). This has some adherents, but is not compatible with OWL-DL modeling
    • Alternative Reification Vocabularies. This includes the OWL reification vocabulary. It’s immensely depressing and confusing and under-appreciated that OWL did not adopt the RDF reification vocabulary, and the OWL stack fails horribly when we try and use the two together. Additionally OWL reification comes with annoying limitations (see my answer on stack overflow about RDF vs OWL reification).
    • RDF* can be seen as an alternative or it can be seen as syntactic sugar and/or a layer of abstraction over existing RDF reification
    • various other design patterns such as those in https://www.w3.org/TR/swbp-n-aryRelations/
  • Semantics of reification. RDF has monotonic semantics. This means that adding new triples (including reification triples) cannot retract the meaning of any existing triples (including the reified triples). So broadly speaking, it’s fine to annotate a triple with metadata (e.g. who said it), but not with something that alters its meaning (e.g. a negation qualifier, or probabilistic semantics). This has implications on how we represent knowledge graphs in RDF, and on proposals for simpler OWL layering on RDF. It also has implications for inference with KGs, both classic deductive boolean inference as well as modern KG embedding and associated ML approaches (e.g. node2vec, embiggen).
  • Alternate syntaxes and tooling that is compatible with RDF and employs higher level abstractions above the verbose bloated reification syntax/model above. This includes RDF*/SPARQL* as well as KGX.

The Open World Assumption Considered Harmful


A frequent source of confusion with ontologies and more generally with any kind of information system is the Open World Assumption. This trips up novice users, but as I will argue in this post, information providers could do much more to help these users. But first an explanation of the terms:

With the Open World Assumption (OWA) we do not make any assumptions based on the absence of statements. In contrast, with the Closed World Assumption (CWA), if something is not explicitly stated to be true, it is assumed to be false. As an example, consider a pet-owner database with the following facts:

Fred type Human .
Farrah type Human .

Foofoo type Cat .
Fido type Dog .

Fred owns Foofoo .
Farrah owns Fido .

Depicted as:

depiction of pet owners RDF graph, with triples indicated by arrows. RDF follows the OWA: the lack of a triple between Fred and Fido does not entail that Fred doesn’t own Fido.

Under the CWA, the answer to the question “how many cats does Fred own” is 1. Similarly, for “how many owners does Fido have” the answer is also 1.

RDF and OWL are built on the OWA, where the answer to both questions is: at least 1. We can’t rule out that Fred also owns Fido, or that he owns animals not known to the database. With the OWA, we can answer the question “does Fred own Foofoo” decisively with a “yes”, but if we ask “does Fred own Fido” the answer is “we don’t know”. It’s not asserted or entailed in the database, and neither is the negation.
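The difference between the two assumptions can be sketched in a few lines of Python over the pet-owner facts. This is a toy in-memory lookup, not a reasoner; the function names are assumptions made for the example, and `None` is used to stand for “we don’t know”.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

FACTS: Set[Triple] = {
    ("Fred", "type", "Human"),
    ("Farrah", "type", "Human"),
    ("Foofoo", "type", "Cat"),
    ("Fido", "type", "Dog"),
    ("Fred", "owns", "Foofoo"),
    ("Farrah", "owns", "Fido"),
}


def ask_cwa(triple: Triple) -> bool:
    """Closed world: anything not stated is taken to be false."""
    return triple in FACTS


def ask_owa(triple: Triple) -> Optional[bool]:
    """Open world: True if stated, otherwise None (unknown).

    This toy store holds no negative assertions, so False is never returned.
    """
    return True if triple in FACTS else None


def count_owned(owner: str) -> int:
    """Exact under the CWA; only a lower bound ("at least") under the OWA."""
    return sum(1 for s, p, _ in FACTS if s == owner and p == "owns")


print(ask_cwa(("Fred", "owns", "Fido")))    # False
print(ask_owa(("Fred", "owns", "Fido")))    # None
print(ask_owa(("Fred", "owns", "Foofoo")))  # True
print(count_owned("Fred"))                  # 1
```

The same count of 1 comes back in both cases; what changes is whether it is read as “exactly 1” or “at least 1”.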

Ontology formalisms such as OWL are explicitly built on the OWA, whereas traditional database systems have features constructed on the CWA.

OWL gives you mechanisms to add closure axioms, which allow you to be precise about what is known not to be true, in addition to what is known to be true. For example, we can state that Fred does not own Fido, which closes the world a little. We can also state that Fred only owns Cats, which closes the world further, but still does not rule out that Fred owns cats other than Foofoo. We can also use an OWL Enumeration construct to exhaustively list the animals Fred does own, which finally allows us to answer the question “how many animals does Fred own” with a specific number.

OWL ontologies and databases (aka ABoxes) often lack sufficient closure axioms in order to answer questions involving negation or counts. Sometimes this is simply because it’s a lot of work to add these additional axioms, work that doesn’t always have a payoff given typical OWL use cases. But often it is because of a mismatch between what the database/ontology author thinks they are saying, and what they are actually saying under the OWA. This kind of mismatched intent is quite common among OWL ontology developers.

Another common trap is reading OWL axioms such as Domain and Range as Closed World constraints, as they might be applied in a traditional database or a CWA object-oriented formalism such as UML.

Consider the following database plus ontology in OWL, where we attempt to constrain the ‘owns’ property only to humans:

owns Domain Human
Fido type Dog
Fred type Human
Fido owns Fred

We might expect this to yield some kind of error. Clearly, using our own knowledge of the world, something is amiss here (likely the direction of the final triple has been accidentally inverted). But if we feed this into an OWL reasoner to check for incoherencies (see previous posts on this topic), it will report everything as consistent. However, if we examine the inferences closely, we will see that it has inferred Fido to be both a Dog and a Human. It is only after we have stated explicit axioms that assert or entail that Dog and Human are disjoint that we will see an inconsistency:

OWL reasoner entailing Fido is both a Dog and a Human, with the latter entailed by the Domain axiom. Note the ontology is still coherent, and only becomes incoherent when we add a disjointness axiom
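The behaviour can be imitated with a small Python sketch. This is not an OWL reasoner: it hard-codes just two rules (domain axioms infer a type for the subject; an individual in two disjoint classes is inconsistent), and the function names are assumptions made for this illustration.

```python
from typing import Dict, List, Set, Tuple

DOMAIN: Dict[str, str] = {"owns": "Human"}                    # owns Domain Human
TYPES: Dict[str, Set[str]] = {"Fido": {"Dog"}, "Fred": {"Human"}}
EDGES: List[Tuple[str, str, str]] = [("Fido", "owns", "Fred")]  # probably inverted


def infer_types(types, edges, domain):
    """Apply domain axioms: the subject of p is inferred to be in p's domain."""
    inferred = {ind: set(classes) for ind, classes in types.items()}
    for s, p, _ in edges:
        if p in domain:
            inferred.setdefault(s, set()).add(domain[p])
    return inferred


def inconsistent(inferred, disjoint_pairs):
    """Return individuals that belong to two classes declared disjoint."""
    return sorted(
        ind for ind, classes in inferred.items()
        if any(a in classes and b in classes for a, b in disjoint_pairs)
    )


inferred = infer_types(TYPES, EDGES, DOMAIN)
print(sorted(inferred["Fido"]))                     # ['Dog', 'Human']
print(inconsistent(inferred, []))                   # []
print(inconsistent(inferred, [("Dog", "Human")]))   # ['Fido']
```

Without the disjointness pair nothing is flagged: the Domain axiom acts as an inference rule, not as a constraint that rejects data.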

In many cases the OWA is the most appropriate formalism to use, especially in a domain such as the biosciences, where knowledge (and consequently our databases) is frequently incomplete. However, we can’t afford to ignore the fact that the OWA contradicts many user expectations about information systems, and must be pragmatic and take care not to lead users astray.

BioPAX and the Open World Assumption

BioPAX is an RDF-based format for exchanging pathways. It is supposedly an RDF/OWL-based standard, with an OWL ontology defining the various classes and properties that can be used in the RDF representation. However, by their own admission the authors of the format were not aware of OWL semantics, and the OWA specifically, as explained in the official docs in the level 2 doc appendix, and also further expanded on in a paper from 2005 by Alan Ruttenberg, Jonathan Rees, and Joanne Luciano, Experience Using OWL DL for the Exchange of Biological Pathway Information, in particular the section “Ambushed by the Open World Assumption“. This gives particular examples of where the OWA makes things hard that should be easy, such as enumerating the members of a protein complex (we typically know all the members, but the BioPAX RDF representation doesn’t close the world).

BioPAX ontology together with RDF instances from EcoCyc. Triples for the reaction 2-iminopropanoate + H2O → pyruvate + ammonium are shown. The reaction has ‘left’ and ‘right’ properties for reactants such as H2O. These are intended to be exhaustive but the lack of closure axioms means that we cannot rule out additional reactants for this reaction.

The Ortholog Conjecture and the Open World Assumption

gene duplication and speciation, taken from http://molecularevolutionforum.blogspot.com/2012/12/ortholog-conjecture-debated.html

In 2011 Nehrt et al made the controversial claim that they had overturned the ortholog conjecture, i.e. they claimed that orthologs were less functionally similar than paralogs. This was in contrast to the received wisdom, i.e. if a gene duplicates within a species lineage (paralogs) there is redundancy and one copy is less constrained to evolve a new function. Their analysis was based on semantic similarity of annotations in the Gene Ontology.

The paper stimulated a lot of discussion and follow-up studies and analyses. We in the Gene Ontology Consortium published a short response, “On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report“. In this we pointed out that the analysis assumed the CWA (absence of function assignment means the gene does not have that function), whereas GO annotations should be interpreted under the OWA (we have an explicit way of assigning that a gene does not have a function, rather than relying on absence). Due to bias in GO annotations, paralogs may artificially have higher functional similarity scores, rendering the original analysis insufficient to reject the ortholog conjecture.

The OWA in GO annotations is also addressed in the GO Handbook in the chapter Pitfalls, Biases, and Remedies by Pascale Gaudet. This chapter also makes the point that OWA can be considered in the context of annotation bias. For example, not all genes are annotated at the same level of specificity. The genes that are annotated reflect biases in experiments and publication, as well as what is selected to be curated.

Open World Assumption Considered (Sometimes) Harmful

The OWA is harmful where it grossly misaligns with user expectations.

While a base assumption of OWA is usually required with any curated information, it is also helpful to think in terms of an overriding implicit contract between any information provider and information consumer: any (good) information provider attempts to provide as complete information as is possible, given resource constraints.

My squid has no tentacles

Let’s take an example: If I am providing an ontology I purport to be an anatomical ontology for squid, then it behooves me to make sure the main body parts found in a squid are present.

Chiroteuthis veranii, the long armed squid, By Ernst Haeckel, with two elongated tentacles.

Let’s say my ontology contains classes for some squid body parts such as eye, brain, yet lacks classes for others such as the tentacle. A user may be surprised and disappointed when they search for tentacle and come back empty-handed (or empty-tentacled, if they are a squid user). If this user were to tell me that my ontology sucked, I would be perfectly within my logical rights to retort: “sir, this ontology is in OWL and thus follows the Open World Assumption; as such the absence of a tentacle class in my squid ontology does not entail that squids lack tentacles, for such a claim would be ridiculous. Please refer to this dense interlinked set of documents published by the W3C that requires a PhD in logic to understand and cease from making unwarranted assumptions about my ontology“.

Yet really the user is correct here. There should be an assumption of reasonable coverage, and I have violated that assumption. The tentacle is a major body part, it’s not like I have omitted some obscure neuroanatomical region. Is there a hard and fast dividing line here? No, of course not. But there are basic common sense principles that should be adhered to, and if they cannot be adhered to, omissions and biases should be clearly documented in the ontology to avoid confusing users.

This hypothetical example is made up, but I have seen many cases where biases and omissions in ontologies confusingly lead the user to infer absence where the inference is unwarranted.

Hydroxychloroquine and COVID-19

The Coronavirus Infectious Disease Ontology (CIDO) integrates a number of different ontologies and includes axioms connecting terms or entities using different object properties. An example is the ‘treatment-for’ edge which connects diseases to treatments. Initially the ontology only contained a single treatment axiom, between COVID-19 and Hydroxychloroquine (HCQ). Under the OWA, this is perfectly valid: COVID-19 has been treated with HCQ (there is no implication about whether treatment is successful or not). However, the inclusion of a single edge of this type is at best confusing. A user could be led to believe there was something special about HCQ compared to other treatments, and the ontology developers had deliberately omitted these. In fact initial evidence for HCQ as a successful treatment has not panned out (despite what some prominent adherents may say). There are many other treatments, many of which are in different clinical trial phases, many of which may prove more effective, yet assertions about these are lacking in CIDO. In this particular case, even though the OWA allows us to legitimately omit information, from a common sense perspective, less is more here: it is better to include no information about treatments at all rather than confusingly sparse information. Luckily the CIDO developers have rapidly addressed this situation.

Ragged Lattices, including species-specific classes

An under-appreciated problem is the confusion ragged ontology lattices can cause users. This can be seen as a mismatch between localized CWA expectations on the part of the user and OWA on the part of the ontology provider. But first an explanation of what I mean by ragged lattice:

Many ontologies are compositional in nature. In a previous post we discussed how the Rector Normalization pattern could be used to automate classification. The resulting multi-parent classification forms a lattice. I have also written about how we should embrace multiple inheritance. One caveat to both of these pieces is that we should be aware of the confusion that can be caused by inconsistently populated (‘ragged’) lattices.

Take for example cell types, which can be classified along a number of orthogonal axes, many intrinsic to the cell itself – its morphological properties, its lineage, its function, or gene products expressed. The example below shows the leukocyte hierarchy in CL, largely based on intrinsic properties:

Protege screenshot of the cell ontology, leukocyte hierarchy

Another way to classify cells is by anatomical location. In CL we have a class ‘kidney cell’ which is logically defined as ‘any cell that is part of a kidney’. This branch of CL recapitulates anatomy at the upper levels.

kidney cell hierarchy in CL, recapitulating anatomical classification

So far, perfectly coherent. However, the resulting structure can be confusing to someone not used to thinking in OWL and the OWA. I have seen many instances where a user will go to a branch of CL such as ‘kidney cell‘ and start looking for a class such as ‘mast cell‘. It’s perfectly reasonable for them to look here, as mast cells are found in most organs. However, CL does not place ‘mast cell’ as a subclass of ‘kidney cell’ as this would entail that all mast cells are found in the kidney. And CL has not populated the cross-product of all the main immune cell types with the anatomical structures in which they can be found. The fleshing out of the lattice is inconsistent, leading to confusion caused by violation of an assumed contract (provision of a class “kidney cell” and incomplete cell types underneath).

This is even more apparent if we introduce different axes of classification, such as the organism taxon in which the cell type is found, e.g. “mouse lymphocyte”, “human lymphocyte”:

inferred hierarchy when we add classes following a taxon design pattern, e.g. mouse cell, mouse lymphocyte. Only a small set of classes in the ontology are mouse specific.

Above is a screenshot of what happens when we introduce classes such as ‘mouse cell’ or ‘mouse lymphocyte’. We see very few classes underneath. Many people indoctrinated/experienced with OWL will not have a problem with this: they understand that these groupings are just for mouse-specific classes, that the OWA holds, and that the absence of a particular compositional class, e.g. “mouse neuron”, does not entail that mice do not have neurons.

One ontology in which the taxon pattern does work is the protein ontology, which includes groupings like “mouse protein”. PRO includes all known mouse proteins under this, so the classification is not ragged in the same way as the examples above.

There is no perfect solution here. Enforcing single inheritance does not work. Compositional class groupings are useful. However, ontology developers should try and avoid ragged lattices, and where possible populate lattices consistently. We need better tools here, e.g. ways to quantitatively measure the raggedness of our ontologies.

Ontologies and databases should better document biases and assumptions

As providers of information, we need to do a better job of making all assumptions explicit and well-documented. This applies to any curated corpus of knowledge, but in particular to ontologies. Even though hiding behind the OWA is logically defensible, we need to make things more intuitive for users.

It’s not uncommon for an ontology to have excellent, very complete coverage of one aspect of the domain, and to be highly incomplete in another (reflecting either the biases/interests of the developers, or of the broader field). In fact I have been guilty of this in ontologies I have built or contributed to. I have seen users become confused when a class they expected to find was not present, or they have been perplexed by the ragged lattice problem, or an edge they expected to find was not present.

Few knowledge bases can ever be complete, but we can do better at documenting known unknowns or incompletenesses. We can imagine a range of formal computable ways of doing this, but a good start would be some simple standard annotation properties that can be used as inline documentation in the ontology. Branches of the ontology could be tagged in this way, e.g. to indicate that ‘kidney cell’ doesn’t include all cells found in the kidney, only kidney specific ones; or that developmental processes in GO are biased towards human and model organisms. This system could also be used for Knowledge Graphs and annotation databases too, to indicate that particular genes may be under-studied or under-annotated, an extension of the ND evidence type used in GO.

In addition we could do a better job at providing consistent levels of coverage of annotations or classes. There are tradeoffs here, as we naturally do not want to omit anything, but we can do a better job at balancing things out. Better tools are required here for detecting imbalances and helping populate information in a more balanced consistent fashion. Some of these may already exist and I’m not aware of them – please respond in the comments if you are aware of any!

A simple standard for sharing mappings

A common pain point for anyone working in bioinformatics is mapping identifiers. Many databases have overlapping content, or provide different data about the same entities, such as genes. Typically every database mints public identifiers in its own namespace. This means that the same gene may have multiple different identifiers in different databases (see for example some of the issues with SARS-CoV-2 protein identifiers). Anyone doing an analysis that combines data from different databases must do some kind of cross-walk or mapping.

Unfortunately mapping is fraught with problems. Simply finding the required mappings can be a challenge, and for any given pair of databases there may be different mappings from different providers. The provider may be the source databases, or a 3rd party provider such as BridgeDb. The meaning of a mapping may not be clear: does a mapping denote equivalence, or at least a 1:1 relationship? This is particularly challenging when trying to build a knowledge graph from multiple sources, where we want to merge information about the same entity.

Mappings are a big deal for ontologies too. There is an entire field of ontology alignment/matching. In theory ontologies should be able to make the meaning of mappings explicit, yet somehow we have messed this up, providing multiple alternate ways to say the same thing (OWL logical expressions, SKOS, and classic loose bioinformatics ‘dbxrefs’).

Within the Open Bio Ontologies project we attempted to avoid the mapping issue by promoting reuse of ontologies and concepts from ontologies — including reusing identifiers/URIs. Reuse is a standard concept in software engineering, and I’ve written before about (re)using software concepts in ontology engineering. However, not all ontologies are in OBO, and not all ontologies in OBO are perfectly modular and non-overlapping, so mapping remains a necessary evil.

Mappings between Ontologies visualized in the OLS OxO tool

Overall there are a multitude of headaches and challenges associated with mappings. One tractable chunk we have tried to break off recently is to come up with a standard exchange format for mappings. Currently mappings are distributed in ad-hoc formats, and there is no standard way of providing metadata about the mappings (who provided them, when, how, what is their quality, etc).

SSSOM: A Simple Shared Standard for Ontology Mapping

We have recently come up with a proposed standard for mappings. We came up with the name SSSOM. A few initial comments about the name:

  • It stands for Simple Shared Standard for Ontology Mapping. However, I believe it’s completely applicable to any named entity, whether modeled ontologically or not. You can use SSSOM for genes, proteins, people.
  • Yes, that is 3 leading ‘S’s. I have a colleague who calls it ‘slytherin’.

Details can be found in the SSSOM repo. I’ll provide a brief summary here.

The primary manifestation of SSSOM is as a TSV (tab-separated values) file. However, the data model is independent of the serialization format, and it can also be modeled in OWL, and any OWL serialization is possible (including JSON-LD).

SSSOM provides a standard way to describe a lot of rich information about mappings, but it allows you to be lazy and provide a minimum amount of information. The following example shows some mappings between the human phenotype ontology and the mouse phenotype ontology:

Example SSSOM table (source can be found at https://github.com/OBOFoundry/SSSOM/blob/master/examples/embedded/mp-hp-exact-0.0.1.tsv)

Note that in this case we are using SKOS as mapping predicates, but other predicates can be used.

Identifiers and mapping set metadata

SSSOM does require that all entities are written as CURIEs, with defined prefixes. The prefix expansions to URIs are written in the header in the same way you would for a format like RDF/Turtle.

For the above example, the source example TSV can be found here. Note the header:

#creator_id: "https://orcid.org/0000-0002-7356-1779"
#curie_map:
#  HP: "http://purl.obolibrary.org/obo/HP_"
#  MP: "http://purl.obolibrary.org/obo/MP_"
#  skos: "http://www.w3.org/2004/02/skos/core#"
#license: "https://creativecommons.org/publicdomain/zero/1.0/"
#mapping_provider: "http://purl.obolibrary.org/obo/upheno.owl"

Each header line is prefixed with a hash mark (#), and the header content is in YAML format. The curie_map tag provides expansions of prefixes to base URIs. It is recommended you use standard prefixes, such as the OBO prefixes (for ontologies) or the Biolink Model (which incorporates OBO, as well as other sources like identifiers.org via prefixcommons).
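As a rough sketch of how a consumer might use the header, the following Python reads a shortened copy of it and expands a CURIE to a full URI. To stay dependency-free it uses a tiny hand-written parser that only handles this one-level nesting; a real implementation would more likely strip the hash marks and hand the rest to a YAML library. The function names and the example identifier HP:0000118 are assumptions made for this illustration.

```python
from typing import Dict

HEADER = """\
#creator_id: "https://orcid.org/0000-0002-7356-1779"
#curie_map:
#  HP: "http://purl.obolibrary.org/obo/HP_"
#  MP: "http://purl.obolibrary.org/obo/MP_"
#license: "https://creativecommons.org/publicdomain/zero/1.0/"
"""


def parse_header(text: str) -> dict:
    """Parse '#'-prefixed header lines into a dict (one nesting level only)."""
    meta: dict = {}
    current = None  # key of the nested block we are inside, if any
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-hash line
        body = line[1:]
        if not body.strip():
            continue
        key, _, value = body.strip().partition(":")
        value = value.strip().strip('"')
        if body.startswith(" ") and current is not None:
            meta[current][key] = value       # indented: entry of nested block
        elif value:
            meta[key] = value                # plain key: value
            current = None
        else:
            meta[key] = {}                   # key with no value opens a block
            current = key
    return meta


def expand_curie(curie: str, curie_map: Dict[str, str]) -> str:
    """Replace the CURIE prefix with its base URI from the curie_map."""
    prefix, _, local_id = curie.partition(":")
    return curie_map[prefix] + local_id


meta = parse_header(HEADER)
print(expand_curie("HP:0000118", meta["curie_map"]))
# http://purl.obolibrary.org/obo/HP_0000118
```

An unknown prefix raises a KeyError here, which is a reasonable default given that SSSOM requires every prefix to be declared.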

The header also allows for many other pieces of metadata about the mapping set. Inclusion of an explicit license is encouraged – and I would recommend CC-0 for all mappings.

Mapping metadata

The complete set of elements that can be used to describe a mapping can be found in the relevant section of the spec. Some of these can be used for individual mappings, some can be applied in the header if they apply to all.

Some elements to call out:

  • Each mapping can have an associated confidence score, between zero and one. This can be useful for probabilistic OWL based ontology merging approaches, e.g. LogMap and kBOOM/Boomer.
  • Mappings and mapping sets can have provenance – e.g. orcid of the mapping_provider, as well as the date of the mapping
  • Information about how the match was made, such as the mapping tool, the type of match (e.g. automated lexical vs curated). We have developed a controlled vocabulary for this. For lexical matches, you can also indicate what property was matched (e.g. a match may be based on a shared oboInOwl:hasDbXref, a shared rdfs:label, or a shared label/synonym pair).
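As an illustration of how per-mapping metadata such as the confidence score might be consumed, here is a small Python sketch that reads an SSSOM-style TSV and keeps only mappings above a threshold. The column names (subject_id, predicate_id, object_id, confidence) are assumed to follow the spec, and the rows, the A:/B: prefixes and the function names are invented for this example.

```python
import csv
from typing import Dict, List

# Hypothetical mapping rows, for illustration only.
TSV = (
    "#license: \"https://creativecommons.org/publicdomain/zero/1.0/\"\n"
    "subject_id\tpredicate_id\tobject_id\tconfidence\n"
    "A:001\tskos:exactMatch\tB:101\t0.95\n"
    "A:002\tskos:closeMatch\tB:102\t0.60\n"
    "A:003\tskos:exactMatch\tB:103\t0.40\n"
)


def read_mappings(text: str) -> List[Dict[str, str]]:
    """Skip '#' header lines, then parse the remainder as a TSV table."""
    rows = [ln for ln in text.splitlines() if not ln.startswith("#")]
    return list(csv.DictReader(rows, delimiter="\t"))


def confident(mappings: List[Dict[str, str]], threshold: float) -> List[Dict[str, str]]:
    """Keep mappings whose confidence is at or above the threshold."""
    return [m for m in mappings if float(m["confidence"]) >= threshold]


mappings = read_mappings(TSV)
kept = confident(mappings, 0.5)
print([m["subject_id"] for m in kept])  # ['A:001', 'A:002']
```

A hard cut-off is only one possible use of the score; the probabilistic merging tools mentioned above would treat it as a weight rather than a filter.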

OWL serialization

See the docs on OWL.

Example of use outside ontologies

The metadata_converter project is intended to provide schema mappings between metadata schemes. The current focus is on standards used in environmental ‘omics’, of interest to the NMDC, such as MIxS, NEON, DarwinCore, and SESAR/IGSN.

The mappings between schema elements in SSSOM format can be found here.

Tooling

SSSOM is a new standard, and may change based on community feedback, so there is not much tooling yet.

We have an early version of a Python toolkit for working with SSSOM files:

https://sssom-py.readthedocs.io/en/latest/

Additionally, rdf_matcher generates SSSOM files (more on rdf_matcher in a future post). Boomer will be adapted to take SSSOM as an input for probabilistic axioms.

Feedback welcome!

We welcome comments, criticism, questions, requests for new metadata elements etc on our tracker.

For the current version of SSSOM we are indebted to the following people who crafted the spec:

  • Ernesto Jimenez-Ruiz (City, University of London)
  • John Graybeal (Stanford)
  • William Duncan (LBL)
  • David Osumi-Sutherland (EMBL-EBI)
  • Simon Jupp (SciBite)
  • James McLaughlin (EMBL-EBI)
  • Henriette Harmse (EMBL-EBI)

The person responsible for the vast majority of the work on SSSOM is Nicolas Matentzoglu, who crafted the spec, wrote the metadata ontology, served as community liaison and coordinated feedback.

What is the SARS-CoV-2 molecular parts list?

There is a lot we still have to learn about SARS-CoV-2 and the disease it causes in humans. One aspect of the virus that we do know a lot about is its underlying molecular blueprint. We have the core viral genome, and broadly speaking we know the ‘parts list’ of proteins that are translated and spliced from the genome. There is a lot that we still don’t know about the proteins themselves – how variations affect the ability of the virus to infect a host, which molecules bind to these proteins and how that binding impacts their function. But at least we know the basic parts list. Or we think we do. There is the Spike protein (S), which adorns the surface of this virus like a crown, hence the name ‘coronavirus’. There are the 16 ‘non-structural proteins’ formed by cleavage of a viral polyprotein; one such protein is nsp5 which functions as a protease that performs this same cleavage. And there are the accessory proteins, such as the mysterious ORF8. The genomic blueprint and the translated and cleaved products can be illustrated visually:

SARS-CoV-2 genome and protein products. The ORF1a/1ab polyprotein is cleaved into cleavage products (non-structural proteins; nsp1-16). Note that there are two overlapping polyproteins, 1a and 1ab; only 1ab is shown here for simplicity. Image taken from https://www.nytimes.com/interactive/2020/04/03/science/coronavirus-genome-bad-news-wrapped-in-protein.html

Each of these proteins has a variety of different names, for example, nsp3 is also known as PLpro. The genome is small enough that most scientists working on it have memorized the core aliases such that human-to-human communication is relatively unproblematic.

Of course, as we all know, relying on gene and protein symbols for unique identification in a database for machine-machine communication is a recipe for disaster. Symbols are inherently ambiguous, so we assign identifiers to entities in order to disambiguate them. These identifiers can then be adorned with metadata such as symbols, names, aliases, descriptions, functional descriptions and so on.

As everyone working in bioinformatics knows, different databases assign different identifiers for the same entity (by varying definitions of ‘same’), creating the ubiquitous identifier mapping problem and a cottage industry in mapping solutions.

This is a perennial problem for all omics entities such as genes and proteins, regardless of the organism or system being studied. But when it comes to SARS-CoV-2, things are considerably worse.

It turns out that many problems arise from the relatively simple biological phenomenon of cleavage of viral polyproteins. While the molecular biology is not so difficult (one parent protein as a source for many derivative proteins), many bioinformatics databases are not designed with this phenomenon in mind. This is fine for scenarios where we can afford to gloss over differences between the immediate products of translation and downstream cleavage products. While cleavage certainly happens in the human genome (e.g. POMC), it’s rare enough to effectively ignore in some contexts (although arguably this causes a lot of problems too). However, the phenomenon of a single translation product producing multiple functionally discrete units is much more common in viruses, which creates issues for many databases when creating a useful ‘canonical parts list’.

The roll-up problem

The first problem is that many databases either ignore the cleavage products or don’t assign them identifiers in the same category as other proteins. This has the effect of ‘rolling up’ all data to the polyprotein. This undercounts the number of true proteins, and does not provision identifiers for distinct functional entities.

For example, NCBI Gene does a fantastic job of assembling the genetic parts lists for genes across all cellular organisms and viruses. Most of the time, the gene is an appropriate unit of analysis, and we can use gene identifiers as proxies for the product transcribed and translated from that gene. In the case of SARS-CoV-2, NCBI mints a gene ID for the polyprotein (e.g. 1ab), but lacks distinct gene IDs for individual cleavage products, even though each arguably fulfills the definition of a discrete gene, and each is a discrete non-overlapping unit with a distinct function. Referring to the figure above, nsp1-10 are all ‘rolled up’ into the 1ab or 1a polyprotein entity.

Now this is perhaps understandable given that the NCBI Gene database is centered on genes (they do provide distinct protein IDs for the products, see later), and the case can be made that we should only have gene IDs for the immediate protein products (e.g. polyproteins, structural proteins, and accessory ORFs).

But the roll-up problem also exists for dedicated protein databases such as UniProt. UniProt mint IDs for polyproteins such as 1ab, but there is no UniProt accession for nsp1-16. These are ‘rolled up’ into the 1ab entry, as shown in the screenshot:

UniProt entry for viral polyprotein 1ab. Function summaries for distinct proteins (nsp1-3 shown, others below the fold) are rolled up to the polyprotein level

However, UniProt do provide identifiers for the various cleavage products. These are called ‘chain IDs’, and are of the form PRO_nnnnn. For example, an identifier for the nsp3 product is PRO_0000449621. Due to the structure of these IDs they are sometimes called ‘PRO IDs’ (however, they should not be confused with IDs from the Protein Ontology, which are also called ‘PRO IDs’. Confusing, huh?).

UniProt ‘chain’ IDs, with nsp3 highlighted. These do not get distinct accessions and do not get treated in the same way as a ‘full’ accessioned protein entry

Unfortunately these chain IDs are not quite first-class citizens in the protein database world. For example, the fantastic InterProScan pipeline is executed on the polyproteins, not the chain IDs. This means that domain and GO function calls are made at the level of the polyprotein, so it looks to a machine like there is one super-multifunctional protein that acts as a protease, binds ADP-ribose, induces autophagosomes, etc. In one sense this is sort of true, but I don’t think it’s a very useful way of looking at protein function. It is more meaningful to assign the functions at the level of the individual cleavage products. It is possible to propagate the InterProScan-assigned annotations down to the NSPs using the supplied coordinates, but it should not fall on consumers to do this extra processing step.
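To make the "extra processing step" concrete, here is a minimal Python sketch of coordinate-based propagation. Everything in it is an assumption for illustration: the chain names, the coordinates, and the domain hits are made-up placeholders (not real UniProt or InterProScan output), and the rule that a hit is assigned only to a chain that fully contains it is just one possible policy.

```python
from typing import NamedTuple

class Chain(NamedTuple):
    chain_id: str   # placeholder identifier for a cleavage product
    start: int      # 1-based start on the polyprotein (hypothetical)
    end: int        # inclusive end on the polyprotein (hypothetical)

class Hit(NamedTuple):
    signature: str  # placeholder label for a domain or function call
    start: int
    end: int

def propagate(chains, hits):
    """Assign each polyprotein-level hit to the chain that fully contains it."""
    assigned = {c.chain_id: [] for c in chains}
    for hit in hits:
        for chain in chains:
            if chain.start <= hit.start and hit.end <= chain.end:
                assigned[chain.chain_id].append(hit.signature)
    return assigned

# Hypothetical layout: two adjacent cleavage products on one polyprotein.
chains = [Chain("chainA", 1, 180), Chain("chainB", 181, 818)]
hits = [
    Hit("domain_x", 20, 120),   # inside chainA
    Hit("domain_y", 300, 450),  # inside chainB
    Hit("domain_z", 170, 200),  # straddles the cleavage site, so it is dropped
]
result = propagate(chains, hits)
```

A hit that straddles a cleavage site is silently dropped here; a real pipeline would need a deliberate policy for that case.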

The not-quite-first-class status of these identifiers also causes additional issues. For example, there are different ways to write the same ID (P0DTD1-PRO_0000449621 vs P0DTD1:PRO_0000449621 vs P0DTD1#PRO_0000449621 vs simply PRO_0000449621), and there is no standard URL (although UniProt is working on these issues).

The identical twin identifier problem

An additional major problem is the existence of two distinct identifiers for each of the initial non-structural proteins. Of course, we live with multiple identifiers in bioinformatics all the time, but we generally expect a single database to assign a single identifier for a single entity. Not so!

The problem here is the fact that there is a ribosomal frameshift in the translation of the polyprotein in SARS-CoV-2 (again, the biology here is fairly basic), which necessitates two distinct database entries, one for each polyprotein (1ab, aka P0DTD1, and 1a, aka P0DTC1). So far so good. However, while these are truly distinct polyproteins, the non-structural proteins cleaved from them are identical up until the frameshift. Because databases assume that each cleavage product must have exactly one ‘parent’, IDs are duplicated. This is shown in the following diagram:

Two polyproteins 1ab and 1a (these are shown as 1a and 1b here; in fact the 1ab pp covers both ORFs). Each of nsp1-10 gets two distinct IDs depending on the ‘parent’ despite sequence identity and, as far as we know, the same function. Diagram courtesy of ViralZone/SIB

While on the surface this may seem like a trivial problem with some easy workarounds, in fact this representation breaks a number of things. First, it artificially inflates the proteome, making it seem there are more proteins than there actually are. A parts list is less useful if it has to be post-processed in ad-hoc ways to get the ‘true’ parts list.

It can make it difficult when trying to promote the use of standard database identifiers over protein symbols because an arbitrary decision must be made, and if I make a different arbitrary decision from you, then our data does not automatically integrate. Ironically, using standard protein symbols like ‘nsp3’ may actually be better for database integration than identifiers designed for that purpose!

And when curating something like a protein interaction database, a pathway database, or an orthology database, or assembling a COVID Knowledge Graph that deals with pairwise interactions, we must either choose arbitrarily or fully populate the cross-product of all pair combos. E.g. if nsp3 in SARS is orthologous to nsp3 in SARS-CoV-2, then we have to make four statements instead of one.
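The arithmetic behind "four statements instead of one" is just a cross-product, as this small Python sketch shows. The two SARS-CoV-2 chain IDs are the nsp3 twins discussed in this post; the two SARS identifiers and the predicate name are hypothetical placeholders I made up for the example.

```python
from itertools import product

# The two sequence-identical nsp3 chain IDs in SARS-CoV-2.
sars2_nsp3 = ["P0DTD1-PRO_0000449621", "P0DTC1-PRO_0000449637"]

# Hypothetical placeholders standing in for the equivalent pair of twins in SARS.
sars_nsp3 = ["SARS_PP1AB-nsp3", "SARS_PP1A-nsp3"]

# One biological fact, but every twin must be paired with every twin.
statements = [
    (a, "orthologous_to", b)  # predicate name is illustrative only
    for a, b in product(sars2_nsp3, sars_nsp3)
]

for s in statements:
    print(s)
```

With n twins on each side the count grows as n × n, which is why picking a single reference ID per protein matters.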

While I focused on UniProt IDs here, other major resources such as NCBI also have these duplicates in their protein database for the same reason.

Kudos to Wikidata and the Protein Ontology

Two resources I have seen that get this right are the Protein Ontology and Wikidata.

The Protein Ontology (aka PR, sometimes known as PRO; NOT to be confused with ‘PRO’ entries in UniProt) includes a single first-class identifier/PURL for each nsp, for example nsp3 has CURIE PR:000050272 (http://purl.obolibrary.org/obo/PR_000050272). It has mappings to each of the two sequence-identical PRO chain IDs in UniProt. It also has distinct entries for the parent polyprotein, and it has meaningful ontologically encoded edges linking the two (SARS-CoV-2 protein ontology available from https://proconsortium.org/download/development/pro_sars2.obo)

Protein Ontology entry for SARS-CoV-2 nsp3, shown in obo format syntax (obo format is a human-readable concrete syntax for OWL).

Wikidata also does a good job of providing a single canonical identifier that is 1:1 with distinct proteins encoded by the SARS-CoV-2 genome (for example, the entry for nsp3 https://www.wikidata.org/wiki/Q87917581). However, it is not as complete. Sadly it does not have mappings to either the Protein Ontology or the UniProt PRO chain IDs (remember: these are different!).

The big disadvantage of Wikidata and the Protein Ontology compared to the big major sequence databases is that they are not the big major sequence databases. They suffer a curation lag (one employing crowdsourcing, the other manual curation), whereas the main protein databases automate more, albeit at the expense of quirks such as non-first-class IDs and duplicate IDs. Depending on the use case, this may not be a problem. Due to the importance of the SARS-CoV-2 proteome, sufficient resources could be marshalled for this effort. But will this scale if we want unique non-duplicate IDs for all proteins in all coronaviruses – including all the different ones infecting bats and other hosts?

A compromise solution

When building KG-COVID-19 we needed to decide which IDs to use as canonical for SARS-CoV-2 genes and proteins. While our system is capable of working with alternate IDs (either normalizing during the KG build stage, or post build as part of a clique-merge step), it is best to avoid these. Mapping IDs can lead to either unintentional roll-ups (information about the cleavage product propagating up to the polyprotein) or worse, fanning-out (rolled up information then spreading to ‘sibling’ proteins); or if 1:1 is enforced the overall system is fragile.

We liked the curation work done by the Protein Ontology, but we knew (1) we needed a system from which we could instantly get IDs for proteins in any other viral genome, and (2) we wanted to be aligned with sources we were ingesting, such as the IntAct curation of the Gordon et al paper, and Reactome plus GO-CAM curation of viral-host pathways. This necessitated the choice of a major database.

Working with the very helpful UniProt team in concert with IntAct curators we were able to ascertain that of the duplicate entries, by convention we should take the ID that comes from the longer polyprotein as the ‘reference’. For example, nsp3 has the following chain IDs:

  • P0DTC1-PRO_0000449637 (from the shorter pp: 1a) [NO]
  • P0DTD1-PRO_0000449621 (from the longer pp: 1ab) [YES]

(Remember, these are sequence-identical and as far as we know functionally identical).

In this case, we take PRO_0000449621 as the canonical/reference entry. This is also the entry IntAct use to curate interactions. We pretend that PRO_0000449637 does not exist.
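As a rough sketch, the convention can be written down in a few lines of Python. The two chain IDs are the real nsp3 twins listed above; the polyprotein lengths are round-number placeholders I chose for illustration (only their relative order matters), and the function names are hypothetical.

```python
def pick_reference(twins):
    """Return the chain ID whose parent polyprotein is longest.

    twins: list of (chain_id, parent_name, parent_length) tuples.
    """
    return max(twins, key=lambda t: t[2])[0]

def alias_map(twins):
    """Map every non-reference twin onto the reference ID."""
    ref = pick_reference(twins)
    return {chain_id: ref for chain_id, _, _ in twins if chain_id != ref}

# Lengths are illustrative placeholders, not values taken from UniProt.
nsp3_twins = [
    ("P0DTC1-PRO_0000449637", "1a", 4000),
    ("P0DTD1-PRO_0000449621", "1ab", 7000),
]

reference = pick_reference(nsp3_twins)
aliases = alias_map(nsp3_twins)
```

The alias map is only needed if a source insists on using the non-reference twin; ideally it stays empty in practice.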

This is very far from perfect. Biologically speaking, it’s actually the shorter pp that is more commonly expressed, so the choice of the longer one is potentially confusing. There is also the question of how UniProt should propagate annotations. It is valid to propagate from one chain ID to its ‘identical twin’. But what about when these annotations reference other cleavage products (e.g. a pairwise functional annotation from a GO-CAM, or an ortholog)? Do we populate the cross-product? This could get confusing (my interest in this was both from the point of view of our COVID KG, and also wearing my GO hat).

Nevertheless this was the best compromise we could find, and we decided to follow this convention.

Some of the decisions are recorded in this presentation

Working with the UniProt and IntAct teams we also came up with a standard way to write IDs and PURLs for the chain IDs (CURIEs are of the form UniProtKB:ACCESSION-PRO_NNNNNNN). While this is not the most thrilling or groundbreaking development in the battle against coronaviruses, it excites me as it means we have to do far less time consuming and error prone identifier post-processing just to make data link up.
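To illustrate the kind of post-processing this convention saves, here is a minimal Python sketch that normalizes the variant spellings mentioned earlier into the UniProtKB:ACCESSION-PRO_NNNNNNN form. The regular expression and function name are my own assumptions, not part of any UniProt or KG-COVID-19 tooling, and the sketch deliberately refuses a bare PRO_ ID because the parent accession cannot be recovered from it.

```python
import re

# Accepts ACCESSION-PRO_n, ACCESSION:PRO_n, ACCESSION#PRO_n,
# each with or without a leading "UniProtKB:" prefix.
_CHAIN_ID = re.compile(
    r"^(?:UniProtKB:)?(?P<acc>[A-Z0-9]+)[-:#](?P<chain>PRO_\d+)$"
)

def normalize_chain_id(raw: str) -> str:
    """Rewrite a chain ID into the CURIE form UniProtKB:ACCESSION-PRO_NNNNNNN."""
    match = _CHAIN_ID.match(raw.strip())
    if match is None:
        # A bare PRO_ ID (no parent accession) also ends up here.
        raise ValueError(f"cannot normalize chain ID: {raw!r}")
    return f"UniProtKB:{match['acc']}-{match['chain']}"

variants = [
    "P0DTD1-PRO_0000449621",
    "P0DTD1:PRO_0000449621",
    "P0DTD1#PRO_0000449621",
    "UniProtKB:P0DTD1-PRO_0000449621",
]
normalized = {normalize_chain_id(v) for v in variants}
```

All four spellings collapse to a single CURIE, which is the whole point of agreeing on one form.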

As part of the KG-COVID-19 project, Marcin Joachimiak coordinated the curation of a canonical UniProt-centric protein file (available in our GitHub repository), leveraging work that had been done by UniProt, the Protein Ontology curators, and the SciBite ontology team. We use UniProt IDs (either standard accessions, for polyproteins, structural proteins, and accessory ORFs; or chain IDs for NSPs). This file differs from the files obtained directly from UniProt, as we include only reference members of nsp ‘twins’, and we exclude less meaningful cleavage products.

This file lives in GitHub (we accept Pull Requests) and serves as one source for building our KG. The information is also available in KGX property graph format, or as RDF, or can be queried from our SPARQL endpoint.

We are also coordinating with different groups such as COVIDScholar to use this as a canonical vocabulary for text mining. Previously groups performing concept recognition on the CORD-19 corpus using protein databases as dictionaries missed the non-structural proteins, which is a major drawback.

Imagine a world

In an ideal world posts like this would never need to be written. There is no panacea; however, systems such as the Protein Ontology and Wikidata which employ an ontologically grounded flexible graph make it easier to work around legacy assumptions about relationships between types of molecular parts (see also the feature graph concept from Chado). The ontology-oriented basis makes it easier to render implicit assumptions explicit, and to encode things such as the relationship between molecular parts in a computable way. Also embracing OBO principles and reusing identifiers/PURLs rather than minting new ones for each database could go some way towards avoiding potential confusion and duplication of effort.

I know this is difficult to conceive of in the current landscape of bioinformatics databases, but paraphrasing John Lennon, I invite you to:

Imagine no proliferating identifiers
I wonder if you can
No need for mappings or normalization

A federation of ann(otations)
Imagine all the people (and machines) sharing all the data, you

You may say I’m a dreamer
But I’m not the only one
I hope some day you’ll join us
And the knowledge will be as one

Building a COVID-19 Knowledge Graph

With COVID-19 cases continuing to grow in number across the globe, scientists are forming new collaborations in order to better understand all aspects of SARS-CoV-2 together with its impact on human health. One aspect of this is organizing existing and emerging information about viral and host cell molecular biology, disease epidemiology, phenotypic progression, and effect of drugs and other treatments in individuals.

Knowledge Graphs (KGs) provide a way to organize complex heterogeneous information connecting different biological and clinical entities such as genes, drugs, diseases, exposures, phenotypes, and pathways.

For example, the following image shows a graphical (network) representation of SARS-CoV-2 proteins and host human proteins they are hypothesized to interact with, together with existing known human-human protein interactions, annotated with GO terms and drug target information:

SARS-CoV-2 host interaction map; taken from https://www.biorxiv.org/content/10.1101/2020.03.22.002386v1

Graphs such as this can be further extended with other information about the human and viral genes as it becomes available. Mechanisms such as endocytosis can also be included as nodes in the graph, as well as expression states of relevant human cells, etc.  Existing ontologies like GO, HPO, Mondo, and CHEBI, together with their annotations can be conceived of as KGs.

Portion of a KG formed from GO, Mondo, HPO, Genes, and their inter-relationships

These KGs can be used as data warehouses for querying data integrated in a single place. They can also be used as sources in Machine Learning, for tasks such as link prediction. For example: which compounds might be likely to treat a particular disease, based on properties of both the compound and the disease.

The KG-COVID-19 Knowledge Graph Hub

As part of a collaboration between the Monarch Initiative, the Illuminating the Druggable Genome KG project, and PheKnowLator, we have been collaboratively building a KG for COVID-19. All of the source is in GitHub, in the Knowledge-Graph-Hub/kg-covid-19 repository.

The project is built around the concept of a KG “Hub”, a lightweight way to build a KG from multiple upstream sources. Any developer can follow the instructions to ingest a new source, and make a Pull Request on the repo. So far we have a number of different sources ingested (detailed in the yaml file), and more on the way. The output is a simple biolink-model compliant KG in a simple TSV format that is compatible with Property Graphs (e.g. Neo4J) as well as RDF graphs. In all cases we use CURIEs that are equivalent to standard URIs, such as OBO Class PURLs.
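To give a feel for what a simple TSV-based KG looks like, here is a minimal Python sketch that writes one node table and one edge table. The column names and the biolink category and predicate values are my assumptions about a typical layout rather than a copy of the actual kg-covid-19 output, and the host protein ID is a made-up placeholder.

```python
import csv
import io

def to_tsv(rows, columns):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Node and edge rows keyed by CURIEs (host protein ID is a placeholder).
nodes = [
    {"id": "UniProtKB:P0DTD1-PRO_0000449621", "name": "nsp3",
     "category": "biolink:Protein"},
    {"id": "UniProtKB:HOST_PLACEHOLDER", "name": "some host protein",
     "category": "biolink:Protein"},
]
edges = [
    {"subject": "UniProtKB:P0DTD1-PRO_0000449621",
     "predicate": "biolink:interacts_with",
     "object": "UniProtKB:HOST_PLACEHOLDER"},
]

nodes_tsv = to_tsv(nodes, ["id", "name", "category"])
edges_tsv = to_tsv(edges, ["subject", "predicate", "object"])
print(nodes_tsv)
print(edges_tsv)
```

Because both tables are keyed on CURIEs, the same files can be bulk-loaded into a property graph or expanded into RDF triples.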

One of the goals is to use this alongside our N2V framework to discover new links (for example, identifying existing drugs that could be repurposed to treat COVID-19) and generate actionable knowledge.

Knowledge Graphs at the Virtual Biohackathon

The COVID-19 Biohackathon is a virtual event starting today (April 5 2020), lasting for a week, with the goal to “create a cohesive effort and work on tooling for COVID-19 analysis. The biohackathon will lead to more readily accessible data, protocols, detection kits, protein predictions etc.”. The Biohackathon was spearheaded by many of the same people behind the yearly Biohackathon which I have previously reported on.

One of the subgroups at the hackathon is the KnowledgeGraph group. This includes the kg-covid-19 contributors and other luminaries from the life sciences linked data / KG world, including neXtProt, UniProt, KnetMiner, Monarch, HPO, IDG-KG, GO.

I’m excited to see all these people working together as part of a dynamic group to produce tools that aim to help elucidate some of the biology underlying this critical threat. Of course, this is just one very small part of a massive global effort (really what we need to tackle COVID-19 is better public health infrastructure, widespread testing, ventilators, PPE for medical staff and workers on the front line, etc.; see How the Pandemic Will End by Ed Yong). But I also think that this is an opportunity for collaborating on some of the crucial knowledge-based tools that have wide applications in biomedicine.

If you want to know more, the details of the biohackathon can be found on its GitHub page, and the kg-covid-19 repository can be found here, with contributor guidelines here.

Using Wikidata for crowdsourced language translations of ontologies

In the OBO world, most of our ontologies are developed by and for English-speaking audiences. We would love to have translations of labels, synonyms, definitions, and so on in other languages. However, we lack the resources to do this (an exception is the HPO, which includes high quality curated translations for many languages).

Wikipedia/Wikidata is an excellent source of crowdsourced language translations. Wikidata provides language-neutral concept IDs that link multiple language-specific Wikipedia pages. Wikidata also includes mappings to ontology class IDs, and provides a SPARQL endpoint. All this can be leveraged for a first pass at language translations.

For example, the Wikidata entity for badlands is mapped to the equivalent ENVO class PURL. This entity in Wikidata also has multiple rdfs:label annotations (maximum one per language).

We can query Wikidata for all rdfs:label translations for all classes in ENVO. I will use the sparqlprog_wikidata framework to demonstrate this:

pq-wikidata 'envo_id(C,P),label(C,N),Lang is lang(N)'

This compiles down to the following SPARQL which is then executed against the Wikidata endpoint:

SELECT ?c ?p ?n ?lang WHERE {?c <http://www.wikidata.org/prop/direct/P3859> ?p . ?c <http://www.w3.org/2000/01/rdf-schema#label> ?n . BIND( LANG(?n) AS ?lang )}

The results look like this:

wd:Q272921,00000127,badlands,en
wd:Q272921,00000127,Badlandoj,eo
wd:Q272921,00000127,Tierras baldías,es
wd:Q272921,00000127,Badland,et
wd:Q272921,00000127,بدبوم,fa
wd:Q272921,00000127,Badlands,fr
wd:Q272921,00000127,בתרונות,he
wd:Q272921,00000127,Badland,hu
wd:Q272921,00000127,Բեդլենդ,hy
wd:Q272921,00000127,Calanco,it
wd:Q272921,00000127,悪地,ja
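Rows like these are easy to fold into a per-class translation table. The Python sketch below is only an illustration: it hard-codes three of the rows above rather than calling the Wikidata endpoint, and building the key as "ENVO:" plus the second column is my assumption about how to turn the local ID into a CURIE.

```python
import csv
import io
from collections import defaultdict

# Three of the result rows shown above, as comma-separated text.
rows_text = """wd:Q272921,00000127,badlands,en
wd:Q272921,00000127,Badlandoj,eo
wd:Q272921,00000127,Tierras baldías,es
"""

def labels_by_class(text):
    """Group labels as {ontology CURIE: {language code: label}}."""
    grouped = defaultdict(dict)
    for _wikidata_id, local_id, label, lang in csv.reader(io.StringIO(text)):
        grouped[f"ENVO:{local_id}"][lang] = label
    return dict(grouped)

translations = labels_by_class(rows_text)
print(translations)
```

Since Wikidata allows at most one rdfs:label per language, a plain dictionary keyed on the language code loses nothing.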

Somewhat disappointingly, there are relatively few translations for ENVO. But this is because the Wikidata property for mapping to ENVO is relatively new. We actually have a large number of outstanding new Wikidata to ENVO mappings we need to upload. Once this is done the coverage will increase.

Of course, different ontologies will differ in how their coverage maps to Wikidata. In some cases, ontologies will include many more concepts; or the corresponding Wikidata entities will have fewer or no non-English labels. But this will likely decrease over time.

There may be other ways to increase coverage. Many ontology classes are compositional in nature, so a combination of language translations of base classes plus language specific encodings of grammatical patterns could yield many more. The natural place to add these would be in the manually curated .yaml files used to specify ontology design patterns, through frameworks like DOSDP. And of course, there is a lot of research in Deep Learning methods for language translation. A combination of these methods could yield high coverage with hopefully good accuracy.

As far as I am aware, these methods have not been formally evaluated. Doing an evaluation will be challenging as it will require high-quality gold standards. Ontology developers spend a lot of time coming up with the best primary label for classes, balancing ontological correctness, elimination of ambiguity, understanding of usage of terms by domain specialists, and (for many ontologies, but not all) avoiding overly abstruse labels. Manually curated efforts such as the HPO translations would be an excellent start.

SPARQLProg at BioHackathon 2019

I’m at the 2019 BioHackathon in Fukuoka. This is my first BioHackathon, and I am loving it so far!

We have organized ourselves into different hacking groups, with a lot of interactions between them. There is a lot of cool stuff going on in cutting edge areas such as genome graphs and Markov logic networks. I’m getting FOMO wishing I could be part of all of the different groups. The BioHackathons have traditionally had a strong focus on semantic web technologies, and there are a number of fantastic SPARQL endpoints here in Japan. My own group coalesced around the general idea of applying logic programming approaches to bioinformatics problems. The group includes Will Byrd (of miniKanren and The Reasoned Schemer fame), Pjotr Prins, Deepak Unni, Hirokazu Chiba, and Shuichi Kawashima.

During the symposium, I presented on the sparqlprog framework. The slides are here:

The basic idea is to use logic programming as an over-arching framework, encompassing RDF and SPARQL, but allowing for additional expressivity and power.

One of the basic ideas here is to allow you to write complex queries using meaningful n-ary predicates. For example, if we want to query for all human genes in a particular range on a particular chromosome, and get the mouse orthologs, then we should be able to write this in as high-level a way as possible, for example like this:

feature_in_range(grch38:'X',10000000,20000000, HumanGene),
has_mouse_ortholog(HumanGene, MouseGene)

“feature_in_range” and “has_mouse_ortholog” are logic predicates. Unlike RDF predicates, logic programming predicates can have any number of arguments rather than exactly two (which is why the above notation is used, rather than infix, which only works for binary predicates). Names beginning with an uppercase letter are variables. This query is then translated to SPARQL, which is significantly more verbose:

SELECT ?g ?h WHERE {
  ?g sio:000558 ?h .
  ?h obo:RO_0002162 taxon:10090 .
  ?g a obo:SO_0001217 ;
     faldo:location [
       faldo:begin [
         faldo:position ?b ;
         faldo:reference homo_sapiens/GRCh38/X ] ;
       faldo:end [
         faldo:position ?e ;
         faldo:reference homo_sapiens/GRCh38/X ]] .
  FILTER (?b > 10000000) .
  FILTER (?e < 20000000)
}

The two predicates in the query are defined using simple rules in a logic program. A rule consists of a ‘head’ predicate followed by the implication operator ‘:-‘ and a ‘body’ which specifies a list of conditions.

feature_in_range(Ref,MinBegin,MaxEnd,Feat) :-
    location(Feat,Begin,End,Ref),
    Begin >= MinBegin,
    End =< MaxEnd.

location(Feat,Begin,End,Ref) :-
    location(Feat,Loc),
    begin(Loc,BeginPos),
    position(BeginPos,Begin),reference(BeginPos,Ref),
    end(Loc,EndPos),
    position(EndPos,End),reference(EndPos,Ref).

Queries using defined predicates are recursively rewritten until we bottom out in binary RDF predicates or builtin functions.
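The rewriting step can be mimicked in a few lines of Python. To be clear, this is my own toy illustration of unfolding, not how sparqlprog is implemented: goals are plain tuples, the RULES table re-encodes the two rules above by hand, every string argument in a rule body is treated as a variable, and body-local variables are renamed with a numeric suffix to keep them apart.

```python
import itertools

# (predicate name, arity) -> (head parameters, body goals)
RULES = {
    ("feature_in_range", 4): (
        ("Ref", "MinBegin", "MaxEnd", "Feat"),
        [("location", "Feat", "Begin", "End", "Ref"),
         (">=", "Begin", "MinBegin"),
         ("=<", "End", "MaxEnd")],
    ),
    ("location", 4): (
        ("Feat", "Begin", "End", "Ref"),
        [("location", "Feat", "Loc"),
         ("begin", "Loc", "BeginPos"),
         ("position", "BeginPos", "Begin"),
         ("reference", "BeginPos", "Ref"),
         ("end", "Loc", "EndPos"),
         ("position", "EndPos", "End"),
         ("reference", "EndPos", "Ref")],
    ),
}

_fresh = itertools.count()

def unfold(goal):
    """Recursively replace defined predicates by their bodies."""
    name, args = goal[0], goal[1:]
    rule = RULES.get((name, len(args)))
    if rule is None:
        return [goal]  # binary predicate or builtin: nothing left to expand
    params, body = rule
    binding = dict(zip(params, args))
    suffix = next(_fresh)
    expanded = []
    for sub in body:
        # Head parameters take the caller's arguments; other names are local.
        new_args = tuple(binding.get(a, f"{a}_{suffix}") for a in sub[1:])
        expanded.extend(unfold((sub[0],) + new_args))
    return expanded

expanded = unfold(("feature_in_range", "grch38:X", 10000000, 20000000, "HumanGene"))
for g in expanded:
    print(g)
```

After unfolding, every remaining goal has exactly two arguments, so each one corresponds either to a triple pattern or to a comparison that can become a FILTER.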

This is nice as it adds composability to SPARQL, and frees the query author from repeating common patterns across multiple queries.

But the overall framework is more powerful as programs can be more expressive than SPARQL, for example, involving recursion or backtracking. Portions are executed in the local logic programming environment, and portions are executed remotely on a SPARQL endpoint.

SPARQLProg execution environment. Clients can send queries and optionally a program (rules) to the SPARQLProg environment (using the pengines web logic protocol). Queries are by default compiled to SPARQL and executed remotely. Optionally, the program may seamlessly mix local and remote execution, with local execution allowing more expressivity.

SPARQLProg can be executed on the command line (examples here). It also runs as a service, and there is a docker container available, so all you need to do is:

docker run -p 9083:9083 cmungall/sparqlprog

An example of how to connect via Python can be found in some example Jupyter notebooks such as this one.

Reactions at the biohackathon seem to range from confusion to excitement. It’s fun to see people’s reactions when they ‘get it’. There seems to be a lot of enthusiasm from locals, with people contributing wrappers for KEGG and TogoVar, an integrated database of Japanese genomic variation.

Next up is a framework that will allow querying over specialized genome variant graph engines…

I am also working with Pier Luigi Buttigieg on ENVO. I recently developed a toolkit based on SPARQLProg for aligning an ontology to Wikidata. One of our goals is to upload GAZ (the OBO Gazetteer) into Wikidata, and align ENVO. This will allow us to extract ENVO classifications for all 600,000 entries in GAZ. The repo for this work can be found here.

More updates later, back to the hacking for now…

OntoTip: Don’t over-specify OWL definitions

This is one post in a series of tips on ontology development, see the parent post for more details.

A common mistake is to over-specify an OWL definition (another post will cover under-specification). While not technically wrong, over-specification loses you reasoning power, limiting your ability to auto-classify your ontology. Formally, what I mean by over-specifying here is: stating more conditions than are required for correct entailments.

One manifestation of this anti-pattern is the over-specified genus. (this is where I disagree with Seppala et al on S3.1.1, use the genus proximus, see previous post). I will use a contrived example here, although there are many real examples. GO contains a class ‘Schwann cell differentiation’, with an OWL definition referencing ‘Schwann cell’ from the cell ontology (CL).  I consider the logical definition to be neither over- nor under- specified:

‘Schwann cell differentiation’ EquivalentTo ‘cell differentiation’ and results-in-acquisition-of-features-of some ‘Schwann cell’

We also have a corresponding logical definition for the parent:

‘glial cell differentiation’ EquivalentTo ‘cell differentiation’ and results-in-acquisition-of-features-of some ‘glial cell’

The Cell Ontology (CL) contains the knowledge that Schwann cells are subtypes of glial cells, which allows us to infer that ‘Schwann cell differentiation’ is a subtype of ‘glial cell differentiation’. So far, so good (if you read the post on Normalization you should be nodding along). This definition does real work for us in the ontology: we infer the GO hierarchy based on the definition and classification of cells in CL. 
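The inference can be imitated with a toy structural check in Python. This is emphatically not a real OWL reasoner: the two dictionaries hand-encode only the axioms discussed here, and the rule "genus is subsumed and filler is subsumed" is a simplification that happens to be sound for definitions of this particular shape.

```python
# Asserted subclass links, standing in for the cell hierarchy in CL.
PARENTS = {
    "Schwann cell": ["glial cell"],
    "glial cell": ["cell"],
}

# Logical definitions: class -> (genus, filler of results-in-acquisition-of-features-of)
DEFINITIONS = {
    "Schwann cell differentiation": ("cell differentiation", "Schwann cell"),
    "glial cell differentiation": ("cell differentiation", "glial cell"),
}

def is_a(child, ancestor):
    """Reflexive, transitive subclass test over PARENTS."""
    return child == ancestor or any(
        is_a(parent, ancestor) for parent in PARENTS.get(child, [])
    )

def inferred_subclass(sub, sup):
    """True if sub's definition is at least as specific as sup's."""
    sub_genus, sub_filler = DEFINITIONS[sub]
    sup_genus, sup_filler = DEFINITIONS[sup]
    return is_a(sub_genus, sup_genus) and is_a(sub_filler, sup_filler)

print(inferred_subclass("Schwann cell differentiation", "glial cell differentiation"))
print(inferred_subclass("glial cell differentiation", "Schwann cell differentiation"))
```

Note that nothing in DEFINITIONS mentions glial cells for the Schwann cell term; the link comes entirely from the cell hierarchy.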

Now, imagine that in fact GO had an alternate OWL definition:

‘Schwann cell differentiation’ EquivalentTo ‘glial cell differentiation’ and results-in-acquisition-of-features-of some ‘Schwann cell’

This is not wrong, but is far less useful. We want to be able to infer the glial cell parentage, rather than assert it. Asserting it violates DRY (the Don’t Repeat Yourself principle) as we implicitly repeat the assertion about Schwann cells being glial cells in GO (when in fact the primary assertion belongs in CL). If one day the community decides that Schwann cells are in fact not glial cells but neurons (OK, so this example is not so realistic…), then we have to change this in two places. Having to change things in two places is definitely a bad thing.

I have seen this kind of genus-overspecification in a number of different ontologies; this can be a side-effect of the harmful misapplication of the single-inheritance principle (see ‘Single inheritance considered harmful’, a previous post). This can also arise from tooling limitations: the NCIT neoplasm hierarchy has a number of examples of this due to the tool they originally used for authoring definitions.

Another related over-specification is too many differentiae, which drastically limits the work a reasoner and your logical axioms can do for you. As a hypothetical example, imagine that we have a named cell type ‘hippocampal interneuron’, conventionally defined and used in the (trivial) sense of any interneuron whose soma is located in a hippocampus. Now let’s imagine that single-cell transcriptomics has shown that these cells always express genes A, B and C (OK, there are many nuances with integrating ontologies with single-cell data but let’s make some simplifying assumptions for now).

It may be tempting to write a logical definition:

‘hippocampal interneuron’ EquivalentTo

  • interneuron AND
  • has-soma-location SOME hippocampus AND
  • expresses some A AND
  • expresses some B AND
  • expresses some C

This is not wrong per se (at least in our hypothetical world where hippocampal interneurons always express these), but the definition does less work for us. In particular, if we later include a cell type ‘hippocampus CA1 interneuron’ defined as any interneuron in the CA1 region of the hippocampus, we would like this to be classified under hippocampal interneuron. However, this will not happen unless we redundantly state gene expression criteria for every class, violating DRY.

The correct thing to do here is to use what is sometimes called a ‘hidden General Class Inclusion (GCI) axiom’ which is just a fancy way of saying that SubClassOf (necessary conditions) can be mixed in with an equivalence axiom / logical definition:

‘hippocampal interneuron’ EquivalentTo interneuron AND has-soma-location SOME hippocampus

‘hippocampal interneuron’ SubClassOf expresses some A

‘hippocampal interneuron’ SubClassOf expresses some B

‘hippocampal interneuron’ SubClassOf expresses some C
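The difference in 'work' can be shown with another toy check in Python. The same caveats apply as for any hand-rolled sketch: this is not an OWL reasoner, and I am assuming for simplicity that a soma located in the CA1 region counts as located in the hippocampus (in a real ontology that would come from part-of axioms and a property chain).

```python
# Simplifying assumption: treat 'CA1 region' as falling under 'hippocampus'.
PARENTS = {"CA1 region": ["hippocampus"]}

def is_a(child, ancestor):
    """Reflexive, transitive lookup over PARENTS."""
    return child == ancestor or any(
        is_a(parent, ancestor) for parent in PARENTS.get(child, [])
    )

def classified_under(candidate, definition):
    """True if the candidate satisfies every condition in the definition."""
    cand_genus, cand_restrictions = candidate
    def_genus, def_restrictions = definition
    return is_a(cand_genus, def_genus) and all(
        any(rel == c_rel and is_a(c_filler, filler)
            for c_rel, c_filler in cand_restrictions)
        for rel, filler in def_restrictions
    )

# What we know about the new class from its own definition.
ca1_interneuron = ("interneuron", {("has-soma-location", "CA1 region")})

# The two competing equivalence axioms for 'hippocampal interneuron'.
minimal = ("interneuron", {("has-soma-location", "hippocampus")})
over_specified = ("interneuron", {
    ("has-soma-location", "hippocampus"),
    ("expresses", "A"), ("expresses", "B"), ("expresses", "C"),
})

print(classified_under(ca1_interneuron, minimal))         # classified
print(classified_under(ca1_interneuron, over_specified))  # not classified
```

With the minimal definition the CA1 class is placed automatically and then inherits the expression conditions via SubClassOf; with the over-specified one it is never placed at all.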

In a later post, I will return to the concept of an axiom doing ‘work’, and provide a more formal definition that can be used to evaluate logical definitions. However, even without a formal metric, the concept of ‘work’ is intuitive to people who have experience using OWL logical definitions to derive hierarchies. These people usually intuitively test things in the reasoner as they go along, rather than simply writing an OWL definition and hoping it will work.

Another sign that you may be overstating logical definitions is if they are for groups of similar classes, yet they do not fit into any design pattern template.

For example, in the above examples, the cell differentiation branch of GO fits into a standard pattern

cell differentiation and results-in-acquisition-of-features-of some C

where C is any cell type. The over-specified definition does not fit this pattern.

Proposed strategy for semantics in RDF* and Property Graphs

Update 2020-09-12: I created a GitHub repo that concretizes part of the proposal here https://github.com/cmungall/owlstar

Graph databases such as Neo4J are gaining in popularity. These are in many ways comparable to RDF databases (triplestores), but I will highlight three differences:

  1. The underlying datamodel in most graph databases is a Property Graph (PG). This means that information can be directly attached to edges. In RDF this can only be done indirectly via reification, or reification-like models, or named graphs.
  2. RDF is based on open standards, and comes with a standard query language (SPARQL), whereas a unified set of standards has yet to arrive for PGs.
  3. RDF has a formal semantics, and languages such as OWL can be layered on providing more expressive semantics.

RDF* (and its accompanying query language SPARQL*) is an attempt to bring PGs into RDF, thus providing an answer for points 1-2. More info can be found in this post by Olaf Hartig.

You can find more info in that post and in related docs, but briefly, RDF* adds syntax for attaching properties directly to edges, e.g.

<<:bob foaf:friendOf :alice>> ex:certainty 0.9 .

This has a natural visual cognate:

We can easily imagine building this out into a large graph of friend-of connections, or connecting other kinds of nodes, and keeping additional useful information on the edges.
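For comparison, here is roughly what the same edge property costs under classic RDF reification (point 1 in the list above), sketched in Python with triples as plain tuples. The statement node name `_:stmt1` is an arbitrary placeholder, and treating the RDF* line as equivalent to this expansion is only one common reading, not something I am claiming the RDF* proposal mandates.

```python
def reify(subject, predicate, obj, edge_props, stmt_id):
    """Expand one edge plus its properties into standard reification triples."""
    triples = [
        (subject, predicate, obj),             # the asserted edge itself
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", subject),
        (stmt_id, "rdf:predicate", predicate),
        (stmt_id, "rdf:object", obj),
    ]
    # Edge properties hang off the statement node, not off the edge.
    triples += [(stmt_id, prop, value) for prop, value in edge_props.items()]
    return triples

triples = reify(
    ":bob", "foaf:friendOf", ":alice",
    edge_props={"ex:certainty": 0.9},
    stmt_id="_:stmt1",  # placeholder blank node label
)
for t in triples:
    print(t)
```

One line of RDF* becomes six triples and an extra node, which is the verbosity the compact syntax is meant to hide.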

But what about the 3rd item, semantics?

What about semantics?

For many in both linked data/RDF and in graph database/PG camps, this is perceived as a minor concern. In fact you can often find RDF people whinging about OWL being too complex or some such. The “semantic web” has even been rebranded as “linked data”. But in fact, in the life sciences many of us have found OWL to be incredibly useful, and being able to clearly state what your graphs mean has clear advantages.

OK, but then why not just use what we have already? OWL-DL already has a mapping to RDF, and any document in RDF is automatically an RDF* document, so problem solved?

Not quite. There are two issues with continuing the status quo in the world of RDF* and PGs:

  1. The mapping of OWL to RDF can be incredibly verbose and leads to unintuitive graphs that inhibit effective computation.
  2. OWL is not the only fruit. It is great for the use cases it was designed for, but there are other modes of inference and other frameworks beyond first-order logic that people care about.

Issues with existing OWL to RDF mapping

Let’s face it, the existing mapping is pretty ugly. This is especially true for life-science ontologies that are typically conceived of as relational graphs, where edges are formally SubClassOf-SomeValuesFrom axioms. See the post on obo json for more discussion of this. The basic idea here is that in OWL, object properties connect individuals (e.g. my left thumb is connected to my left hand via part-of). In contrast, classes are not connected directly via object properties; rather, they are related via subClassOf and class expressions. It is not meaningful in OWL to say “finger (class) part_of hand (class)”. Instead we seek to say “all instances of finger are part_of some x, where x is an instance of a hand”. In Manchester Syntax this has the compact form

Finger SubClassOf Part_of some Hand

This is translated to RDF as

:Finger rdfs:subClassOf [
  a owl:Restriction ;
  owl:onProperty :part_of ;
  owl:someValuesFrom :Hand
] .

As an example, consider 3 classes in an anatomy ontology, finger, hand, and forelimb, all connected via part-ofs (i.e. every finger is part of some hand, and every hand is part of some forelimb). This looks sensible when we use a native OWL syntax, but when we encode as RDF we get a monstrosity:

Fig2: (A) Two axioms written in Manchester Syntax describing the anatomical relationships between three structures. (B) The corresponding RDF following the official OWL to RDF mapping, with 4 triples per existential axiom, and the introduction of two blank nodes. (C) How the axioms are conceptualized by ontology developers and domain experts, and how most browsers render them. The disconnect between B and C is an enduring source of confusion among many.
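To make the triple count concrete, here is a minimal sketch that expands the two part-of axioms into the triples produced by the official mapping. The function name `some_values_from_triples` and the blank node labels are my own illustrative choices; only the four-triple pattern itself comes from the OWL to RDF mapping.

```python
from itertools import count
from typing import List, Tuple

Triple = Tuple[str, str, str]

def some_values_from_triples(sub: str, prop: str, filler: str, bnode_ids) -> List[Triple]:
    """Triples for 'sub SubClassOf prop some filler' under the OWL-to-RDF mapping."""
    b = f"_:b{next(bnode_ids)}"  # a fresh blank node for the restriction
    return [
        (sub, "rdfs:subClassOf", b),
        (b, "rdf:type", "owl:Restriction"),
        (b, "owl:onProperty", prop),
        (b, "owl:someValuesFrom", filler),
    ]

axioms = [(":finger", ":part_of", ":hand"), (":hand", ":part_of", ":forelimb")]
ids = count()
rdf_triples = [t for ax in axioms for t in some_values_from_triples(*ax, ids)]
blank_nodes = {s for s, _, _ in rdf_triples if s.startswith("_:")}

print(len(rdf_triples), "triples,", len(blank_nodes), "blank nodes")
# prints: 8 triples, 2 blank nodes
```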

This ugliness was not the result of some kind of perverse decision by the designers of the OWL specs, it’s a necessary consequence of the existing stack which bottoms out at triples as the atomic semantic unit.

In fact, in practice many people employ some kind of simplification, bypassing the official mapping and storing the edges as simple triples, even though this is semantically invalid. We can see this for example in how Wikidata loads OBOs into its triplestore. This can cause confusion: for example, Wikidata stores reciprocal inverse axioms (e.g. part-of, has-part) even though this is meaningless when collapsed to simple triples.

I would argue there is an implicit contract when we say we are using a graph-based formalism that the structures in our model correspond to the kinds of graphs we draw on whiteboards when representing an ontology or knowledge graph, and the kinds of graphs that are useful for computation; the current mapping violates that implicit contract, usually causing a lot of confusion.

It also has pragmatic implications. Writing a SPARQL query that traverses a graph like the one in (B), following certain edge types but not others (one of the most common uses of ontologies in bioinformatics) is a horrendous task!
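The same pain shows up outside SPARQL. In the sketch below (plain Python, with hypothetical helper names `collapse_restrictions` and `ancestors`), a graph shaped like (B) has to be collapsed back into simple edges before an ordinary traversal that follows only part_of becomes possible. It handles only the existential restriction pattern and nothing else in OWL.

```python
from collections import defaultdict

# The graph from (B): two existential restrictions, each behind a blank node.
triples = [
    (":finger", "rdfs:subClassOf", "_:b0"),
    ("_:b0", "rdf:type", "owl:Restriction"),
    ("_:b0", "owl:onProperty", ":part_of"),
    ("_:b0", "owl:someValuesFrom", ":hand"),
    (":hand", "rdfs:subClassOf", "_:b1"),
    ("_:b1", "rdf:type", "owl:Restriction"),
    ("_:b1", "owl:onProperty", ":part_of"),
    ("_:b1", "owl:someValuesFrom", ":forelimb"),
]

def collapse_restrictions(triples):
    """Recover (class, property, class) edges from existential restrictions."""
    on_property, filler = {}, {}
    for s, p, o in triples:
        if p == "owl:onProperty":
            on_property[s] = o
        elif p == "owl:someValuesFrom":
            filler[s] = o
    edges = []
    for s, p, o in triples:
        if p == "rdfs:subClassOf" and o in on_property and o in filler:
            edges.append((s, on_property[o], filler[o]))
    return edges

def ancestors(start, edges, follow):
    """All nodes reachable from start via edges whose label is in `follow`."""
    out = defaultdict(list)
    for s, p, o in edges:
        if p in follow:
            out[s].append(o)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in out[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

edges = collapse_restrictions(triples)
print(edges)
print(ancestors(":finger", edges, follow={":part_of"}))
```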

OWL is not the only knowledge representation language

The other reason not to stick with the status quo for semantics for RDF* and PGs is that we may want to go beyond OWL.

OWL is fantastic for the things it was designed for. In the life sciences, it is vital for automatic classification and semantic validation of large ontologies (see half of the posts in this blog site). It is incredibly useful for checking the biological validity of complex instance graphs against our encoded knowledge of the world.

However, not everything we want to say in a Knowledge Graph (KG) can be stated directly in OWL. OWL-DL is based on a fragment of first order logic (FOL); there are certainly things not in that fragment that are useful, but often we have to go outside strict FOL altogether. Much of biological knowledge is contextual and probabilistic. A lot of what we want to say is quantitative rather than qualitative.

For example, when relating a disease to a phenotype (both of which are conventionally modeled as classes, and thus not directly linkable via a property in OWL), it is usually false to say “every person with this disease has this phenotype“. We can invent all kinds of fudges for this – BFO has the concept of a disposition, but this is just a hack for not being able to state probabilistic or quantitative knowledge more directly.

A proposed path forward for semantics in Property Graphs and RDF*

RDF* provides us with an astoundingly obvious way to encode at least some fragment of OWL in a more intuitive way that preserves the graph-like natural encoding of knowledge. Rather than introduce additional blank nodes as in the current OWL to RDF mapping, we simply push the semantics onto the edge label!

Here is an example of how this might look for the axioms in the figure above in RDF*

<<:finger :part-of :hand>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
<<:hand :part-of :forelimb>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .

I am assuming the existence of a vocabulary called owlstar here – more on that in a moment.

In any native visualization of RDF* this will end up looking like Fig2C, with the semantics adorning the edges where they belong. For example:

Figure: Proposed owlstar mapping of an OWL subclass restriction. This is clearly simpler than the corresponding graph fragment in 2B. While the edge properties (in square brackets) may be too abstract to show an end user (or even a bioinformatician performing graph-theoretic operations), the core edge is meaningful and corresponds to how an anatomist or ordinary person might think of the relationship.

Maybe this is all pretty obvious, and many people loading bio-ontologies into either Neo4j or RDF end up treating edges as edges anyway. You can see the mapping we use in our SciGraph Neo4J OWL Loader, which is used by both Monarch Initiative and NIF Standard projects. The OLS Neo4J representation is similar. Pretty much anyone who has loaded the GO into a graph database has done the same thing, ignoring the OWL to RDF mapping. The same goes for the current wave of Knowledge Graph embedding based machine learning approaches, which typically embed a simpler graphical representation.

So problem solved? Unfortunately, everyone is doing this differently, and essentially throwing out OWL altogether. We lack a standard way to map OWL into Property Graphs, so everyone invents their own. This is also true for people using RDF stores, who often have their own custom OWL mapping that is less verbose. In some cases this is semantically dubious, as is the case for the Wikidata mapping.

The simple thing is for everyone to get around a common standard mapping, and RDF* seems a good foundation. Even if you are using plain RDF, you could follow this standard and choose to map edge properties to reified nodes, or to named graphs, or to the Wikidata model. And if you are using a graph database like Neo4J, there is a straightforward mapping to edge properties.

I will call this mapping OWL*, and it may look something like this:

RDF* | OWL Interpretation
<<?c ?p ?d>> owlstar:interpretation owlstar:subClassOfSomeValuesFrom | ?c SubClassOf ?p some ?d
<<?c ?p ?d>> owlstar:interpretation owlstar:subClassOfQCR ; owlstar:cardinality ?n | ?c SubClassOf ?p exactly ?n ?d
<<?c ?p ?d>> owlstar:subjectContextProperty ?cp ; owlstar:subjectContextFiller ?cf ; owlstar:interpretation owlstar:subClassOfSomeValuesFrom | (?c and ?cp some ?cf) SubClassOf ?p some ?d

Note that the core of each of these mappings is a single edge/triple between class c, class d, and an edge label p. The first row is a standard existential restriction common to many ontologies. The second row is for statements such as ‘hand has part 5 fingers’, which is still essentially a link between a hand concept and a finger concept. The third is for a GCI (general concept inclusion), an advanced OWL concept which turns out to be quite intuitive and useful at the graph level, where we are essentially contextualizing the statement. E.g. in developmentally normal adult humans (context), hand has-part 5 finger.
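The three rows of the mapping table can be read as a small translation function. The sketch below renders an annotated edge as a Manchester Syntax string; `owlstar_to_manchester` is a hypothetical helper written only for illustration, and the class and property names in the calls are made up. One liberty taken here: the context properties are allowed alongside either interpretation, whereas the table only pairs them with the existential one.

```python
def owlstar_to_manchester(c, p, d, annotations):
    """Render one annotated edge as a Manchester Syntax axiom, per the table rows."""
    interp = annotations["owlstar:interpretation"]
    if interp == "owlstar:subClassOfSomeValuesFrom":
        rhs = f"{p} some {d}"
    elif interp == "owlstar:subClassOfQCR":
        rhs = f"{p} exactly {annotations['owlstar:cardinality']} {d}"
    else:
        raise ValueError(f"no mapping sketched for {interp}")
    # Optional subject context turns the axiom into a GCI.
    cp = annotations.get("owlstar:subjectContextProperty")
    cf = annotations.get("owlstar:subjectContextFiller")
    lhs = f"({c} and {cp} some {cf})" if cp and cf else c
    return f"{lhs} SubClassOf {rhs}"

row1 = owlstar_to_manchester(
    "Finger", "part_of", "Hand",
    {"owlstar:interpretation": "owlstar:subClassOfSomeValuesFrom"},
)
row2 = owlstar_to_manchester(
    "Hand", "has_part", "Finger",
    {"owlstar:interpretation": "owlstar:subClassOfQCR", "owlstar:cardinality": 5},
)
row3 = owlstar_to_manchester(
    "Hand", "has_part", "Finger",
    {
        "owlstar:interpretation": "owlstar:subClassOfSomeValuesFrom",
        "owlstar:subjectContextProperty": "part_of",
        "owlstar:subjectContextFiller": "AdultHuman",
    },
)
for axiom in (row1, row2, row3):
    print(axiom)
```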

When it comes to a complete encoding of all of OWL there may be decisions to be made as to when to introduce blank nodes vs cramming as much into edge properties (e.g. for logical definitions), but even having a standard way of encoding subclass plus quantified restrictions would be a huge boon.

Bonus: Explicit deferral of semantics where required

Many biological relationships expressed in natural language in forms such as “Lmo-2 binds to Elf-2” or “crocodiles eat wildebeest” can cause formal logical modelers a great deal of trouble. See for example “Lmo-2 interacts with Elf-2”: On the Meaning of Common Statements in Biomedical Literature (also slides), which lays out the different ways these seemingly straightforward statements about classes can be modeled. This is a very impressive and rigorous work (I will have more to say on how this aligns with GO-CAM in a future post), and ends with a formidable Wall of Logic:

Figure: Dense logical axioms proposed by Schulz & Jansen for representing biological interactions.

This is all well and good, but when it comes to storing the biological knowledge in a database, the majority of developers are going to expect to see this:

Figure: Protein interaction represented as a single edge connecting two nodes, as represented in every protein interaction database.

And this is not due to some kind of semantic laziness on their part: representing biological interactions using this graphical formalism (whether we are representing molecular interactions or ecological interactions) allows us to take advantage of powerful graph-theoretic algorithms to analyze data that are frankly much more useful than what we can do with a dense FOL representation.

I am sure this fact is not lost on the authors of the paper who might even regard this as somewhat trivial, but the point is that right now we don’t have a standard way of serializing more complex semantic expressions into the right graphs. Instead we have two siloed groups, one from a formal perspective producing complex graphs with precise semantics, and the other producing useful graphs with no encoding of semantics.

RDF* gives us the perfect foundation for being able to directly represent the intuitive biological statement in a way that is immediately computationally useful, and to adorn the edges with additional triples that more precisely state the desired semantics, whether it is using the Schulz FOL or something simpler (for example, a simple some-some statement is logically valid, if inferentially weak here).

Beyond FOL

There is no reason to have a single standard for specifying semantics for RDF* and PGs. As hinted in the initial example, there could be a vocabulary or series of vocabularies for making probabilistic assertions, either as simple assignments of probabilities or frequencies, e.g.

<<:RhinovirusInfection :has-symptom :RunnyNose>> probstar:hasFrequency 0.75 .

or more complex statements involving conditional probabilities between multiple nodes (e.g. probability of symptom given disease and age of patient), allowing encoding of ontological Bayesian networks and Markov networks.
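A minimal sketch of how such frequency-annotated edges might be used, assuming (naively) that symptoms are independent given the disease. Every number other than the 0.75 from the example above is a made-up placeholder, and `likelihood` is a hypothetical helper, not part of any probstar vocabulary.

```python
# Frequencies on has-symptom edges, keyed by (disease, symptom).
has_symptom_frequency = {
    (":RhinovirusInfection", ":RunnyNose"): 0.75,  # from the example above
    (":RhinovirusInfection", ":Fever"): 0.10,      # hypothetical value
    (":Influenza", ":RunnyNose"): 0.30,            # hypothetical value
    (":Influenza", ":Fever"): 0.90,                # hypothetical value
}

def likelihood(disease, observed, absent=()):
    """P(observations | disease), assuming symptoms are independent given the disease."""
    result = 1.0
    for symptom in observed:
        result *= has_symptom_frequency.get((disease, symptom), 0.0)
    for symptom in absent:
        result *= 1.0 - has_symptom_frequency.get((disease, symptom), 0.0)
    return result

for disease in (":RhinovirusInfection", ":Influenza"):
    print(disease, likelihood(disease, observed=[":RunnyNose"], absent=[":Fever"]))
```

A real Bayesian or Markov network would need conditional dependencies between nodes, which a flat table of per-edge frequencies cannot express; this only shows that the edge annotation is directly computable.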

We could also represent contextual knowledge, using a ‘that’ construct borrowed from IKL:

<<:clark_kent owl:sameAs :superman>> a ikl:that ; :believed-by :lois_lane .

which could be visually represented as:

Figure: Lois Lane believes Clark Kent is Superman. Here an edge has a link to another node rather than simply literals. Note that while possible in RDF*, in some graph databases such as Neo4j, edge properties cannot point directly to nodes, only indirectly through key properties. In other hypergraph-based graph DBs a direct link is possible.

Proposed Approach

What I propose is a series of lightweight vocabularies such as my proposed OWL*, accompanied by mapping tables such as the one above. I am not sure if W3C is the correct approach, or something more bottom-up. These would work directly in concert with RDF*, and extensions could easily be provided to work with various ways to PG-ify RDF, e.g. reification, Wikidata model, NGs.

The same standard could work for any PG database such as Neo4J. Of course, here we have the challenge of how best to encode IRIs in a framework that does not natively support these, but this is an orthogonal problem.

All of this would be non-invasive and unobtrusive to people already working with these, as the underlying structures used to encode knowledge would likely not change, beyond some additional adornment of edges. A perfect stealth standard!

It would help to have some basic tooling around this. I think the following would be straightforward and potentially very useful:

  • Implementation of the OWL* mapping of existing OWL documents to RDF* in tooling – maybe the OWLAPI, although we are increasingly looking to Python for our tooling (stay tuned to hear more on funowl).
  • This could also directly bypass RDF* and go directly to some PG representation, e.g. networkx in Python, or stored directly into Neo4J
  • Some kind of RDF* to Neo4J and SPARQL* to OpenCypher [which I assume will happen independently of anything proposed here]
  • An OWL-RL* reasoner that could demonstrate simple yet powerful and useful queries, e.g. property chaining in Wikidata
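To give a flavor of the last item, here is a toy forward-chaining rule over simple edges. It is not OWL RL and not an existing reasoner: it assumes every edge carries the SubClassOfSomeValuesFrom interpretation and that part_of has been declared transitive, in which case chaining the edges is logically sound. The `:body` class is added only to make the chain longer.

```python
def infer_transitive(edges, transitive_properties):
    """Forward-chain p(a,b), p(b,c) => p(a,c) for properties declared transitive."""
    inferred = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, p, b) in list(inferred):
            if p not in transitive_properties:
                continue
            for (b2, p2, c) in list(inferred):
                if p2 == p and b2 == b and (a, p, c) not in inferred:
                    inferred.add((a, p, c))
                    changed = True
    # Return only the newly derived edges.
    return inferred - set(edges)

edges = {
    (":finger", ":part_of", ":hand"),
    (":hand", ":part_of", ":forelimb"),
    (":forelimb", ":part_of", ":body"),  # hypothetical extra class
}
new_edges = infer_transitive(edges, {":part_of"})
for e in sorted(new_edges):
    print(e)
```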

A rough sketch of this approach was posted on public-owl-dev to not much fanfare, but, umm, this may not be the right forum for this.

Glossing over the details

For a post about semantics, I am glossing over the semantics a bit, at least from a formal computer science perspective. Yes of course, there are some difficult details to be worked out regarding the extent to which existing RDF semantics can be layered on, and how to make these proposed layers compatible. I’m omitting details here to try and give as simple an overview as possible. And it also has to be said, one has to be pragmatic here. People are already making property graphs and RDF graphs conforming to the simple structures I’m describing here. Just look at Wikidata and how it handles (or rather, ignores) OWL. I’m just the messenger here, not some semantic anarchist trying to blow things up. Rather than worrying about whether such and such a fragment of FOL is decidable (which, let’s face it, is not that useful a property in practice) let’s instead focus on coming up with pragmatic standards that are compatible with the way people are already using technology!

OntoTip: Write simple, concise, clear, operational textual definitions

This is a post in a series of tips on ontology development, see the parent post for more details.

Ontologies contain both textual definitions (aimed primarily at humans) and logical definitions (aimed primarily at machines). There is broad agreement that textual definitions are highly important (they are an OBO principle), and the utility of logical definitions has been shown for both ontology creation/maintenance (see previous post) as well as for analytic applications. However, there has been insufficient attention paid to the crafting of definitions, and to addressing questions such as how textual and logical definitions inter-relate, leading to a lot of inconsistent practice across OBO ontologies. 

Figure: Text definitions are for consumption by biocurators and domain scientists, logical definitions for machines. The logical definition is shown here in OWL Manchester syntax, with units written as human-readable labels in quotes. Note the correspondence between logical and textual definitions.

Two people who have thought deeply about this are Selja Seppälä and Alan Ruttenberg. They organized the 2016 International Workshop on Definitions in Ontologies (IWOOD 2016), and I will lift a quote directly from the website here:

Definitions of terms in ontologies serve a number of purposes. For example, logical definitions allow reasoners to assist in and verify classification, lessening the development burden and enabling expressive queries. Natural language (text) definitions allow humans to understand the meaning of classes, and can help ameliorate low inter-annotator agreement. Good definitions allow for non-experts and experts in adjacent disciplines to understand unfamiliar terms making it possible to confidently use terms from external ontologies, facilitating data integration. 

Despite the importance of definitions in ontologies, developers often have little if any training in writing definitions and axioms, as shown in Selja Seppälä and Alan Ruttenberg, Survey on defining practices in ontologies: Report, July 2013. This leads to varying definition practices and inconsistent definition quality. Worse, textual and logical definitions are often left out of ontologies altogether. 

I would also state that poorly constructed textual definitions can have severe long term ramifications. They can introduce cryptic ambiguities or misunderstandings that may not be uncovered for years, at which point they necessitate expensive ontology repair and re-curation efforts. My intent in this post is not to try and impose my own stylistic quirks on everyone else, but to improve the quality of engineering in ontologies, and to improve the lives of curators using definitions for their daily work.

There is an excellent follow-up paper Guidelines for writing definitions in ontologies by Seppälä, Smith, and Ruttenberg (henceforth referred to as the SRS paper), which should be required reading for anyone who is involved in building ontologies. The authors provide a series of guidelines based on their combined ontology development expertise and empirical work on surveying usage and attitudes.

While there is potentially an aspect of personal preference and stylistic preference in crafting text, I think that their guidelines are eminently sensible and deserve further exposure and adoption. I recommend reading the full paper. Here I will look at a subset of these, and give my own informal take on them. In their paper, SRS use a numbering system for their guidelines. I prefix their numbering system with S, and will go through them in a different order.

I have transcribed the guidelines to a table here, with the guidelines I discuss here in bold:

S1 Conform to conventions
S1.1 Harmonize definitions
S2 Principles of good practice
S3 Use the genus differentia form
S3.1 Include exactly one genus
S3.1.1 Use the genus proximus
S3.1.2 Avoid plurals
S3.1.3 Avoid conjunctions and disjunctions
S3.1.4 Avoid categorizers
S4 Avoid use/mention confusion
S5 Include necessary, and whenever possible, jointly sufficient conditions
S5.1 Avoid encyclopedic information
S5.2 Avoid negative terms
S5.3 Avoid definitions by extension
S6 Adjust the scope
S6.1 Definition should be neither too broad nor too narrow
S6.2 Define only one thing with a single textual definition
S7 Avoid circularity
S8 Include jointly satisfiable features
S9 Use appropriate degree of generality
S9.1 Avoid generalizing expressions
S9.2 Avoid examples and lists
S9.3 Avoid indexical and deictic terms
S9.4 Avoid subjective and evaluative statements
S10 Define abbreviations and acronyms
S11 Match text and logical definitions
S11.1 Proofread definitions

Concisely state necessary and sufficient conditions, cut the chit-chat

Cut_the_Crap

Listen to The Clash: cut the c**p

Combining S6.1 “A definition should be neither too broad nor too narrow” with S9.4 “avoid subjective and evaluative statements”, I would choose to emphasize that textual definitions should concisely encapsulate necessary and sufficient conditions, avoiding weasel words, irrelevant verbiage, chit-chat and random blethering. This makes it easier for a reader to home in on the intended meaning of the class. It also encourages a standard style (S1), which can make it easier for others to write definitions when creating new classes. It also makes it easier to be consistent with the logical definition, when provided (S11; see below).

SRS provide this example under S9.4:

cranberry bean: Also called shell bean or shellout, and known as borlotti bean in Italy, the cranberry bean has a large, knobby beige pod splotched with red. The beans inside are cream-colored with red streaks and have a delicious nutlike flavor. Cranberry beans must be shelled before cooking. Heat diminishes their beautiful red color. They’re available fresh in the summer and dried throughout the year (FOODON_03411186)

While this text contains potentially useful information, it is not a good operational definition: it lacks easy-to-apply objective criteria to determine what is and what is not a member of this class.

If you need to include discursive text, use either the definition gloss or a separate description field. The ‘gloss’ is the part of the text definition that comes after the first period/full-stop. A common practice in the GO is to recapitulate the definition of the differentia in the gloss. For example, the definition for ‘ectoderm development’ is

“The process whose specific outcome is the progression of the ectoderm over time, from its formation to the mature structure. In animal embryos, the ectoderm is the outer germ layer of the embryo, formed during gastrulation.”

(the embedded ‘ectoderm’ definition is the second sentence)

This suffers some problems as it violates DRY (if the wording of the definition of ectoderm changes, then the wording of the definition of ‘ectoderm development’ must change too). However, it provides utility as users do not have to traverse the elements of the OWL definition to get the bigger picture. It is marginally easier to semi-automatically update the gloss, compared to the situation where the redundant information permeates the core text definition.

When the conventions for a particular ontology allow for gloss, it is important to be consistent about how this is used, and to include only necessary and sufficient conditions before the period. Recently in GO we were puzzling over what was included and excluded in the following definition:

An apical plasma membrane part that forms a narrow enfolded luminal membrane channel, lined with numerous microvilli, that appears to extend into the cytoplasm of the cell. A specialized network of intracellular canaliculi is a characteristic feature of parietal cells of the gastric mucosa in vertebrates

It is not clear if parietal cells are included as an exemplar, or if this is intended as a necessary condition. S5.1 “avoid encyclopedic information” is excellent advice. This recommends putting examples of usage in a dedicated field. Unfortunately the practice of including examples in definitions is common because many curation tools limit which fields are shown, and examples can help curators immensely. I would therefore compromise on this advice and say that IF examples are to be included in the definition field, THEN they MUST be included in the gloss (after the necessary and sufficient conditions, separated by a period), AND they should be clearly indicated as examples. GO uses the string “An example of this {process,component,…} is found in …” to indicate an example.
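The core/gloss convention is simple enough to check mechanically. The sketch below splits a definition at the first sentence-ending period and reports whether the gloss is explicitly marked as an example using the GO wording. The function names are hypothetical, and the splitting rule is deliberately crude (an abbreviation such as "e.g." followed by a space would fool it).

```python
import re

def split_definition(text):
    """Split a text definition into (core, gloss) at the first sentence-ending period."""
    match = re.search(r"\.(\s+|$)", text)
    if match is None:
        return text.strip(), ""
    return text[: match.start() + 1].strip(), text[match.end():].strip()

def gloss_is_marked_example(gloss):
    """True if the gloss opens with the GO-style example marker."""
    return gloss.startswith("An example of this ")

definition = (
    "An apical plasma membrane part that forms a narrow enfolded luminal "
    "membrane channel, lined with numerous microvilli, that appears to extend "
    "into the cytoplasm of the cell. A specialized network of intracellular "
    "canaliculi is a characteristic feature of parietal cells of the gastric "
    "mucosa in vertebrates"
)
core, gloss = split_definition(definition)
print("core: ", core)
print("gloss:", gloss)
print("gloss marked as example:", gloss_is_marked_example(gloss))
```

For the definition above the gloss is not marked as an example, which is exactly the ambiguity we were puzzling over.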

Genus-differentia definitions are your friend (S3)

Figure: Genus-differentia definitions are your friend.

In the introduction, SRS define a ‘classic definition’ as one following genus-differentia style i.e. “a G that D”. The precise lexical structure can be modified for readability, but the important part is to state the characteristics that differentiate the class from a generic superclass.

The example in the paper is the Uberon definition of skeletal ligament: “Dense regular connective tissue connecting two or more adjacent skeletal elements”. Here the genus is “dense regular connective tissue” (which should be the name of a superclass in the ontology; not necessarily the direct parent post-reasoning) and the differentiating characteristic is the property of “connecting two or more adjacent skeletal elements” (which is also expressed via relationships in the ontology). As it happens, this definition violates one of the other principles, as we shall see later.

I agree enthusiastically with S3 “Use the genus-differentia form”. (Note that this should not be confused with elevation of single-inheritance as a desired property in released ontologies; see this post)

The genus-differentia definition should be both necessary (i.e. the genus and the characteristics hold for all instances of the class) and sufficient (i.e. anything that satisfies the genus and characteristics must be an instance of the class).

Genus-differentia definitions encourage modularity and reuse. We can construct an ontology in a modular fashion, reusing simpler concepts to fashion more complex concepts.

Genus-differentia form is an excellent way to ensure definitions are operational. The set of all genus-differentia definitions forms a decision tree: we can work up or down the tree to determine if an observation falls into an ontology class.

I also agree with S3.1 “include exactly one genus”. SRS give the example in OBI of

recombinant vector: “A recombinant vector is created by a recombinant vector cloning process”

which omits a genus (it could be argued that a more serious issue is the practice of defining an object in terms of its creation process rather than vice versa).

In fact, omission of a genus is often observed in logical definitions too, and is usually the result of an error, and will give unintended results in reasoning. I chose the following example from CLO (reported here):

http://purl.obolibrary.org/obo/CLO_0000266 immortal uterine cervix-derived cell line cell

This is wrong because a reasoner will classify anything that comes from a cervix as being a cell line!

In a rare disagreement with SRS, I have a slight issue with S3.1.1 “use the genus proximus”, i.e. use the closest parent term, but I will cover this in a future post. Using the closest parent can lead to redundancy and violations of DRY.

Avoid indexicals (S9.3)

Quoting SRS’ wording for S9.3:

Avoid indexical and deictic terms, such as ‘today’, ‘here’, and ‘this’ when they refer to (the context of ) the author of the definition or the resource itself. Such expressions often indicate the presence of a non-defining feature or a case of use/mention confusion. Most of the times, the definition can be edited and rephrased in a more general way

Here is a bad disease definition for a fictional disease (adapted from a real example): “A recently discovered disease that affects the anterior diplodocus organ…”. Don’t write definitions like this. This is obviously bad as it will become outdated and your ontology will look sad. If the date of discovery is important, include an annotation assertion for date of discovery (or better yet, a field for originating publication, which entails a date). But it’s more likely this is unnecessary verbiage that detracts from the business of precisely communicating the meaning of the class (S9.4).
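A check like this is easy to automate as a lint step. The sketch below flags candidate indexical or time-bound words with a regular expression; the word list is my own illustrative choice (only ‘today’, ‘here’ and ‘this’ come from the SRS wording), and it will produce false positives, e.g. on the legitimate GO phrase “An example of this process is found in …”.

```python
import re

# Illustrative word list; extend or prune to suit the ontology's conventions.
INDEXICAL_PATTERN = re.compile(
    r"\b(today|here|this|now|recently|currently|nowadays)\b", re.IGNORECASE
)

def find_indexicals(definition):
    """Return the candidate indexical terms found in a text definition, lowercased."""
    return [m.group(0).lower() for m in INDEXICAL_PATTERN.finditer(definition)]

bad = "A recently discovered disease that affects the anterior diplodocus organ"
good = "A disease that affects the anterior diplodocus organ"
print(find_indexicals(bad))   # prints: ['recently']
print(find_indexicals(good))  # prints: []
```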

Conform to conventions (S1)

As well as following natural language conventions and conventions of the domain of the ontology, it’s good to follow conventions, if not across ontologies, at least within the same ontology.

Do not replicate the name of the class in the definition

An example is a hypothetical definition for ‘nucleus’

A nucleus is a membrane-bounded organelle that …

This violates DRY and is not robust to changes in the name. Under S1.1 this is stated as “limiting the definition to the definiens”, alternatively stated as “avoid including the definiendum and copula”. If you really must include the name (definiendum), do this consistently throughout the ontology rather than ad-hoc. But I strongly recommend not to, and to start the text of the definition with the string “A <genus> that …”.
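This convention can also be checked automatically. A minimal sketch, assuming the class label and its definition are available as plain strings: `restates_name` is a hypothetical helper that reports whether the definition opens with the label, optionally preceded by an article.

```python
import re

def restates_name(label, definition):
    """True if the definition opens with the class label (optionally after an article)."""
    pattern = r"^(?:(?:an?|the)\s+)?" + re.escape(label) + r"\b"
    return re.match(pattern, definition.strip(), re.IGNORECASE) is not None

print(restates_name("nucleus", "A nucleus is a membrane-bounded organelle that ..."))  # prints: True
print(restates_name("nucleus", "A membrane-bounded organelle that ..."))               # prints: False
```

The check only looks at the start of the text, so it will not catch a label or synonym restated later in the definition.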

Here is another bad made-up definition for a fictional disease (based on real examples):

“Spattergroit (also known as contagious purple pustulitis) is a highly contagious disease caused by…”

Including a synonym in the definition violates DRY, and will lead to inconsistency if the synonym becomes a class in its own right. Remember, we are not writing encyclopedic descriptions, but ontology definitions. Information such as synonyms can go in dedicated fields (where they can be used computationally, and presented appropriately to the user).

S11 Match Textual and Logical Definitions

The OWL definition (aka logical definition, aka equivalence axiom), when it exists, should correspond in some broad sense to the text definition. This does not mean that it should be a literal transcription of the OWL. On the contrary, you should always avoid strange computerese conventions in text intended for humans (this includes the use of IDs in text, connecting_words_with_underscoresOrCamelCase, use of odd characters, as well as strange unwieldy grammatical forms; see S1). It does mean that if your OWL definition veers wildly from your text then you have a bad smell you need to get rid of before visitors come around.

If your OWL definition doesn’t match your text definition, it is often a sign you are writing overly clever complex Boolean logic OWL definitions that don’t correspond to how domain scientists think about the class [covered in a future post]. Or maybe you are over-axiomatizing, and you should drop your equivalence axiom since on examination it’s not right (see the over-axiomatizing principle).

SRS provide one positive example, but no negative examples. The positive example is from IDO:

Figure: Positive example from IDO. bacteremia: An infection that has as part bacteria located in the blood. This matches the logical definition: infection and (has_part some (infectious agent and Bacteria and (located_in some blood)))

Unfortunately, there are many cases where text and logical definitions deviate. An example reported for OBI is oral administration:

“The administration of a substance into the mouth of an organism”

The text definition above is considerably different from the logical one:

EquivalentTo (realizes some material to be added role) and (realizes some (target of material addition role and (role of some mouth)))

Use of DOSDPs can help here, as a standard textual definition can be generated for classes with OWL definitions. One thing that would be useful would be a tool that could help spot cases where the text definition and logical definition have diverged widely.
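As a very rough idea of what such a tool might do, the sketch below measures what fraction of the class labels used in a logical definition also appear in the text definition. This is a crude lexical heuristic of my own, not an existing tool: it ignores relations, synonyms and inflection, and the label lists are transcribed by hand from the two examples above.

```python
import re

def label_coverage(text_definition, logical_labels):
    """Fraction of labels from the logical definition that also appear in the text."""
    text = text_definition.lower()
    found = [
        label for label in logical_labels
        if re.search(r"\b" + re.escape(label.lower()) + r"\b", text)
    ]
    return len(found) / len(logical_labels), found

bacteremia = label_coverage(
    "An infection that has as part bacteria located in the blood",
    ["infection", "infectious agent", "Bacteria", "blood"],
)
oral_administration = label_coverage(
    "The administration of a substance into the mouth of an organism",
    ["material to be added role", "target of material addition role", "mouth"],
)
print("bacteremia:", bacteremia)
print("oral administration:", oral_administration)
```

The IDO example covers three of four labels, while the OBI example covers only one of three, which is at least consistent with the intuition that the latter has drifted further from its logical definition.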

Summary

I was able to write this post by cribbing from the SRS paper (Seppälä et al), which I strongly recommend reading. Even if you don’t agree with everything in either the paper or my own take, I think it’s important that the ontology community discusses some of these and reaches some kind of consensus on which principles to apply when.

Of course, there will always be an element of subjectivity and stylistic preference that will be harder to agree on. When making recommendations here there is the danger of being perceived as the ‘ontology police’. But I think there is a core set of common-sense principles that help with making ontologies more usable, consistent, and maintainable. My own experience strongly suggests that when this advice is not heeded, we end up with costly misannotation due to differing interpretations of terms, and many other issues.

I would like OBO to play more of a role in the process of coming up with these guidelines, and on evaluating their usage in existing ontologies. Stay tuned for more on this, and please provide feedback on what you think!