OntoTip: Write simple, concise, clear, operational textual definitions

This is a post in a series of tips on ontology development; see the parent post for more details.

Ontologies contain both textual definitions (aimed primarily at humans) and logical definitions (aimed primarily at machines). There is broad agreement that textual definitions are highly important (they are an OBO principle), and the utility of logical definitions has been shown both for ontology creation/maintenance (see previous post) and for analytic applications. However, insufficient attention has been paid to the crafting of definitions, and to questions such as how textual and logical definitions inter-relate, leading to a lot of inconsistent practice across OBO ontologies.


Text definitions are for consumption by biocurators and domain scientists, logical definitions for machines. The logical definition is shown here in OWL Manchester syntax, with class and property identifiers written as human-readable labels in quotes. Note the correspondence between logical and textual definitions.

Two people who have thought deeply about this are Selja Seppälä and Alan Ruttenberg. They organized the 2016 International Workshop on Definitions in Ontologies (IWOOD 2016), and I will lift a quote directly from the website here:

Definitions of terms in ontologies serve a number of purposes. For example, logical definitions allow reasoners to assist in and verify classification, lessening the development burden and enabling expressive queries. Natural language (text) definitions allow humans to understand the meaning of classes, and can help ameliorate low inter-annotator agreement. Good definitions allow for non-experts and experts in adjacent disciplines to understand unfamiliar terms making it possible to confidently use terms from external ontologies, facilitating data integration. 

Despite the importance of definitions in ontologies, developers often have little if any training in writing definitions and axioms, as shown in Selja Seppälä and Alan Ruttenberg, Survey on defining practices in ontologies: Report, July 2013. This leads to varying definition practices and inconsistent definition quality. Worse, textual and logical definitions are often left out of ontologies altogether. 

I would also state that poorly constructed textual definitions can have severe long term ramifications. They can introduce cryptic ambiguities or misunderstandings that may not be uncovered for years, at which point they necessitate expensive ontology repair and re-curation efforts. My intent in this post is not to try and impose my own stylistic quirks on everyone else, but to improve the quality of engineering in ontologies, and to improve the lives of curators using definitions for their daily work.

There is an excellent follow-up paper, Guidelines for writing definitions in ontologies, by Seppälä, Ruttenberg, and Smith (henceforth referred to as the SRS paper), which should be required reading for anyone who is involved in building ontologies. The authors provide a series of guidelines based on their combined ontology development expertise and empirical work on surveying usage and attitudes.

While there is potentially an aspect of personal preference and stylistic preference in crafting text, I think that their guidelines are eminently sensible and deserve further exposure and adoption. I recommend reading the full paper. Here I will look at a subset of these, and give my own informal take on them. In their paper, SRS use a numbering system for their guidelines. I prefix their numbering system with S, and will go through them in a different order.

I have transcribed the guidelines to a table here, with the guidelines I discuss here in bold:

S1 Conform to conventions
S1.1 Harmonize definitions
S2 Principles of good practice
S3 Use the genus-differentia form
S3.1 Include exactly one genus
S3.1.1 Use the genus proximus
S3.1.2 Avoid plurals
S3.1.3 Avoid conjunctions and disjunctions
S3.1.4 Avoid categorizers
S4 Avoid use/mention confusion
S5 Include necessary, and whenever possible, jointly sufficient conditions
S5.1 Avoid encyclopedic information
S5.2 Avoid negative terms
S5.3 Avoid definitions by extension
S6 Adjust the scope
S6.1 Definition should be neither too broad nor too narrow
S6.2 Define only one thing with a single textual definition
S7 Avoid circularity
S8 Include jointly satisfiable features
S9 Use appropriate degree of generality
S9.1 Avoid generalizing expressions
S9.2 Avoid examples and lists
S9.3 Avoid indexical and deictic terms
S9.4 Avoid subjective and evaluative statements
S10 Define abbreviations and acronyms
S11 Match text and logical definitions
S11.1 Proofread definitions

Concisely state necessary and sufficient conditions, cut the chit-chat


Listen to The Clash: cut the c**p

Combining S6.1 “A definition should be neither too broad nor too narrow” with S9.4 “avoid subjective and evaluative statements”, I would choose to emphasize that textual definitions should concisely encapsulate necessary and sufficient conditions, avoiding weasel words, irrelevant verbiage, chit-chat and random blethering. This makes it easier for a reader to home in on the intended meaning of the class. It also encourages a standard style (S1), which can make it easier for others to write definitions when creating new classes, and it makes it easier to be consistent with the logical definition, when provided (S11; see below).

SRS provide this example under S9.4:

cranberry bean: Also called shell bean or shellout, and known as borlotti bean in Italy, the cranberry bean has a large, knobby beige pod splotched with red. The beans inside are cream-colored with red streaks and have a delicious nutlike flavor. Cranberry beans must be shelled before cooking. Heat diminishes their beautiful red color. They’re available fresh in the summer and dried throughout the year (FOODON_03411186)

While this text contains potentially useful information, it is not a good operational definition: it lacks easy-to-apply, objective criteria for determining what is and what is not a member of this class.

If you need to include discursive text, use either the definition gloss or a separate description field. The ‘gloss’ is the part of the text definition that comes after the first period/full-stop. A common practice in the GO is to recapitulate the definition of the differentia in the gloss. For example, the definition for ‘ectoderm development’ is

“The process whose specific outcome is the progression of the ectoderm over time, from its formation to the mature structure. In animal embryos, the ectoderm is the outer germ layer of the embryo, formed during gastrulation.”

(the embedded ‘ectoderm’ definition is the second sentence)

This has some problems, as it violates DRY (if the wording of the definition of ectoderm changes, then the wording of the definition of ‘ectoderm development’ must change too). However, it is useful, as users do not have to traverse the elements of the OWL definition to get the bigger picture. It is also marginally easier to semi-automatically update the gloss than to update redundant information that permeates the core text definition.

When the conventions for a particular ontology allow for gloss, it is important to be consistent about how this is used, and to include only necessary and sufficient conditions before the period. Recently in GO we were puzzling over what was included and excluded in the following definition:

An apical plasma membrane part that forms a narrow enfolded luminal membrane channel, lined with numerous microvilli, that appears to extend into the cytoplasm of the cell. A specialized network of intracellular canaliculi is a characteristic feature of parietal cells of the gastric mucosa in vertebrates

It is not clear if parietal cells are included as an exemplar, or if this is intended as a necessary condition. S5.1 “avoid encyclopedic information” is excellent advice. This recommends putting examples of usage in a dedicated field. Unfortunately the practice of including examples in definitions is common because many curation tools limit which fields are shown, and examples can help curators immensely. I would therefore compromise on this advice and say that IF examples are to be included in the definition field, THEN they MUST be included in the gloss (after the necessary and sufficient conditions, separated by a period), AND they should be clearly indicated as examples. GO uses the string “An example of this {process,component,…} is found in …” to indicate an example.
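The convention above (core conditions before the first period, examples only in the gloss and explicitly marked) lends itself to a simple automated check. The following is a minimal sketch, assuming the gloss starts after the first period followed by a space; the function names are hypothetical, and abbreviations such as “e.g.” would confuse the naive sentence splitting.

```python
import re

# Marker that GO uses to flag an example, as quoted in the text above.
EXAMPLE_MARKER = re.compile(r"^An example of this .+ is found in ", re.IGNORECASE)


def split_definition(text):
    """Split a definition into (core, gloss) at the first period followed by a space."""
    core, sep, gloss = text.strip().partition(". ")
    if sep:
        core += "."  # restore the period consumed by partition
    return core, gloss.strip()


def unmarked_gloss_sentences(text):
    """Return gloss sentences that are not explicitly flagged as examples.

    These are candidates for curator review, not necessarily errors:
    a gloss may legitimately contain other discursive text.
    """
    _, gloss = split_definition(text)
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", gloss) if s.strip()]
    return [s for s in sentences if not EXAMPLE_MARKER.match(s)]
```

Run over an ontology, a check like this would surface gloss sentences such as the parietal cell statement above, where it is unclear whether an example or a necessary condition is intended.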

Genus-differentia definitions are your friend (S3)


In the introduction, SRS define a ‘classic definition’ as one following genus-differentia style, i.e. “a G that D”. The precise lexical structure can be modified for readability, but the important part is to state the characteristics that differentiate the class from a more generic superclass.

The example in the paper is the Uberon definition of skeletal ligament: “Dense regular connective tissue connecting two or more adjacent skeletal elements”. Here the genus is “dense regular connective tissue” (which should be the name of a superclass in the ontology; not necessarily the direct parent post-reasoning) and the differentiating characteristic is the property of “connecting two or more adjacent skeletal elements” (which is also expressed via relationships in the ontology). As it happens, this definition violates one of the other principles, as we shall see later.

I agree enthusiastically with S3 “Use the genus-differentia form”. (Note that this should not be confused with the elevation of single inheritance as a desired property in released ontologies; see this post)

The genus-differentia definition should be both necessary (i.e. the genus and the characteristics hold for all instances of the class) and sufficient (i.e. anything that satisfies the genus and characteristics must be an instance of the class).

Genus-differentia definitions encourage modularity and reuse. We can construct an ontology in a modular fashion, reusing simpler concepts to fashion more complex concepts.

Genus-differentia form is an excellent way to ensure definitions are operational. The set of all genus-differentia definitions forms a decision tree: we can work up or down the tree to determine if an observation falls into an ontology class.
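The decision-tree reading can be made concrete with a toy sketch. Everything here is an assumption for illustration: the three-class hierarchy is loosely modelled on the skeletal ligament example, the feature names (`dense_regular_connective`, `skeletal_elements_connected`) are hypothetical, and a real ontology would use an OWL reasoner rather than Python predicates.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ClassDef:
    name: str
    genus: Optional[str]                  # superclass name; None for the root
    differentia: Callable[[dict], bool]   # test separating this class from its genus


# Toy definitions, each of the form "a <genus> that <differentia>".
DEFS = [
    ClassDef("tissue", None, lambda obs: True),
    ClassDef("dense regular connective tissue", "tissue",
             lambda obs: obs.get("dense_regular_connective", False)),
    ClassDef("skeletal ligament", "dense regular connective tissue",
             lambda obs: obs.get("skeletal_elements_connected", 0) >= 2),
]


def classify(obs, defs=DEFS):
    """Walk down the genus tree; a class matches only if its genus matched first."""
    children = {}
    for d in defs:
        children.setdefault(d.genus, []).append(d)
    matched = []
    frontier = list(children.get(None, []))
    while frontier:
        d = frontier.pop(0)
        if d.differentia(obs):
            matched.append(d.name)
            frontier.extend(children.get(d.name, []))
    return matched
```

An observation that connects two skeletal elements but is not dense regular connective tissue never reaches the ‘skeletal ligament’ test, which is the sense in which the genus is a necessary condition and the genus plus differentia are jointly sufficient.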

I also agree with S3.1 “include exactly one genus”. SRS give the example in OBI of

recombinant vector: “A recombinant vector is created by a recombinant vector cloning process”

which omits a genus (it could be argued that a more serious issue is the practice of defining an object in terms of its creation process rather than vice versa).

In fact, omission of a genus is often observed in logical definitions too, and is usually the result of an error, and will give unintended results in reasoning. I chose the following example from CLO (reported here):

http://purl.obolibrary.org/obo/CLO_0000266 immortal uterine cervix-derived cell line cell

This is wrong because a reasoner will classify anything that comes from a cervix as being a cell line!
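A toy sketch of why a missing genus over-classifies. The exact CLO axiom is not reproduced here; the relation name `derives_from`, the class labels, and the individuals below are all assumptions made up for illustration, and the set-membership check only mimics what a reasoner would infer from an equivalence axiom.

```python
# Made-up individuals: asserted types and what each derives from.
INDIVIDUALS = {
    "cell_line_cell_1": {"types": {"cell line cell"}, "derives_from": {"uterine cervix"}},
    "tissue_biopsy_1": {"types": {"tissue sample"}, "derives_from": {"uterine cervix"}},
}


def members(genus, derived_from):
    """Individuals satisfying '<genus> and (derives_from some <derived_from>)'.

    Passing genus=None models an equivalence axiom that omits the genus.
    """
    return sorted(
        name
        for name, ind in INDIVIDUALS.items()
        if derived_from in ind["derives_from"]
        and (genus is None or genus in ind["types"])
    )


# Without a genus, the tissue biopsy is swept into the class as well.
without_genus = members(None, "uterine cervix")
with_genus = members("cell line cell", "uterine cervix")
```

With the genus restored, only the cell line cell satisfies the definition.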

In a rare disagreement with SRS, I have a slight issue with S3.1.1 “use the genus proximus”, i.e. use the closest parent term, but I will cover this in a future post. Using the closest parent can lead to redundancy and violations of DRY.

Avoid indexicals (S9.3)

Quoting SRS’ wording for S9.3:

Avoid indexical and deictic terms, such as ‘today’, ‘here’, and ‘this’ when they refer to (the context of) the author of the definition or the resource itself. Such expressions often indicate the presence of a non-defining feature or a case of use/mention confusion. Most of the times, the definition can be edited and rephrased in a more general way

Here is a bad disease definition for a fictional disease (adapted from a real example): “A recently discovered disease that affects the anterior diplodocus organ…”. Don’t write definitions like this. This is obviously bad as it will become outdated and your ontology will look sad. If the date of discovery is important, include an annotation assertion for date of discovery (or better yet, a field for originating publication, which entails a date). But it’s more likely this is unnecessary verbiage that detracts from the business of precisely communicating the meaning of the class (S9.4).
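Indexical and time-relative wording is easy to flag mechanically. Below is a minimal sketch; only ‘today’, ‘here’, and ‘this’ come from the SRS wording, while the remaining words in the list and the function name are my own assumptions. Expect false positives (for instance, ‘this’ in a GO-style “An example of this process” gloss), so treat the output as prompts for review.

```python
import re

# 'today', 'here', 'this' are from SRS S9.3; the rest are illustrative additions.
INDEXICALS = ["today", "here", "this", "recently", "currently", "now", "newly"]
PATTERN = re.compile(r"\b(" + "|".join(INDEXICALS) + r")\b", re.IGNORECASE)


def find_indexicals(definition):
    """Return indexical or time-relative words in a definition, in order of appearance."""
    return [m.group(1).lower() for m in PATTERN.finditer(definition)]
```

Applied to the fictional disease definition above, the check picks out ‘recently’.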

Conform to conventions (S1)

As well as following natural language conventions and the conventions of the ontology’s domain, it’s good to follow consistent definition conventions, if not across ontologies, then at least within the same ontology.

Do not replicate the name of the class in the definition

An example is a hypothetical definition for ‘nucleus’

A nucleus is a membrane-bounded organelle that …

This violates DRY and is not robust to changes in the name. Under S1.1 this is stated as “limiting the definition to the definiens”, alternatively stated as “avoid including the definiendum and copula”. If you really must include the name (definiendum), do this consistently throughout the ontology rather than ad hoc. But I strongly recommend against it, and recommend starting the text of the definition with the string “A <genus> that …”.

Here is another bad made-up definition for a fictional disease (based on real examples):

“Spattergroit (also known as contagious purple pustulitis) is a highly contagious disease caused by…”

Including a synonym in the definition violates DRY, and will lead to inconsistency if the synonym becomes a class in its own right. Remember, we are not writing encyclopedic descriptions, but ontology definitions. Information such as synonyms can go in dedicated fields (where they can be used computationally, and presented appropriately to the user).
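Both problems in this section (repeating the definiendum with a copula, and embedding synonyms) can be caught with a small lint function. This is a sketch under assumptions: the function name and issue strings are hypothetical, and the label and synonyms are taken to be available from the ontology’s own fields.

```python
import re


def lint_definition(label, synonyms, definition):
    """Flag a definition that opens with its own label plus copula, or mentions a synonym."""
    issues = []
    lowered = definition.strip().lower()
    # e.g. "A nucleus is a ..." or "Spattergroit (also known as ...) is a ..."
    opener = re.compile(
        r"^(?:an?\s+|the\s+)?" + re.escape(label.lower())
        + r"\s*(?:\([^)]*\)\s*)?(?:is|are)\b"
    )
    if opener.match(lowered):
        issues.append("starts with definiendum and copula")
    for syn in synonyms:
        if re.search(r"\b" + re.escape(syn.lower()) + r"\b", lowered):
            issues.append(f"mentions synonym: {syn}")
    return issues
```

A definition written in the recommended “A <genus> that …” form passes with no issues, while the two made-up examples above are both flagged.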

Match textual and logical definitions (S11)

The OWL definition (aka logical definition, aka equivalence axiom), when it exists, should correspond in some broad sense to the text definition. This does not mean that it should be a literal transcription of the OWL. On the contrary, you should always avoid strange computerese conventions in text intended for humans (this includes the use of IDs in text, connecting_words_with_underscoresOrCamelCase, use of odd characters, as well as strange unwieldy grammatical forms; see S1). It does mean that if your OWL definition veers wildly from your text then you have a bad smell you need to get rid of before visitors come around.

If your OWL definition doesn’t match your text definition, it is often a sign you are writing overly clever complex Boolean logic OWL definitions that don’t correspond to how domain scientists think about the class [covered in a future post]. Or maybe you are over-axiomatizing, and you should drop your equivalence axiom since on examination it’s not right (see the over-axiomatizing principle).

SRS provide one positive example, but no negative examples. The positive example is from IDO:


Positive example from IDO: bacteremia: “An infection that has as part bacteria located in the blood.” This matches the logical definition: infection and (has_part some (infectious agent and Bacteria and (located_in some blood)))

Unfortunately, there are many cases where text and logical definitions deviate. An example reported for OBI is oral administration:

“The administration of a substance into the mouth of an organism”

The text definition above is considerably different from the logical one:

EquivalentTo (realizes some material to be added role) and (realizes some (target of material addition role and (role of some mouth)))

Use of DOSDPs (Dead Simple OWL Design Patterns) can help here, since a standard textual definition can be generated for classes with OWL definitions. It would also be useful to have a tool that could help spot cases where the text definition and logical definition have diverged widely.
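Both ideas can be sketched in a few lines. This is not the DOSDP tooling or its file format: `render_definition` is a hypothetical stand-in for template-based generation, and `label_coverage` is a crude assumed heuristic (the fraction of class labels from the logical definition that literally occur in the text), offered only as a starting point for the kind of tool described above.

```python
def render_definition(genus, differentia):
    """Render a genus-differentia text definition from its two parts."""
    article = "An" if genus[0].lower() in "aeiou" else "A"
    return f"{article} {genus} that {differentia}."


def label_coverage(text_definition, logical_labels):
    """Fraction of labels used in the logical definition that occur in the text definition."""
    if not logical_labels:
        return 0.0
    text = text_definition.lower()
    hits = sum(1 for label in logical_labels if label.lower() in text)
    return hits / len(logical_labels)


# IDO bacteremia: text and logic share most of their vocabulary.
bacteremia = label_coverage(
    "An infection that has as part bacteria located in the blood.",
    ["infection", "infectious agent", "Bacteria", "blood"],
)

# OBI oral administration: only 'mouth' is shared.
oral = label_coverage(
    "The administration of a substance into the mouth of an organism",
    ["material to be added role", "target of material addition role", "mouth"],
)
```

Here the bacteremia definition scores 0.75 and the oral administration definition scores one third, so ranking classes by a score like this would push the OBI case towards the top of a review list. Literal label matching will miss paraphrases, so low scores indicate candidates for inspection rather than confirmed mismatches.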

Summary

I was able to write this post by cribbing from the SRS paper (Seppälä et al.), which I strongly recommend reading. Even if you don’t agree with everything in either the paper or my own take, I think it’s important that the ontology community discuss some of these guidelines and reach some kind of consensus on which principles to apply when.

Of course, there will always be an element of subjectivity and stylistic preference that will be harder to agree on. When making recommendations here there is the danger of being perceived as the ‘ontology police’. But I think there is a core set of common-sense principles that help with making ontologies more usable, consistent, and maintainable. My own experience strongly suggests that when this advice is not heeded, we end up with costly misannotation due to differing interpretations of terms, and many other issues.

I would like OBO to play more of a role in the process of coming up with these guidelines, and in evaluating their usage in existing ontologies. Stay tuned for more on this, and please provide feedback on what you think!
