Ontologies are ways of organizing collections of concepts that can be used to annotate data. Ontologies are complemented by schemas, data models, reporting formats and standards, which describe how data should be structured.
Some examples:
- The Generic Feature Format (GFF) for describing regions of a genome is complemented by the Sequence Ontology (SO), as well as additional ontologies used for describing attributes of those features.
- The Genomic Standards Consortium (GSC) Minimal Information about any Sequence (MIxS) standard refers to at least 14 ontologies, including the Environment Ontology (ENVO) for describing the environmental context of a sample, and Uberon for host anatomy.
- The MiAIRR standard from the Adaptive Immune Receptor Repertoire (AIRR) community makes use of over a dozen ontologies (see the schema on GitHub), including Uberon and the NCBI Taxonomy.
- The CZI cellxgene standard makes use of the Cell Ontology for annotating single-cell data.
- Neurodata Without Borders (NWB) is a standard for neurophysiology data, with a need for ontologies to provide context and metadata for experiments, organisms, measurements, and units.
- Phenopackets is an ISO-approved standard from the GA4GH for describing patients; it makes use of multiple ontologies, primarily HPO for phenotypes, but also others for representing cancer staging, drugs and treatments, and body sites.
As an example of how ontologies are used, the following figure illustrates the schema used by the Data Coordination Center for the ENCODE project (taken from Malladi et al, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4360730/):

You can get a sense of which ontologies are used by which standards on the FAIRsharing site – for example, for Uberon:

Despite this ubiquity, most ontologies are bound to standards and schemas in a very loose way. Usually there is accompanying free text instructing humans to do something like “use branch B of ontology O for filling in this field”, but these instructions are open to interpretation and don’t lend themselves to automated computational checking, or to driving data entry forms.
Furthermore, many of these ontologies are huge, and selecting the right term can be a challenge for whoever is providing the metadata – a scientist submitting the data, a data wrangler providing first-pass curation, or a biocurator at a knowledge base. There is a missed opportunity to provide selected subsets of the ontology that are relevant to a particular use case. For example, when providing environmental context for a terrestrial soil sample it’s not necessary to offer all three thousand terms in ENVO; terms for marine biomes, or terms that describe the insides of buildings or humans, are not relevant.
In fact, many schema languages such as JSON-Schema lack the ability to bind ontologies to field values; the need for this is demonstrated by this discussion on the JSON-Schema discussion group. Many groups (Human Cell Atlas, AIRR) have developed their own bespoke extensions to JSON-Schema, or to related formalisms such as OpenAPI, that allow the schema designer to specify a query for ontology terms that is then executed against a service like the Ontology Lookup Service (OLS). Here is an example from AIRR:
species:
  $ref: '#/Ontology'
  nullable: false
  description: Binomial designation of subject's species
  title: Organism
  example:
    id: NCBITAXON:9606
    label: Homo sapiens
  x-airr:
    miairr: essential
    adc-query-support: true
    set: 1
    subset: subject
    name: Organism
    format: ontology
    ontology:
      draft: false
      top_node:
        id: NCBITAXON:7776
        label: Gnathostomata
However, none of these JSON-Schema extensions are official, and from the discussion it seems this is unlikely to change soon. JSON-Schema does offer enums, a common construct in data modeling that allows field values to be constrained to a simple flat dropdown of strings. But standard enums are not FAIR, because they offer no way to map those strings to standard terms from ontologies.
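For example, a plain JSON-Schema enum (sketched here in YAML, reusing the species field from the AIRR example above) can only constrain a field to a flat list of strings, with nowhere to attach the corresponding ontology term IDs:

species:
  type: string
  description: Subject's species, as a bare string with no ontology mapping
  enum:
    - Homo sapiens
    - Mus musculus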
The clinical informatics community has dealt with the problem of combining ontologies and terminologies with data models for some time, and after various iterations of HL7, FHIR (Fast Healthcare Interoperability Resources) provides a way to do this via Value Sets. However, the FHIR solution is tightly bound to the FHIR data model, which is not always appropriate for modeling non-healthcare data.
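For comparison, here is a minimal sketch of an intensionally defined FHIR Value Set, selecting all SNOMED CT descendants of “clinical finding” (rendered in YAML for readability – FHIR instances are typically JSON or XML – and the url is hypothetical):

resourceType: ValueSet
url: http://example.org/fhir/ValueSet/clinical-findings ## hypothetical URL
compose:
  include:
    - system: http://snomed.info/sct
      filter:
        - property: concept
          op: is-a
          value: '404684003' ## clinical finding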
Using Ontologies with LinkML
LinkML (Linked Data Modeling Language) is a polymorphic, pluralistic data modeling language designed to work in concert with existing data modeling solutions such as JSON-Schema and SQL DDL, while extending them and adding semantics. LinkML provides a simple standard way to describe schemas, data models, reporting standards, and data dictionaries, and to optionally adorn these with semantics via IRIs from standard vocabularies and ontologies, ranging from schema.org through to rich OBO ontologies.
From the outset, LinkML has supported the ability to provide “annotated enums” (also known as Value Sets), extending the semantics of enums of JSON-Schema, SQL, and object oriented languages.
For example, the following enum illustrates a set of hardcoded permissible values mapped to terms from the kinship ontology (KIN) of the GA4GH pedigree standard:
enums:
  FamilialRelationshipType:
    permissible_values:
      SIBLING_OF:
        description: A family relationship where the two members have a parent in common
        meaning: kin:KIN_007
      PARENT_OF:
        description: A family relationship between offspring and their parent
        meaning: kin:KIN_003
      CHILD_OF:
        description: inverse of the PARENT_OF relationship
        meaning: kin:KIN_002
Each permissible value is optionally annotated with a “meaning” tag, whose value is a CURIE denoting a term from a vocabulary or other external resource. Each permissible value can also be adorned with a description, as well as other metadata.
The use of ontology mappings here gives us an interoperability hook: two different schemas can interoperate via shared standard terms, even if each presents strings that are familiar to its local community.
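For instance, a second, hypothetical schema could present different local strings while mapping to the same kinship terms, making the two schemas interoperable at the level of shared meanings:

enums:
  RelationshipCode:
    permissible_values:
      SIB:
        meaning: kin:KIN_007 ## same term as SIBLING_OF above
      PRNT:
        meaning: kin:KIN_003 ## same term as PARENT_OF above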
This all works well for relatively small value sets. But what happens if I want to have a field in my schema whose value can be populated by any term from the Eukaryote branch of NCBI Taxonomy? It’s not really practical to include a ginormous list of IDs and terms in a schema – especially if it’s liable to change.
Dynamic Enums
Starting with the new LinkML 1.3 model release today, it is possible to describe dynamic value sets in your schema. Rather than hardcoding a list of terms, you provide a declarative statement saying how to construct the value set, allowing the binding to specific terms to happen later.
Let’s say you want a field in your standard to be populated with any of the subtypes of “neuron” in the Cell Ontology; you could do it like this:
enums:
  NeuronTypeEnum:
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000540 ## neuron
      include_self: false
      relationship_types:
        - rdfs:subClassOf
This value set is defined in terms of an ontology graph query: any term that is reachable from the node representing the base neuron class, walking down subClassOf edges, is included.
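To put the value set to work, a slot in the schema declares it as its range; a minimal sketch, with a hypothetical cell_type slot:

slots:
  cell_type:
    description: The type of neuron being described, drawn from the dynamic value set above
    range: NeuronTypeEnum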
Dynamic enums can also be composed together, either by nesting boolean expressions to arbitrary depth, or by naming sub-patterns and reusing them.
A more complex example involves two enums. The first is a general one for any kind of disease; the second both extends and restricts it, extending it to include cancer terms from a different vocabulary, and restricting it to exclude non-human animal diseases:
enums:
  Disease:
    reachable_from:
      source_ontology: obo:mondo
      source_nodes:
        - MONDO:0000001 ## disease or disorder
      is_direct: false
      relationship_types:
        - rdfs:subClassOf
  HumanDisease:
    description: Extends the Disease value set, including NCIT neoplasms, excluding non-human diseases
    inherits:
      - Disease
    include:
      - reachable_from:
          source_ontology: obo:ncit
          source_nodes:
            - NCIT:C3262 ## neoplasm
    minus:
      - reachable_from:
          source_ontology: obo:mondo
          source_nodes:
            - MONDO:0005583 ## non-human animal disease
          relationship_types:
            - rdfs:subClassOf
Tooling support
There is already tooling support for static LinkML enums: when LinkML is used to generate JSON-Schema or SQL DDL, enums are mapped to the corresponding constructs (with some loss of metadata).
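For example, the FamilialRelationshipType enum above compiles to an ordinary JSON-Schema enum along these lines (a sketch of the shape of the output, not verbatim generator output); the descriptions and meaning CURIEs are dropped:

FamilialRelationshipType:
  type: string
  enum:
    - SIBLING_OF
    - PARENT_OF
    - CHILD_OF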
LinkML-native tools such as DataHarmonizer support enums as hierarchical drop-downs in data entry.
For the newer dynamic enums, the current focus is on standardizing how they are specified, with tooling support expected to follow soon.
This is an area where a one-size-fits-all solution is difficult, given the variety of use cases. Consider the task of validating data against a schema with dynamic enums. In some cases, you may want to do this at run time, with a dynamic query against a remote ontology server. In other cases you might prefer to avoid a network dependency at validation time, opting instead for either local ontology lookup or pre-materialization of the value set. One way to do the latter is to perform the ontology lookups at the time of compiling to JSON-Schema.
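For example, a compile step could expand the NeuronTypeEnum above into a static enum by querying the Cell Ontology once and shipping the materialized schema. An illustrative sketch showing just two of the many resulting terms (term IDs quoted from memory):

enums:
  NeuronTypeEnum:
    permissible_values:
      motor neuron:
        meaning: CL:0000100
      interneuron:
        meaning: CL:0000099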
For the ontology lookups there are a variety of options. BioPortal includes almost a thousand ontologies covering most terminological uses in the life sciences, with ontologies in the OntoPortal Alliance covering other domains like agriculture, ecology, and materials science. OLS also hosts a large number of ontologies, again mostly in the life sciences. But these ontology browsers don’t cover all entities – for example, not all proteins in UniProt – and for some use cases it may be necessary to use bespoke vocabularies. Tooling support for dynamic enums should cover all these scenarios.
This could be supplied using the new Ontology Access Kit (OAK) library, which provides a unifying layer over multiple different ontology sources: hosted ontology portals like BioPortal, OLS, and Ontobee; local files in a variety of formats; and local and remote SPARQL endpoints, including biological databases as well as Wikidata and Linked Open Vocabularies.
Trying it out
Head over to the LinkML documentation for information and tutorials on how to use LinkML and LinkML enums. Also check out schemasheets, an easy way to author schemas/standards, including enums, using Google Sheets (or Excel, if you must).
Whether or not you opt to use LinkML, if you are involved in creating standards or data models, or in providing the ontologies used in those standards, I hope this post is useful in thinking about how to structure things in a way that makes standards more rigorous, easier for scientists to use, easier to validate systematically, and better able to drive interfaces automatically!