Ontologies, knowledge models, and other kinds of standards are generally not static artefacts. They are created to serve a community, which likely includes you, and they should respond dynamically to serve that community, where resources allow.
The content of many ontologies in OBO, such as the Gene Ontology, is driven by their respective curator communities. Ontology editors make terms and whole ontology branches prospectively in anticipation of needs, but they also make terms and changes in response to curator needs. New terms can also be requested by data modelers, data engineers, and data scientists, for example, to map categorical data in a dataset in order to allow harmonization and cleaning of data. In many cases, the terms may exist already but be hard to find, or they may be spread across several ontologies, and this may be confusing, especially if you are not familiar with the area.
I wrote this guide primarily with the data engineer audience in mind, as this community may be less familiar with norms and tacit knowledge around the ontology development lifecycle. However, much of what I say is applicable to curators as well. I also wrote this document originally for members of my group, who are expected to contribute back to OBO and the work of ontology developers. Some of the recommendations may therefore seem a little onerous in places, but they should still hopefully be useful and adaptable to a broader audience. And in all cases, for open community ontologies, it is worth bearing in mind the maxim that ‘you get out what you put in’.
- As a curator (e.g. BgeeDb), I want a new anatomy term, to curate expression for a gene
- As a data scientist, I need terms describing oxygen-requirement traits, so I can combine tabular microbial trait data to predict traits from other features
- Alternate scenario: building a microbial Knowledge Graph (see this issue)
- As a database developer, I need terms for sequence variants, so I can map categorical values in my database making it FAIR
- As a knowledge graph builder, I need relations from the Relation Ontology or Biolink, so I can standardize edge labels in my graph
- As a GO ontology developer, I need terms from the cell ontology, so I can provide logical defining axioms for terms in the cell differentiation branch
- As a microbiome scientist, I need terms from ENVO/PO, so I can fill in MIxS-compliant environmental fields when I submit my sample data, making it FAIR
- As an environmental genomics standards provider, I need terms from ENVO, so I can map enumerated values/dropdowns to an ontology when developing the MIxS standard
- As a data modeler / standards provider, I need SO terms for genomic feature types, to define a value set for a genomics exchange format (e.g. GFF)
- As a schema developer, I need terms for describing properties of sequence assemblies, e.g. number of aligned reads, N50, in order to make my sequencing schema FAIR
These scenarios encompass a range of different kinds of person with varying levels of expertise and commitment. The primary audience for this document is members of my group and the projects we are involved with (GO, Monarch, NMDC, Translator), but many of the recommendations will apply more broadly. However, we would not expect the average scientist who is submitting a dataset to engage at the same level through GitHub etc. (see the end of the document for discussion on approaches for making the overall process easier).
20ish simple rules for selecting and requesting ontology terms
Be a good open science citizen
The work you are doing is part of a larger open science project, and you should have a community-minded attitude.
When you request terms from ontologies and provide information to help, you should be microcredited, e.g. your ORCID will be associated with the term. Remember that many efforts are voluntary or unfunded, and people are not necessarily paid to help you. Provide help where you can, provide context when making requests, and share any background knowledge, explicit or tacit, that may help.
Always be respectful and appreciative when interacting with providers of terms or other curators. Follow codes of conduct.
Use the appropriate ontology or standard: avoid pick-and-mix
Depending on the context of your project, there may be mandated or recommended standards or ontologies. These may be explicit or implicit.
If you are performing curation, it is likely that the ontology you use is fixed by your curation best practices and even your tools. For example, GO curators (obviously) use GO. But for other purposes it may not be obvious which ontology to use (and even with GO, curators have a choice of ontologies for providing additional context as extensions or in GO-CAMs). There are a large number of them, with confusing overlaps, and lots of tacit community knowledge that may not be immediately available to you.
Some general guidelines:
- Look in the appropriate place, depending on what kind of term you need
- If you need a classification term or a descriptor, then use an ontology
- If you need something like a gene or a variant “term” then an ontology may not be appropriate, use the appropriate database instead, with caveats
- If you need a property to describe a piece of data, then you may need to look in existing semantics schemas, e.g. schemas encoded in RDFS, a shape language such as ShEx, or LinkML
- When looking for an ontology term, favor OBO over non-OBO resources
- Sometimes better coverage is only available outside OBO – e.g. EDAM has a lot more terms for describing bioinformatics software artefacts, and EDAM is not in OBO. But it is still good to engage owners of the appropriate OBO ontology
- When a non-OBO ontology is selected, use the OBO principles and guidelines to help evaluation – e.g. is the ontology open? Does it follow good identifier lifecycle management?
- The OLS includes a broader selection of ontologies than OBO, but one that is narrower than what is in Bioportal. In my experience the non-OBO ontologies it includes are quite pragmatic choices in many situations, e.g. EDAM.
- Even within OBO, there may still be confusion as to which ontologies to use, especially when many seem to have overlapping concepts, and scope may be poorly defined.
- An example is the term used to describe an organism’s core metabolism for our microbial knowledge graph KG-Microbe, with multiple OBO contenders.
- Always consult http://obofoundry.org to glean information about the ontology. This is always the canonical unbiased source, and includes curated up-to-date metadata
- A crucial piece of metadata that is in OBO is the ontology status – you must avoid using ontologies that are obsoleted, and you should avoid using ontologies that are marked inactive
- Look at the ‘usages’ field in OBO. Has the ontology been used for similar purposes as what you intend? If the ontology has no usages, this is a worrying sign the ontology was made for ontologists rather than practical data annotators such as yourself (but note that some ontologies may be behind in curating their usages into OBO)
- Look at the scope of the ontology, as defined on the OBO page. Is it well defined and clear? If not, consider avoiding. Is your term in scope for this ontology? If not then don’t use terms from the ontology just because the labels match.
- Is the ontology an application ontology, i.e. an ontology that is not intended to be a reference for terms within a domain? If so it may not be fit for your intended use.
- Consult others if in doubt. Many people in the group or in our funded projects are involved with specific ontologies.
- You should be on the OBO slack, this is a good place to get advice.
- Favor ontologies we are actively involved with or that follow similar data models and principles
- Favor more active ontologies. OBO marks inactive ontologies with metadata tags that are clearly displayed, but you should still check
- Is the github project active?
- Are there many tickets that are never answered?
- If you suspect an ontology is not active but it is not marked as such, be a good citizen and raise this on the OBO tracker
- Use precedent – see what has been done previously in similar projects
- We are actively working on projects like OBO Dashboard and on improving OBO metadata to help ontology selection
- For any candidate term
- Is it obsoleted? If so avoid, but look at the metadata for replacements
- Does it have a definition? A core OBO principle is that reasonable attempts must be made to define terms in an ontology
- Is there a taxonomic scope? Always use the appropriate taxonomic scope. If an Uberon term is restricted to vertebrates, it is valid to use for humans. But if an ontology or term is designed for use with mouse, it may not be valid to apply its terms to humans
- Have others used the term?
- If you have formal ontology training, avoid over-ontologizing in your thought processes for selection. See for example the section below on shadow terms.
- Avoid terms that seem over-ontologized; e.g. that have strange labels a domain scientist would not understand
- If you are looking for terms to categorize nodes or edges in a Knowledge Graph:
- For most of our projects, KGs should conform to the biolink-model, so this is the appropriate place to search
- Note that biolink still leverages OBO and standard bioinformatics databases for the nodes themselves; biolink classes and predicates are used for the node categories and edge labels
- For environmental samples use GSC MIxS terms for column headers
- Use ENVO for describing the environment
Figure: The OBO site provides up to date metadata for its ontologies. An example of an ontology marked deprecated, with the suggested replacement. Note in this case this ontology was not deprecated due to quality issues, instead the developers worked with a different ontology to incorporate their work, and provided new IDs for all their existing classes.
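The status and usage checks from the guidelines above can be sketched as a small triage function. This is only an illustration, not OBO tooling: the records are hand-written, and the field names (activity_status, is_obsolete, usages) are assumptions about how such metadata might be shaped, so check the real registry before relying on them.

```python
from typing import Dict, List

# Hypothetical, hand-written records. The ontology IDs and field names are
# assumptions for illustration only, not a copy of the real OBO registry.
REGISTRY = [
    {"id": "aaa", "activity_status": "active", "is_obsolete": False, "usages": ["project X"]},
    {"id": "bbb", "activity_status": "inactive", "is_obsolete": False, "usages": []},
    {"id": "ccc", "activity_status": "active", "is_obsolete": True, "usages": []},
]


def triage(record: Dict) -> List[str]:
    """Return warnings for one ontology record (an empty list means no red flags)."""
    warnings = []
    if record.get("is_obsolete"):
        warnings.append("obsolete: do not use")
    if record.get("activity_status") != "active":
        warnings.append("not marked active: avoid if possible")
    if not record.get("usages"):
        warnings.append("no recorded usages: check whether anyone uses it in practice")
    return warnings


report = {r["id"]: triage(r) for r in REGISTRY}
for ont_id, warnings in report.items():
    print(ont_id, "->", warnings or ["no red flags"])
```

A missing usages entry is treated as a prompt to look further rather than a hard failure, since some ontologies may simply be behind in curating their usages.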
The biolink model will serve as a canonical guide for what kinds of IDs should be used for any kind of entity. The SOP is to find the category of relevance in biolink, and then examine the id_prefixes field. This indicates the resources that provide identifiers that are valid to use for that entity type, in priority order.
For example, for BiologicalProcess you will see on the page and in the yaml
– KEGG.MODULE ## M number
Figure: portion of Biolink page for the data class BiologicalProcess. The favored ID prefixes are shown
This means that GO is always our favored ontology / ID space for representing a biological process. This is followed by Reactome, then MetaCyc, then KEGG. Of course, GO and Reactome serve different purposes, with Reactome pathway IDs classified using GO IDs. If you disagree with this ordering you can make a PR on biolink (or you can also make a project-specific extension/contraction of biolink).
Avoid a pick-and-mix approach. It is better to draw like terms from the same ontology, as this ensures overall coherence and allows reasoning to be better leveraged.
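The priority-order idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the prefix spellings in the list are my guess at how the ordering described above would be written, and the identifiers in the usage example are placeholders rather than a real mapping. Take the actual list from the id_prefixes field in biolink.

```python
from typing import List, Optional

# Illustrative priority list following the ordering described in the text.
# The exact prefix spellings are assumptions; use the real id_prefixes values.
BIOLOGICAL_PROCESS_PREFIXES = ["GO", "REACT", "MetaCyc", "KEGG.MODULE"]


def preferred_id(curies: List[str], prefixes: List[str]) -> Optional[str]:
    """Pick the CURIE whose prefix comes earliest in the priority list."""
    rank = {prefix: i for i, prefix in enumerate(prefixes)}
    candidates = [c for c in curies if c.split(":", 1)[0] in rank]
    if not candidates:
        return None  # none of the IDs use a prefix that is valid for this category
    return min(candidates, key=lambda c: rank[c.split(":", 1)[0]])


# Placeholder identifiers for one entity known under two ID spaces.
equivalent_ids = ["KEGG.MODULE:M00000", "GO:0000000"]
print(preferred_id(equivalent_ids, BIOLOGICAL_PROCESS_PREFIXES))
```

Returning None for unrecognized prefixes makes it explicit when an identifier falls outside the resources listed for that entity type.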
If you are creating a LinkML enum, a good rule of thumb is that all ‘meaning’ annotations should come from the same ontology. Of course, this may not always be the case.
For example, the enum for sex_chromosome_type in chromo is all drawn from GO:
what type of sex chromosome
Similarly for the gp2term relationship field in GPAD, these are all drawn from RO:
(note that part_of, BFO:0000050, is actually in RO, not BFO, despite the ID space)
However, entity type is drawn from SO, GO, and PR.
It is a bad smell to have a mix of different ontologies for what should be a set of similar entities, e.g.
## TODO: the mappings below are automated
description: strictly anaerobic
description: obligate aerobic
description: obligate anaerobic
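One way to spot this smell mechanically is to count the ID prefixes used in the ‘meaning’ annotations of an enum. The sketch below assumes the enum has already been loaded into a plain dictionary shaped like a LinkML permissible_values block; the ONT_A and ONT_B prefixes and their IDs are made up for illustration.

```python
from collections import Counter


def meaning_prefixes(permissible_values: dict) -> Counter:
    """Count the CURIE prefixes used in the 'meaning' of each permissible value."""
    counts = Counter()
    for spec in permissible_values.values():
        meaning = (spec or {}).get("meaning")
        prefix = meaning.split(":", 1)[0] if meaning else "(unmapped)"
        counts[prefix] += 1
    return counts


# Hypothetical enum with placeholder CURIEs (ONT_A and ONT_B are not real ontologies).
mixed_enum = {
    "aerobic": {"meaning": "ONT_A:0000001"},
    "anaerobic": {"meaning": "ONT_A:0000002"},
    "obligate aerobic": {"meaning": "ONT_B:0000003"},
    "facultative": {},
}

counts = meaning_prefixes(mixed_enum)
mapped = [p for p in counts if p != "(unmapped)"]
is_mixed = len(mapped) > 1
print(dict(counts), "-> mixed" if is_mixed else "-> coherent")
```

A mixed result is a prompt to review the enum, not an automatic error, since the rule of thumb has legitimate exceptions.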
Some ontologies are themselves ad-hoc in their scoping, which can make it harder to determine which ontologies to go to when finding or requesting terms. Always favor ontologies with clear scope. We are actively working to fix scope problems in OBO:
Avoid shadow terms
Many ontologies mint “shadow concepts”. For example OBA may have the core concept of “blood pressure”. Another ontology may have a random mix of “datum” or “measurement” classes, e.g. “blood pressure datum”. Avoid these terms. Even if you want to describe a blood pressure measurement, just use the core concept. The fact that the concept is deployed in the context of a measurement should be communicated externally, e.g. in the data model you are using, not by precoordination.
By using the core concept you increase the overall coherency and connectivity of the information you are describing. Many shadow terms are in ‘application ontologies’ and are not properly linked to the core concepts.
Note that my own recommendations may not be aligned with the broader OBO community – see this ticket for further discussion.
Exercise due diligence in looking for the terms
Make sure the concept you need is definitely not present in the ontology before requesting.
Learn how to use ontology search tools appropriately. I recommend:
- OLS for OBO and selected other ontologies
- Bioportal for searching the broadest set of ontologies
Bioportal has the broadest collection (including all of OBO), but there is less of a filter. Ontologies may not be open. However, being in OBO is not a guarantee of quality, and there may be good reasons to use a non OBO ontology.
Expert ontologists may like to use Ontobee, but there are many things to be aware of before using it:
- The update frequency is less than OLS
- It does not display the partonomy, which is crucial for understanding many of the ontologies we work on
- Overall it presents a more ‘close to the base metal’ OWL model. This is fine for ontologists, but it is better not to point biologists here
If you are not experienced with ontologies and in particular OBO, there are many things that can potentially trip you up. Don’t be afraid to ask about these; many people in the same shoes as you have been confused.
Potential confusion point: Some ontologies import other ontologies, or parts of other ontologies
This means e.g. if you are searching for a chemical entity like ‘nitrate’ you may find results “in” ENVO, because ENVO imports a portion of CHEBI.
Bioportal does a good job of separating out the core concept/IRI from imports of it:
In these cases, the ID is the same, but you should be aware what the true parent ontology is.
OLS also does a good job of collapsing these:
Potential confusion point: Some ontologies replicate parts of other ontologies
This is distinct from the import case above. In this case, one ontology may intentionally or unintentionally duplicate concepts from another. For example, the OMIT ontology copies large amounts of MESH and gives these new IDs. In these cases you should identify the authority and use the ID from there.
For example, a search for cockatoos in Bioportal shows the MESH concept, MESH IDs reused in other ontologies, and unlinked concept IDs that presumably represent the same concept.
Searching semantic web schemas
To search for terms in semantic web standards, https://lov.linkeddata.es/ is probably the best option.
Note that sometimes you need to do more work than just entering a string. Most ontology search tools won’t do stemming etc. I recommend searching for similar concepts and exploring the neighborhood. Understanding the structure of the ontology will help you make a better request.
For example, imagine you are looking for a concept ‘bicycle’ in a product ontology. Just because nothing comes back in a search for ‘bicycle’ doesn’t mean the concept isn’t there. It may be under a synonym like ‘bike’. Explore the ontology. Look for similar concepts like unicycle or car. If you see that the ontology has a class vehicle, subclasses like 4-wheeled, 2-wheeled, and 1-wheeled, but doesn’t have anything under 2-wheeled you can be confident the concept is missing.
Don’t get too hung up on this if you are not a domain scientist and don’t understand the concepts in the ontology, but it is usually a good idea to do this kind of initial exploration.
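The bicycle example can be made concrete with a toy ontology. Everything here is made up for illustration (the EX: IDs, labels, and synonyms), and the search function is a deliberately naive exact match, standing in for a search tool that does no stemming.

```python
# Toy product ontology; all IDs, labels, and synonyms are hypothetical.
TERMS = {
    "EX:1": {"label": "vehicle", "synonyms": [], "parent": None},
    "EX:2": {"label": "2-wheeled vehicle", "synonyms": [], "parent": "EX:1"},
    "EX:3": {"label": "1-wheeled vehicle", "synonyms": [], "parent": "EX:1"},
    "EX:4": {"label": "unicycle", "synonyms": ["monocycle"], "parent": "EX:3"},
    "EX:5": {"label": "pedal cycle", "synonyms": ["bike"], "parent": "EX:2"},
}


def search(query: str) -> list:
    """Exact, case-insensitive match against labels and synonyms (no stemming)."""
    q = query.lower()
    return [
        term_id
        for term_id, term in TERMS.items()
        if q == term["label"].lower() or q in (s.lower() for s in term["synonyms"])
    ]


def children(term_id: str) -> list:
    """List the direct subclasses of a term, to explore its neighborhood."""
    return [tid for tid, term in TERMS.items() if term["parent"] == term_id]


print(search("bicycle"))   # nothing found by this exact string
print(search("bike"))      # found via a synonym
print(children("EX:2"))    # exploring under '2-wheeled vehicle' finds it too
```

Here the direct search fails, but either trying a synonym or walking down from a sibling concept such as unicycle reveals that the concept is already present, so no request is needed.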
Mapping or searching for sets of terms
If you have a set of terms to map and you want to get a sense of coverage in different ontologies the parallel tools are:
- Bioportal annotator
This topic deserves its own post, so I won’t go into more detail here.
Use GitHub for making requests
If you are sure the ontology doesn’t have the concept you need, you will want to make a request.
In general, always use GitHub for making the request. If you know the email or Slack handle of someone with the ability to make terms, it may be tempting to contact them directly, but it’s better to use GitHub. This makes the process transparent, and you will be helping people who come after you.
If you cannot find a GitHub repo for the ontology/standard, this is a bad smell and maybe you should reconsider whether you want to use this ontology. Having a public repo (GitHub/GitLab/Bitbucket/etc) is a requirement of OBO. Note this same advice applies to software.
For OBO ontologies, you can easily find the GitHub issue tracker for any ontology via http://obofoundry.org
For some ontologies, there may be specialized term request systems (PRO, CHEBI). Go by the norms of the particular ontology, but my own preference is always to use GitHub, for the reasons stated above.
Search existing issues before making a request. It may be the case that others have requested the same term before you. Maybe it is out of scope, and the term is in a different ontology. Searching the issue tracker should reveal this (this is why it is good to always stay within GitHub when making requests rather than using private communication). It may be the case that the ontology refuses to grant requests for reasons that are arbitrary. In this case there may be an issue discussing the pros and cons. Read this and add your voice, but in a constructive fashion. Maybe a simple up-vote is sufficient. Or a comment like “similar to Mary, I also would find a term “Hawaiian pizza” very valuable”.
Always link the issue that you are working on to the term request. In GitHub, this is simply a matter of putting the URL in. You can either link from the request issue back to the parent issue, or from the parent issue to the request.
(Note you will always be working to an issue. If you’re not, stop what you are doing and make one!)
You should also search the issue tracker to see if others have made the same request – avoid making duplicate issues. But you can still comment on existing ones. However, avoid tacking on or extending the scope of an existing issue. If there is a similar issue but your request is different, link to the current issue (#NUMBER – you should know github conventions), e.g. “My request is similar to fred’s in #1234, but I need a foo not a bar”.
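If you want to check a tracker for existing requests from a script rather than the web UI, a query against GitHub's issue search endpoint can be built as below. This is a sketch only: the repository name example-org/pizza-ontology is a placeholder, and the code just constructs the URL without sending a request.

```python
from urllib.parse import urlencode


def issue_search_url(repo: str, text: str) -> str:
    """Build a GitHub issue-search URL for free text within one repository."""
    # 'repo:' and 'is:issue' are GitHub search qualifiers; the rest is free text.
    query = f"repo:{repo} is:issue {text}"
    return "https://api.github.com/search/issues?" + urlencode({"q": query})


# Placeholder repository; substitute the tracker listed for the ontology.
url = issue_search_url("example-org/pizza-ontology", "Hawaiian pizza")
print(url)
```

Leaving out a state qualifier means closed issues are matched as well, which matters because a closed ticket often records why a term was declared out of scope.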
Read the CONTRIBUTING.md file
An increasing number of ontologies and other modeling artefacts include this in their repo. It should include guidance for people like you that is more specialized than this generalized guide. Read it!
Provide as much help as possible in your request
Remember, many ontologies are under-funded and requests are often fulfilled by our collaborators. Provide as much help as possible to them. If you are not knowledgeable about the domain, that is OK, but you can still provide context about your project.
e.g. “hey, I have been asked to provide a UI selection box with different pizza types. My boss gave me this list of ten pizza types but I don’t eat pizza, and I’m not sure how to map them to your ontology, and I may have some duplicates, it looks like you don’t have ‘Hawaiian’, but I’m wondering if maybe this ‘pineapple and ham’ is the same thing, or is there some subtle difference? If it’s the same, shouldn’t there be a synonym added?”
If you have been given a spreadsheet, you can provide a link to it. If you are mapping a data table, provide a link, or selected examples, as this can help orient the person fulfilling the request. Remember, people aren’t mind readers.
For example, making a ticket where you say “I need you to add HSap” is not helpful. But if you can say the ‘HSap’ value appears in a column called species, and the other values are ‘MMus’ and ‘DMel’, this gives the ontology developer the context they need, avoiding confusing back and forth on the issue tracker.
Analogies are useful: If you can find analogous terms use these as examples.
If you have a domain scientist handy, you may want to engage them before making requests – e.g. if they can provide definitions.
There is no rule as to whether to make one ticket/issue with multiple terms, or one ticket per term. If you think each term is nuanced and requires individual discussion, make separate tickets. If you are unsure then make an initial exploratory ticket. It’s usually OK to make a ticket for a question (GitHub even has a category/label for this).
Avoid making 100 requests only to discover that all of your requests are out of scope, requiring tedious closing of multiple tickets.
Be proactive and make pull requests
Even ontologies that have dedicated funding are under-resourced. You can help a lot by offering to make pull requests. If the ontology is a well-behaved OBO ontology there should be a clear procedure for doing this (if the ontology was made with ODK or follows ODK conventions, the file you should edit is src/ontology/foo-edit.owl in the repo).
Note that editing the OWL file usually entails using Protege. Basic Protege skills are worth learning. Normally this would not be required of most users, but in my group having basic Protege driving skills is useful and strongly encouraged.
In some cases you don’t need to edit the file – the ontology SOP may dictate editing a TSV in github or google sheet, with this compiled to OWL. Consult the contributor docs for that ontology (and if these are lacking, gently suggest ways to improve this to make it easier for those who come after you to contribute).
You will likely need to be added to an idranges file – again if the ontology follows standard conventions this will be obvious.
It is a good idea to check if an ontology is welcoming of PRs. This should be obvious from the pulls tab in GitHub. In general most ontologies should be, but some ontology groups may have trouble adapting to the times, may still be unfamiliar with PRs, and may prefer issues. Also, in many cases the addition of terms is best done by an expert.
In all cases, use your best judgment!
Follow templates where possible
Many repos are set up with GitHub issue templates. If the repo you are requesting in does not have them, you may want to gently suggest adding them (or better yet, make a PR for this, using ODK as a guide). If you are reading this document then you likely have more GitHub-fu than the ontologist/curator fulfilling the request, so you can be helpful!
In some cases, ontologies may have set up a templating system (robot or dosdp). You can be super-helpful and follow the system that has been set up. In some cases this means filling in a predefined google sheet (e.g. with columns for name, definition, parent). In some cases you can make a PR on a TSV in the repo. This is an evolving area, so stay tuned. If the process is not clear there are people in the group with expertise who can help.
If all else fails, make your own ‘application ontology’
Sometimes there may simply be no ontology fit for purpose. Or existing ontologies may simply be unable to fulfil your request. It may be the case that there is an ontology called ‘pizza ontology’ squatting this conceptual space in OBO, but they may fail to grant your term requests for arbitrary reasons (“we don’t add Hawaiian pizzas, as we object in principle to putting pineapple on pizza”), or have unrealistic timelines (“we have a pizza modeling discussion set for 2 years from now at the annual pizza ontology conference, we may consider putting your request before the committee then, but it is unlikely to be ratified for 4 years”). They may make it impossible to add terms by being ontological perfectionists (“we will add your pizza if you add perfect OWL logical axiomatization describing topological and gastronomic properties of the pizza according to our undocumented design practice”). They may also simply model things incorrectly (“thanks for your request. We have added ‘Hawaiian Pizza’ as a subclass of ‘Hawaii’. Aloha!”).
In general this won’t happen, especially with well-behaved OBOs, but there may be some holdouts! Be patient, and offer to make PRs (see above).
In some cases, such as those above, you may be justified in making your own ontology, using tools like ODK and ROBOT. Consult first! And never do this without first making requests.
In some cases you don’t need to make a new ontology; you can just create stubs. E.g. for a KG ingest, you can ‘inject’ something into the biolink-model, e.g. biolink:Pizza. There are various downsides to the injection approach, and it may be better to use a different namespace. Depending on the project context it may or may not matter if the injected type resolves. Regardless, when doing this, add a comment to your code with a link to the ticket.
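A stub of this kind might look like the following in an ingest script. This is a hedged sketch: the myproj namespace, the STUB_CATEGORIES table, and the ticket URL are all placeholders, and a separate namespace is used here rather than injecting into biolink.

```python
# Stub categories for types that no ontology or model provides yet.
# Tracking ticket (placeholder URL, replace with the real term request):
#   https://github.com/EXAMPLE-ORG/EXAMPLE-ONTOLOGY/issues/NNNN
# The "myproj" namespace is hypothetical; it keeps the stub from being
# mistaken for an official class. Remove the stub once the request is granted.
STUB_CATEGORIES = {"Pizza": "myproj:Pizza"}


def category_for(type_name: str, official: dict) -> str:
    """Use the official category when one exists, else fall back to a stub."""
    if type_name in official:
        return official[type_name]
    return STUB_CATEGORIES[type_name]


official_categories = {"Gene": "biolink:Gene"}
print(category_for("Gene", official_categories))
print(category_for("Pizza", official_categories))
```

Keeping all stubs in one table, each with its ticket link beside it, makes it easy to find and retire them later.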
Be bold and be collaborative
Whether you are making or fulfilling a request, you are all part of the same larger community of people working to make data more computable. Be as constructive and as helpful as possible, but also don’t hold back or be shy. Ultimately the ontology is there to serve you. If it does not serve your need, or is too confusing in some aspect, then it’s likely the same is true for others.
Overall the processes described above may seem overly complex or onerous. In fact they are not so different from analogous processes such as getting features into a piece of open source software.
Over the years there have been various proposals and implementations of ‘term brokers’ which act as both triage and a place to get an identifier for a term instantly. An example implementation is TermGenie.
One reason why term brokers have not taken over as a way of getting terms into ontologies over the github procedure above is that there is a strong tendency to accumulate ontological debt (akin to technical debt). It’s easy to stick a bunch of junk terms into an ontology. But maintaining these and dealing with the downstream costs of including these can be very high.
This topic needs a blog post all of its own, stay tuned…