A common structure found in many ontologies is the single child pattern. I consider this an anti-pattern, to be avoided.
The most common form is with is_a children (i.e subClassOf between two named classes), but the anti-pattern also applies to other relationship types. We can formalize the single child subclass pattern as:
C1 direct SubClassOf P
NOT exists some C2, such as C2 is a direct SubClassof P, and C2 != C1
Depicted as:
C1 is the only child of P
One reason this is an anti-pattern is that it is inherentlyincomplete. i.e. there must be instances of P that are not instances of C1 (otherwise why have two classes – see the reflexive subclass anti-pattern). Following a principle of reasonable completeness (see open world post) we should include sibling terms where appropriate.
Here is a concrete example from a fictional ontology:
A subset of a fictional ontology, where only one subtype of flu (mild flu) is populated
Here there is a single specialization of a disease term, based on severity.
Another example (adapted from an existing ontology):
A fictional example, where a chemical assay is subtyped by a property of that chemical pool (its concentration)
Here there is a specialization of the assay term based on a property of the pool or iron.
A different example (adapted from an existing ontology):
An adapted example, where a specific assay has a single subtype, differentiated by location of sampling
This kind of structure is not uncommon in many OBO ontologies. And there is a reasonable defense: we have limited ontology editing resources, and many terms are added on request. Curators are free to request a more specific term if they feel it is necessary for annotating (e.g a disease that has as phenotype mild flu) but they may have no need for the implicit sibling terms. And ontology developers see no need to do additional work they are not requested to do.
However, this leads to lopsided ontologies that are often confusing for people not deeply immersed the development of these ontologies. It is hard to tell if omissions are intentional or unintentional. And the practice of instantiating single children has bad downstream effects of annotation, this is something we have frequently observed over multiple ontologies.
Consider the flu example above. A new annotator may want to annotate a disease that has a severe flu phenotype. They may make an implicit assumption that choosing the parent term ‘flu’ would communicate ‘severe flu’; if it was mild, they would have selected ‘mild flu’. But this is not the explicit assertion they are making – they are making a closed world assumption that doesn’t hold for the logic of the ontology. While some of this can be obviated with training, and ensuring curators request specific sibling terms rather than trying to let the parent do the work. But many single-child cases are in fact more nuanced that the flu example.
Instead, it is better to take a more prospective approach to ontology development, try and anticipate in advance terms that may be required, and populate them in a balanced fashion – this will result in more balanced annotations. It is much easier to do this if you follow OWL axiomatization and have a formal design pattern system such as DOSDPs. In fact you can use such a system to automate detection of single-child patterns and imbalances.
While it is trivial to detect single is-a children using a SPARQL query encoding the pattern above, it won’t capture the more nuanced cases of single children by a given axis of classification.
Consider this made-up ontology structure, where we have a parent class with only two subclasses explicitly populated:
made-up example, the class ‘animal’ has only two child classes populated, one by a taxonomic axis (vertebrate) another by a property of the animal (its edibility)
In this particular example, both are also single-children via an axis of classification. While on a gross structural level the lower terms each has a sibling, each sibling is clearly classified differently. The first is classified along classical taxonomic/evolutionary descent terms, the second is by a different property.
The above example is made-up and would strike most people as bad design (even if strictly logically coherent). Where is the concept of inedible animals, where are the invertebrates (and indeed edible and inedible vertebrates and invertebrates)?
But in fact this antipattern plagues most OBO ontologies. These are harder to spot, especially if the ontology is unaxiomatized.
For example:
iron assay with two child terms along different axes
Structurally this doesn’t look like a single-child anti-pattern, but it is in fact an example of a single-child-by-axis pattern. And if there are no subclasses, this is an instance of the ragged lattice pattern, which I will cover in a future post.
While these can’t be detected by straightforward SPARQL queries, if you use a system such as DOSDPs you can use this to analyze your ontology for these structures, and proactively guard against them.
While the above examples all focus on subClassOf/is-a relations, the same guidelines apply to other edge labels. For example, if an anatomy ontology only listed a single part of the head (such as the mouth):
An incomplete ontology with only one part child for head)
Most people would consider this poor design. While of course it’s true it’s unreasonable to expect ontologies to be complete, the reasonable completeness principle should apply, and if for some reason this not unattainable, at the very least the incompleteness needs to be clearly documented.
In closing, as ontology developers it can be tempting to ignore these single child cases – we have limited resources, and must balance this with being able to provide users with terms they request, which may lead to spottiness and incompleteness. But ignoring these just leads to more work downstream, and in some cases it can lead to incomplete annotation. So avoid single is-a children!
In the first part of this series, I covered the use of disjointness axioms to make it easier to detect logical errors in your ontologies, and how you could use robot reason as part of an ontology release workflow to avoid accidentally releasing incoherent ontologies.
In the second part I covered unintentional inference of equivalence axioms – something that is not inherently incoherent, yet is usually unintended, and how to configure robot to catch these.
In both cases, the standard operating procedure was:
Detect the incoherency using robot
Diagnose the incoherency using the “explain” feature in Protege
Repair the problem my removing or changing offending axioms, either in your ontology (or if you are unlucky, the issue is upstream and you need to coordinate with the developer of an upstream ontology)
In practice, repairing these issues can be very hard. This is compounded if the ontology uses complex hard-to-mentally-reason-over OWL axioms involving deep nesting and unusual features, or if the ontology has ad-hoc axioms not conforming to design patterns. Sometimes even experienced ontology developers can be confounded by long complex chains of axioms in explanations.
But never fear! Help is at hand, there are many in the OBO community who can help! I always recommend making an issue in GitHub as soon as you detect an incoherency. However, you want to avoid having other people do the duplicative work of diagnosing the incoherency. They may need to clone your repo, fire up Protege, wait for the reasoner to sync, etc. You can help people help you by providing as much information up-front as possible.
Previously my recommendation was to paste a screenshot of the Protege explanation in the ticket. This helps a lot as often I can look at one of these and immediately tell what the problem is and how to fix it.
But this was highly imperfect. Screenshots are not searchable by the GitHub search interface, they are not accessible, and the individual classes in the screenshot are not hyperlinked.
A relatively new feature of robot is the explain command, which allows you to generate explanations without firing up Protege. Furthermore, you can generate explanations in markdown format, and if you paste this markdown directly into a ticket it will render beautifully, with all terms clickable!
A recent example was debugging an issue related to fog in ENVO. As someone who lives in the Bay Area, I have a lot of familiarity with fog.
Both the relations (object properties) and classes are hyperlinked, so if you want to find out more about rime just click on it.
In this case the issue is caused by the use of results in formation of where the subject is a material entity, whereas it is intended for processes. This was an example of a “cryptic incoherency”. It went undetected because the complete set of RO axioms were not imported into ENVO (I will cover imports and their challenges in a future post)
The robot explain command is quite flexible, as can be seen from the online help. I usually use it set to report all incoherencies (unsatisfiable classes plus inconsistencies). Sometimes if you have an unsatisfiable class high up in the hierarchy (or high up in the existential dependency graph) then all subclasses/dependent classes will be unsastifiable. In these cases it can help to hone in on the root cause, so the “mode” option can help here.
In the first post in this series on edge properties, I outlined the common reification pattern for placing information on edges in RDF and knowledge graphs, and how this places nicely with RDFStar (formerly RDF*).
There are alternatives to reification/RDF* for placing information on edges with the context of RDF, in this post I will deal with one, the Singleton Property Pattern (SPP), described in Don’t Like RDF Reification? Making Statements about Statements Using Singleton Property (2015), Nguyen et al (PMC4350149).
The idea here is to mint a new property URI for every statement we wish to talk about. Given a triple S P O, we would rewrite as S new(P) O, and add a triple new(P) singlePropertyOf P.
Some of the proposed advantages of this scheme are discussed in the paper.
A variant of the SPP makes the new property a subPropertyOf the original property. I will call this SPP(sub). This has the advantage that the original statement is still entailed.
Then if we assume RDFS entailment, the query becomes simpler:
SELECT ?x ?y WHERE { ?x :interacts_with ?y }
This is because the direct interacts_with triple is entailed due to the subPropertyOf axiom.
In the discussion section of the original paper the authors state that the pattern is compatible with OWL, but don’t provide examples or elucidation, or what is meant by compatible. There is a website that includes an OWL example.
Unfortunately, problems arise if we want to use this same pattern with other axiom types, beyond OWL Object Property Assertions; for example, SubClassOf, EquivalentClasses, SameAs. This is true for both the original SPP and the subPropertyOf variant form.
These problems arise when working in OWL2-DL. Use of OWL2-Full may eliminate these problems, but most OWL tooling and stacks are built on OWL2-DL, so it is important to be aware of the consequences.
OWL-DL is incompatible with the SPP
Unfortunately the SPP pattern only works if the property used in the statement is an actual OWL property.
In the official OWL2 mapping to RDF, predicates such as owl:equivalentClass, owl:sameAs, owl:subClassOf are syntax for constructing an OWL axiom of the corresponding type (respectively EquivalentClasses, SameIndividual and SubClassOf). These axiom types have no corresponding properties at the OWL-DL level.
We do this by making a fresh property (here, arbitrarily called equivalentClass-1) that is the SPP form of owl:equivalentClass. For comparison purposes, we make a normal equivalence axiom between P1 and P2, and we also try an equivalence axiom between P3 and P4 using the SPP(sub) pattern:
PREFIX : <http://example.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
:P1 a owl:Class .
:P2 a owl:Class .
:P3 a owl:Class .
:P4 a owl:Class .
:P1 owl:equivalentClass :P2 .
:P3 :equivalentClass-1 :P4 .
:equivalentClass-1 rdfs:subPropertyOf owl:equivalentClass .
Note that this does not have the intended entailments. We can see this by doing a DL query in Protege. Here we use HermiT but other DL reasoners exhibit the same features.
For the normal equivalence axiom, we get expected entailments (namely that P2 is equivalent to P1):
However, we do not get the intended entailment that P3 is equivalent to P4
We can get more of a clue as to what is going on by converting the above turtle to OWL functional notation, which makes the DL assertions more explicit:
Note that the SPP-ized equivalence axiom has been translated into an owl annotation assertion, which has no logical semantics. However, even if we force the SPP property to be an object property, we still lack the required entailments. In both cases, we induce owl:equivalentClass to be an annotation property, when in fact it is not a property at all in OWL-DL.
The SPP(e) pattern may work provide the intended entailments when using OWL-Full, but this remains to be tested.
Even if this is the case, for pragmatic purposes most of the reasoners in use in the life science ontologies realm are OWL2-DL or a sub-profile like EL++ or RL (e.g. Elk, HermiT).
Use of SPP with plain RDF
If you are only concerned with plain RDF and not OWL, then the above may not be a concern for you. If SPP’s work for your use case, go ahead. But bear in mind that this pattern is not as widely used, and in addition to OWL issues, there may be other reasons to avoid – for example, proliferating the number of properties in your graph may confuse humans and break certain built in assumptions made by some tools. Overall my recommendation would be to avoid the SPP regardless of your use case.
It’s unfortunate that something as basic as placing information on an edge leads to such a proliferation of de-facto standards in the RDF world. I think this is one reason why graph databases such as Neo4J have the edge – they make this much easier, and they don’t force poor users to mentally reason over confusing combinations of W3C specifications and understand esoteric things such as the difference between OWL-DL and OWL-Full.
Hopefully RDFStar will solve some of these issues.
Ontologies, knowledge models, and other kinds of standards are generally not static artefacts. They are created to serve a community, which likely includes you, and they should respond dynamically to serve that community, where resources allow.
The content of many ontologies in OBO such as the Gene Ontology are driven by their respective curator communities. Ontology editors make terms and whole ontology branches prospectively in anticipation of needs, but they also make terms and changes in response to curator needs. New terms can also be requested by data modelers, data engineers, and data scientists, for example, to map categorical data in a dataset in order to allow harmonization and cleaning of data. In many cases, the terms may exist already, they may be hard to find, or they may be spread across a distribution of ontologies, and this may be confusing, especially if you are not familiar with the area.
I wrote this guide primarily with the data engineer audience in mind, as this community may be less familiar with norms and tacit knowledge around the ontology development lifecycle. However, much of what I say is applicable to curators as well. I also wrote this document originally for members of my group, who are expected to contribute back to OBO and the work of ontology developers. Some of the recommendations may therefore seem a little onerous in places, but they should still hopefully be useful and adaptable to a broader audience. And in all cases, for open community ontologies, it is worth bearing in mind the maxim that ‘you get out what you put in’.
As a data scientist, I need terms describing oxygen-requirement traits, so I can combine tabular microbial trait data to predict traits from other features
Alternate scenario: building a microbial Knowledge Graph (see this issue)
As a database developer, I need terms for sequence variants, so I can map categorical values in my database making it FAIR
As a knowledge graph builder, I need relations from the Relation Ontology or Biolink, so I can standardize edge labels in my graph
As a GO ontology developer, I need terms from the cell ontology, so I can provide logical defining axioms for terms in the cell differentiation branch
As a microbiome scientist, I need terms from ENVO/PO, so I can fill in MIxS-compliant environmental fields when I submit my sample data, making it FAIR
As an environmental genomics standards provider, I need terms from ENVO, so I can map enumerated values/dropdowns to an ontology when developing the MIxS standard
As a data modeler / standards provider, I need SO terms for genomic feature types, to define a value set for a genomics exchange format (e.g. GFF)
As a schema developer, I need terms for describing properties of sequence assemblies, e.g. number of aligned reads, N50, in order to make my sequencing schema FAIR
These scenarios encompass a range of different kinds of person with varying levels of expertise and commitment. The primary audience for this document is members of my group and the projects we are involved with (GO, Monarch, NMDC, Translator) but many of the recommendations will apply more broadly. But we would not expect the average scientist who is submitting a dataset to engage at the same level through GitHub etc (see the end of the document for discussion on approaches for making the overall process easier).
20ish simple rules for selecting and requesting ontology terms
Be a good open science citizen
The work you are doing is part of a larger open science project, and you should have a community minded attribute.
When you request terms from ontologies and you provide information to help you should be microcredited, e.g. your orcid will be associated with the term. Remember that many efforts are voluntary or unfunded, and people are not necessarily paid to help you. Provide help where you can, provide context when making requests, and any background explicit or tacit knowledge that may help.
Always be respectful and appreciative when interacting with providers of terms or other curators. Follow codes of conduct.
Use the appropriate ontology or standard: avoid pick-and-mix
Depending on the context of your project, there may be mandated or recommended standards or ontologies. These may be explicit or implicit.
If you are performing curation, it is likely that the ontology you use is fixed by your curation best practices and even your tools. For example, GO curators (obviously) use GO. But for other purposes it may not be obvious which ontology to use (and even with GO, curators have a choice of ontologies for providing additional context as extensions or in GO-CAMs). There are a large number of them, with confusing overlaps, and lots of tacit community knowledge that may not be immediately available to you.
Some general guidelines:
Look in the appropriate place, depending on what kind of term you need
If you need a classification term or a descriptor, then use an ontology
If you need something like a gene or a variant “term” then an ontology may not be appropriate, use the appropriate database instead, with caveats
If you need a property to describe a piece of data, then you may need to look in existing semantics schemas, e.g. schemas encoded in RDFS, a shape language such as ShEx, or LinkML
When looking for an ontology term, favor OBO over non-OBO resources
Sometimes better coverage is only available outside OBO – e.g. EDAM has a lot more terms for describing bioinformatics software artefacts, and EDAM is not in OBO. But it is still good to engage owners of the appropriate OBO ontology
When a non-OBO ontology is selected, use the OBO principles and guidelines to help evaluation – e.g is the ontology open? Does it follow good identifier lifecycle management?
The OLS uses a broader selection of ontologies that is narrower than what is in Bioportal. In my experience the non-OBO ontologies they include are quite pragmatic choices in many situations, e.g EDAM.
Even within OBO, there may still be confusion as to which ontologies to use, especially when many seem to have overlapping concepts, and scope may be poorly defined.
An example is the term used to describe an organism’s core metabolism for our microbial knowledge graph KG-Microbe, with multiple OBO contenders.
Always consult http://obofoundry.org to glean information about the ontology. This is always the canonical unbiased source, and includes curated up-to-date metadata
A crucial piece of metadata that is in OBO is the ontology status – you must avoid using ontologies that are obsoleted, and you should avoid using ontologies that are marked inactive
Look at the ‘usages’ field in OBO. Has the ontology been used for similar purposes as what you intend? If the ontology has no usages, this is a worrying sign the ontology was made for ontologists rather than practical data annotators such as yourself (but note that some ontologies may be behind in curating their usages into OBO)
Look at the scope of the ontology, as defined on the OBO page. Is it well defined and clear? If not, consider avoiding. Is your term in scope for this ontology? If not then don’t use terms from the ontology just because the labels match.
Is the ontology an application ontology, i.e. an ontology that is not intended to be a reference for terms within a domain? If so it may not be fit for your intended use.
Consult others if in doubt. Many people in the group or in our funded projects are involved with specific ontologies.
You should be on the OBO slack, this is a good place to get advice.
Favor ontologies we are actively involved with or that follow similar data models and principles
Favor more active ontologies. OBO marks inactive ontologies with metadata tags that are clearly displayed, but you should still check
Is the github project active?
Are there many tickets that are never answered?
If you suspect an ontology is not active but it is not marked as such, be a good citizen and raise this on the OBO tracker
Use precedence – see what has been done previously in similar projects
We are actively working on projects like OBO Dashboard and on improving OBO metadata to help ontology selection
For any candidate term
Is it obsoleted? If so avoid, but look at the metadata for replacements
Does it have a definition? A core OBO principle is that reasonable attempts must be made to define terms in an ontology
Is there a taxonomic scope? Always use the appropriate taxonomic scope. If an uberon term is restricted to vertebrates, it is valid to use for humans. But if an ontology or term is designed for use with mouse, it may not be valid to apply terms for humans
Have others used the term?
If you have formal ontology training, avoid over-ontologizing in your thought processes for selection. See for example the section below on shadow terms.
Avoid terms that seem over-ontologized; e.g. that have strange labels a domain scientist would not understand
If you are looking for terms to categorize nodes or edges in a Knowledge Graph:
For most of our projects, KGs should conform to the biolink-model, so this is the appropriate place to search
Note that biolink still leverages OBO and standard bioinformatics databases for the nodes themselves; biolink classes and predicates are used for the node categories and edge labels
For environmental samples use GSC MIxS terms for column headers
The biolink model will serve as a canonical guide for what kinds of IDs should be used for any kind of entity. The SOP is to find the category of relevance in biolink, and then examine the id_prefixes field. This indicates the resources that provide identifiers that are valid to use for that entity type, in priority order.
For example, for BiologicalProcess you will see on the page and in the yaml
id_prefixes:
– GO
– REACT
– MetaCyc
– KEGG.MODULE ## M number
Figure: portion of Biolink page for the data class BiologicalProcess. The favored ID prefixes are shown
This means that GO is always our favored ontology / ID space for representing a biological process. This followed by reactome, then metacyc, then kegg. Of course, GO and Reactome serve different purposes, with reactome pathway IDs classified using GO IDs. If you disagree with this ordering you can make a PR on biolink (or you can also make a project-specific extensions/contraction of biolink).
Avoid a pick-and-mix approach. It is better to draw like terms from the same ontology, this ensures overall coherence, and allows reasoning to be better leveraged.
If you are creating a LinkML enum, a good rule of thumb is that all ‘meaning’ annotations should come from the same ontology. Of course, this may not always be the case.
For example, the enum for sex_chromosome_type in chromo is all drawn from GO:
SexChromosomeType:
description: >-
what type of sex chromosome
permissible_values:
X:
meaning: GO:0000805
Y:
meaning: GO:0000806
W:
meaning: GO:0000804
Z:
meaning: GO:0000807
Similarly for the gp2term relationship field in GPAD, these are all drawn from RO:
(note that part_of, BFO:0000050, is actually in RO, not BFO, despite the ID space)
However, entity type is drawn from SO, GO, and PR.
It is a bad smell to have a mix of different ontologies for what should be a set of similar entities, e.g
Some ontologies are themselves ad-hoc in their scoping, which can make it harder to determine which ontologies to go to find terms or request terms. Always favor ontologies with clear scope. We are actively working to fix scope problems in OBO:
Many ontologies mint “shadow concepts”. For example OBA may have the core concept of “blood pressure”. Another ontology many have a random mix of “datum” or “measurement” classes, e.g. “blood pressure datum”. Avoid these terms. Even if you want to describe a blood pressure measurement, just use the core concept. The fact that the concept is deployed in the context of a measurement should be communicated externally, e.g. in the data model you are using, not by precoordination.
By using the core concept you increase the overall coherency and connectivity of the information you are describing. Many shadow terms are in ‘application ontologies’ and are not properly linked to the core concepts.
Note that my own recommendations may not be aligned with the broader OBO community – see this ticket for further discussion.
Exercise due diligence in looking for the terms
Make sure the concept you need is definitely not present in the ontology before requesting.
Learn how to use ontology search tools appropriately. I recommend:
OLS for OBO and selected other ontologies
Bioportal for searching the broadest set of ontologies
Bioportal has the broadest collection (including all of OBO), but there is less of a filter. Ontologies may not be open. However, being in OBO is not a guarantee of quality, and there may be good reasons to use a non OBO ontology.
Expert ontologists may like to use Ontobee, but there are many things to be aware of before using it:
The update frequency is less than OLS
It does not display the partonomy, which is crucial for understanding many of the ontologies we work on
Overall it presents a more ‘close to the base metal’ OWL model. This is fine for ontologists, but it is better not to point biologists here
If you are not experienced with ontologies and in particular OBO, there are many things that potentially trip you up. Don’t be afraid to ask about these — many people in the same shoes have you have been confused.
Potential confusion point: Some ontologies import other ontologies, or parts of other ontologies
This means e.g. if you are searching for a chemical element like ‘nitrate’ you may find results “in” ENVO, because ENVO imports a portion of CHEBI.
Bioportal does a good job of separating out the core concept/IRI from imports of it:
In these cases, the ID is the same, but you should be aware what the true parent ontology is.
OLS also does a good job of collapsing these:
Ontobee:
Potential confusion point: Some ontologies replicate parts of other ontologies
This is distinct from the import case above. In this case, one ontology may intentionally or unintentionally duplicate concepts from another. For example, the OMIT ontology copies large amounts of MESH and gives these new IDs. In these cases you should identify the authority and use the ID from there.
For example, a search for cockatoos in Bioportal shows MESH, MESH IDs reused elsewhere, as unlinked concept IDs presumably showing the same concept.
Note that sometimes you need to do more work than just entering a string. Most ontology search tools won’t do stemming etc. I recommend searching for similar concepts and exploring the neighborhood. Understanding the structure of the ontology will help you make a better request.
For example, imagine you are looking for a concept ‘bicycle’ in a product ontology. Just because nothing comes back in a search for ‘bicycle’ doesn’t mean the concept isn’t there. It may be under a synonym like ‘bike’. Explore the ontology. Look for similar concepts like unicycle or car. If you see that the ontology has a class vehicle, subclasses like 4-wheeled, 2-wheeled, and 1-wheeled, but doesn’t have anything under 2-wheeled you can be confident the concept is missing.
Don’t get too hung up on this if you are not a domain scientist and don’t understand the concepts in the ontology, but it is usually a good idea to do this kind of initial exploration.
Mapping or searching for sets of terms
If you have a set of terms to map and you want to get a sense of coverage in different ontologies the parallel tools are:
Zooma
Bioportal annotator
This topic is deserving of its own post, so I won’t go into more detail here
If you are sure the ontology doesn’t have the concept you need, you will want to make a request
In general, always use GitHub for making the request. If you know the email or slack of someone with ability to make terms it may be tempted to contact them directly but it’s better to use GitHub. This makes the process transparent, and you will be helping people who come after you.
If you cannot find a GitHub repo for the ontology/standard, this is a bad smell and maybe you should reconsider whether you want to use this ontology. Having a public repo (GitHub/GitLab/Bitbucket/etc) is a requirement of OBO. Note this same advice applies to software.
For OBO ontologies, you can easily find the GitHub issue tracker for any ontology via http://obofoundry.org
For some ontologies, there may be specialized term request systems (PRO, CHEBI). Go by the norms of the particular ontology, but my own preference is always to use GitHub, for the reasons stated above.
Search existing issues before making a request. It may be the case that others have requested the same term before you. Maybe it is out of scope, and the term is in a different ontology. Searching the issue tracker should reveal this (this is why it is good to always stay within github when making requests rather than private communication). It may be the case that the ontology refuses to grant requests for reasons that are arbitrary. In this case there may be a issue discussing the pros and cons. Read this and add your voice, but in a constructive fashion. Maybe a simple up-vote is sufficient. Or a comment like “similar to Mary, I also would find a term “Hawaiian pizza” very valuable”.
Always link the issue that you are working on to the term request. In GitHub, this is simply a matter of putting the URL in. You can either link from the request issue back to the parent issue, or from the parent issue to the request.
(Note you will always be working to a issue. If you’re not, stop what you are doing and make one!)
You should also search the issue tracker to see if others have made the same request – avoid making duplicate issues. But you can still comment on existing ones. However, avoid tacking on or extending the scope of an existing issue. If there is a similar issue but your request is different, link to the current issue (#NUMBER – you should know github conventions), e.g. “My request is similar to fred’s in #1234, but I need a foo not a bar”.
Read the CONTRIBUTING.md file
An increasing number of ontologies and other modeling artefacts include this in their repo. It should include guidance for people like you that is more specialized than this generalized guide. Read it!
Provide as much help as possible in your request
Remember, many ontologies are under-funded and requests are often fulfilled by our collaborators. Provide as much help as possible to them. If you are not knowledgeable about the domain, that is OK, but you can still provide context about your project.
e.g. “hey, I have been asked to provide a UI selection box with different pizza types. My boss gave me this list of ten pizza types but I don’t eat pizza, and I’m not sure how to map them to your ontology, and I may have some duplicates, it looks like you don’t have ‘Hawaiian’, but I’m wondering if maybe this ‘pineapple and ham’ is the same thing, or is there some subtle difference? If it’s the same, shouldn’t there be a synonym added?”
If you have been given a spreadsheet, you can provide a link to it. If you are mapping a data table, provide a link, or selected examples, as this can help orient the person fulfilling the request. Remember, people aren’t mind readers.
For example, making a ticket where you say “I need you to add HSap” is not helpful. But if you can say the HSap value appears in a column called species, and the other values are MMus, ‘DMel’, this gives the ontology developer the context they need, avoiding the need for confusing back and fortheon the issue tracker.
Analogies are useful: If you can find analogous terms use these as examples.
If you have a domain scientist handy, you may want to engage them before making requests – e.g. if they can provide definitions.
There is no rule as to whether to make one ticket/issue with multiple terms, or one ticket per term. If you think each term is nuanced and requires individual discussion, make separate tickets. If you are unsure then make an initial exploratory ticket. It’s usually OK to make a ticket for a question (GitHub even has a category/label for this).
Avoid making 100 requests only to discover that all of your requests are out of scope, requiring tedious closing of multiple tickets.
Be proactive and make pull requests
Even ontologies that have dedicated funding are under-resourced. You can help a lot by offering to make pull requests. If the ontology is a well-behaved OBO ontology there should be a clear procedure for doing this (if the ontology was made with ODK or follows ODK conventions, the file you should edit is src/ontology/foo-edit.owl in the repo).
Note that editing the OWL file usually entails using Protege. Basic Protege skills are worth learning. Normally this would not be required of most users, but in my group having basic Protege driving skills is useful and strongly encouraged.
In some cases you don’t need to edit the file – the ontology SOP may dictate editing a TSV in github or google sheet, with this compiled to OWL. Consult the contributor docs for that ontology (and if these are lacking, gently suggest ways to improve this to make it easier for those who come after you to contribute).
You will likely need to be added to an idranges file – again if the ontology follows standard conventions this will be obvious.
It is a good idea to check if an ontology is welcoming of PRs. This should be obvious from the pulls tab in GitHub. In general most ontologies should be, but some ontology groups may have trouble adapting to the times and may still be unfamiliar and may prefer issues. Also in many cases the addition of terms is best done by an expert.
In all cases, use your best judgment!
Follow templates where possible
Many repos are set up with GitHub issue templates. If the repo you are requesting in does not, you may want to gently suggest they do (or better yet, make a PR for this, using ODK as a guide). If you are reading this document then you likely have more github-foo than the ontologist/curator fulfilling the request, you can be helpful!
In some cases, ontologies may have set up a templating system (robot or dosdp). You can be super-helpful and follow the system that has been set up. In some cases this means filling in a predefined google sheet (e.g. with columns for name, definition, parent). In some cases you can make a PR on a TSV in the repo. This is an evolving area, so stay tuned. If the process is not clear there are people in the group with expertise who can help.
If all else fails, make your own ‘application ontology’
Sometimes there may simply be no ontology fit for purpose. Or existing ontologies may simply be unable to fulfil your request. It may be the case that there is an ontology called ‘pizza ontology’ squatting this conceptal space in OBO, but they may fail to grant your term requests for arbitrary reasons (“we don’t add Hawaiian pizzas, as we object in principle to putting pineapple on pizza”), or have unrealistic timelines (“we have a pizza modeling discussion set for 2 years now at the annual pizza ontology conference, we may consider putting your request before the committee then, but it is unlikely to be ratified for 4 years“). They may make it impossible to add terms by being ontological perfectionist (“we will add your pizza if you add perfect OWL logical axiomatization describing topological and gastronomic properties of the pizza according to our undocumented design practice”). They may also simply model things incorrectly (“thanks for your request. We have added ‘Hawaiin Pizza’ as a subclass of ‘Hawaii’. Aloha!”)
In general this won’t happen, especially with well-behaved OBOs, but there may be some holdouts! Be patient, and offer to make PRs (see above).
In some cases, such as those above, you may be justified in making your own ontology, using tools like ODK and ROBOT. Consult first! And never do this without first making requests.
In some cases you don’t need to make a new ontology, you can just create stubs. E.g. for a KG ingest, you can ‘inject’ something into the biolink-model, e.g. biolink:Pizza. There are various downsides to the injection approach, it may be better to use a different namespace. Depending on the project context it may or may not matter if the injected type resolves. Regardless, when doing this, add a comment to your code with a link to the ticket
Be bold and be collaborative
Whether you are making or fulfilling a request, you are all part of the same larger community of people working to make data more computable. Be as constructive and as helpful as possible, but also don’t hold back or be shy. Ultimately the ontology is there to serve you. But if it does not serve your need, is too confusing in some aspect, then it’s likely the same case for others.
Discussion
Overall the processes described above may seem overly complex or onerous. In fact they are not so different from analogous processes such as getting features into a piece of open source software.
Over the years there have been various proposals and implementations of ‘term brokers’ which act as both triage and a place to get an identifier for a term instantly. An example implementation is TermGenie.
One reason why term brokers have not taken over as a way of getting terms into ontologies over the github procedure above is that there is a strong tendency to accumulate ontological debt (akin to technical debt). It’s easy to stick a bunch of junk terms into an ontology. But maintaining these and dealing with the downstream costs of including these can be very high.
This topic needs a blog post all of its own, stay tuned…
A frequent source of confusion with ontologies and more generally with any kind of information system is the Open World Assumption. This trips up novice inexperienced users, but as I will argue in this post, information providers could do much more to help these users. But first an explanation of the terms:
With the Open World Assumption (OWA) we do not make any assumptions based on the absence of statements. In contrast, with the Closed World Assumption (CWA), if something is not explicitly stated to be true, it is assumed to be false. As an example, consider a pet-owner database with the following facts:
Fred type Human .
Farrah type Human .
Foofoo type Cat .
Fido type Dog .
Fred owns Foofoo .
Farrah owns Fido.
Depicted as:
depiction of pet owners RDF graph, with triples indicated by arrows. RDF follows the OWA: the lack of a triple between Fred and Fido does not entail that Fred doesn’t own Fido.
Under the CWA, the answer to the question “how many cats does Fred own” is 1. Similarly, for “how many owners does Fido have” the answer also 1.
RDF and OWL are built on the OWA, where the answer to both question is: at least 1. We can’t rule out that Fred also owns Fido, or that he owns animals not known to the database. With the OWA, we can answer the question “does Fred own Foofoo” decisively with a “yes”, but if we ask “does Fred own Fido” the answer is “we don’t know”. It’s not asserted or entailed in the database, and neither is the negation.
Ontology formalisms such as OWL are explicitly built on the OWA, whereas traditional database systems have features constructed on the CWA.
OWL gives you mechanisms to add closure axioms, which allows you to be precise about what is known not be to true, in addition what is known to be true. For example, we can state that Fred does not own Fido, which closes the world a little. We can also state that Fred only owns Cats, which closes the world further, but still does not rule out that Fred owns cats other than Foofoo. We can also use an OWL Enumeration construct to exhaustively list animals Fred does own, which finally allows the answer to the question “how many animals does Fred own” with a specific number.
OWL ontologies and databases (aka ABoxes) often lack sufficient closure axioms in order to answer questions involving negation or counts. Sometimes this is simply because it’s a lot of work to add these additional axioms, work that doesn’t always have a payoff given typical OWL use cases. But often it is because of a mismatch between what the database/ontology author thinks they are saying, and what they are actually saying under the OWA. This kind of mismatch intent is quite common with OWL ontology developers.
Another common trap is reading OWL axioms such as Domain and Range as Closed World constraints, as they might be applied in a traditional database or a CWA object-oriented formalism such as UML.
Consider the following database plus ontology in OWL, where we attempt to constrain the ‘owns’ property only to humans
owns Domain Human
Fido type Dog
Fred type Human
Fido owns Fred
We might expect this to yield some kind of error. Clearly using our own knowledge of the world something is amiss here (likely the directions of the final triple has been accidentally inverted). But if we are to feed this in to an OWL reasoner to check for incoherencies (see previous posts on this topic), then it will report everything as consistent. However, if we examine the inferences closely, we will see that it is has inferred Fido to be both a Dog and a Human. It is only after we have stated explicit axioms that assert or entail Dog and Human are disjoint that we will see an inconsistency:
OWL reasoner entailing Fido is both a Dog and a Human, with the latter entailed by the Domain axiom. Note the ontology is still coherent, and only becomes incoherent when we add a disjointness axiom
In many cases the OWA is the most appropriate formalism to use, especially in a domain such as the biosciences, where knowledge (and consequently our databases) is frequently incomplete. However, we can’t afford to ignore the fact that the OWA contradicts many user expectations about information systems, and must be pragmatic and take care not to lead users astray.
BioPAX and the Open World Assumption
BioPAX is an RDF-based format for exchanging pathways. It is supposedly an RDF/OWL-based standard, with an OWL ontology defining the various classes and properties that can be used in the RDF representation. However, by their own admission the authors of the format were not aware of OWL semantics, and the OWA specifically, as explained in the official docs in the level 2 doc appendix, and also further expanded on in a paper from 2005 by Alan Ruttenberg, Jonathan Rees, and Joanne Luciano, Experience Using OWL DL for the Exchange of Biological Pathway Information, in particular the section “Ambushed by the Open World Assumption“. This gives particular examples of where the OWA makes things hard that should be easy, such as enumerating the members of a protein complex (we typically know all the members, but the BioPAX RDF representation doesn’t close the world).
BioPAX ontology together with RDF instances from EcoCyc. Triples for a the reaction 2-iminopropanoate + H2O → pyruvate + ammonium is shown. The reaction has ‘left’ and ‘right’ properties for reactants such as H2O. These are intended to be exhaustive but the lack of closure axioms means that we cannot rule out additional reactants for this reaction.
The Ortholog Conjecture and the Open World Assumption
In 2011 Nehrt et al made the controversial claim that they had overturned the ortholog conjecture, i.e they claimed that orthologs were less functionally similar than paralogs. This was in contrast to the received wisdom, i.e if a gene duplicates with a species lineage (paralogs) there is redundancy and one copy is less constrained to evolve a new function. Their analysis was based on semantic similarity of annotations in the Gene Ontology.
The paper stimulated a lot of discussion and follow-up studies and analyses. We in the Gene Ontology Consortium published a short response, “On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report“. In this we pointed out that the analysis assumed the CWA (absence of function assignment means the gene does not have that function), whereas GO annotations should be interpreted under the OWA (we have an explicit way of assigning that a gene does not have a function, rather than relying on absence). Due to bias in GO annotations, paralogs may artificially have higher functional similarity scores, rendering the original analysis insufficient to reject the ortholog conjecture.
The OWA in GO annotations is also addressed in the GO Handbook in the chapter Pitfalls, Biases, and Remedies by Pascale Gaudet. This chapter also makes the point that OWA can be considered in the context of annotation bias. For example, not all genes are annotated at the same level of specificity. The genes that are annotated reflect biases in experiments and publication, as well as what is selected to be curated.
Open World Assumption Considered (Sometimes) Harmful
The OWA is harmful where it grossly misaligns with use expectations.
While a base assumption of OWA is usually required with any curated information, it is also helpful to think in terms of an overriding implicit contract between any information provider and information consumer: any (good) information provider attempts to provide as complete information as is possible, given resource constraints.
My squid has no tentacles
Let’s take an example: If I am providing an ontology I purport to be an anatomical ontology for squid, then it behooves me to make sure the main body parts found in a squid are present.
Chiroteuthis veranii, the long armed squid, By Ernst Haeckel, with two elongated tentacles.
Let’s say my ontology contains classes for some squid body parts such as eye, brain, yet lacks classes for others such as the tentacle. A user may be surprised and disappointed when they search for tentacle and come back empty-handed (or empty tentacled, if they are a squid user). If this user were to tell me that my ontology sucked, I would be perfectly within my logical rights to retort: “sir, this ontology is in OWL and thus follows the Open World Assumption; as such the absence of a tentacle class in my squid ontology does not entail that squids lack tentacles, for such a claim would be ridiculous. Please refer to this dense interlinked set of documents published by the W3C that requires PhD in logic to understand and cease from making unwarranted assumptions about my ontology“.
Yet really the user is correct here. There should be an assumption of reasonable coverage, and I have violated that assumption. The tentacle is a major body part, it’s not like I have omitted some obscure neuroanatomical region. Is there a hard and fast dividing line here? No, of course not. But there are basic common sense principles that should be adhered to, and if they cannot be adhered to, omissions and biases should be clearly documented in the ontology to avoid confusing users.
This hypothetical example is made up, but I have seen many cases where biases and omissions in ontologies confusingly lead the user to infer absence where the inference is unwarranted.
Hydroxycholoroquine and COVID-19
The Coronavirus Infectious Disease Ontology (CIDO) integrates a number of different ontologies and includes axioms connecting terms or entities using different object properties. An example is the ‘treatment-for’ edge which connects diseases to treatments. Initially the ontology only contained a single treatment axiom, between COVID-19 and Hydroxychloroquine (HCQ). Under the OWA, this is perfectly valid: COVID-19 has been treated with HCQ (there is no implication about whether treatment is successful or not). However, the inclusion of a single edge of this type is at best confusing. A user could be led to believe there was something special about HCQ compared to other treatments, and the ontology developers had deliberately omitted these. In fact initial evidence for HCQ as a successful treatment has not panned out (despite what some prominent adherents may say). There are many other treatments, many of which are in different clinical trial phases, many of which may prove more effective, yet assertions about these are lacking in CIDO. In this particular case, even though the OWA allows us to legitimately omit information, from a common sense perspective, less is more here: it is better to include no information about treatments at all rather than confusingly sparse information. Luckily the CIDO developers have rapidly addressed this situation.
Ragged Lattices, including species-specific classes
An under-appreciated problem is the confusion ragged ontology lattices can cause users. This can be seen as a mismatch between localized CWA expectations on the part of the user and OWA on the part of the ontology provider. But first an explanation of what I mean by ragged lattice:
Many ontologies are compositional in nature. In a previous post we discussed how the Rector Normalization pattern could be used to automate classification. The resulting multi-parent classification forms a lattice. I have also written about how we should embrace multiple inheritance. One caveat to both of these pieces is that we should be aware of the confusion that can be caused by inconsistently populated (‘ragged’) lattices.
Take for example cell types, which can be classified along a number of orthogonal axes, many intrinsic to the cell itself – its morphological properties, it’s lineage, its function, or gene products expressed. The example below shows the leukocyte hierarchy in CL, largely based on intrinsic properties:
Protege screenshot of the cell ontology, leukocyte hierarchy
Another way to classify cells is by anatomical location. In CL we have a class ‘kidney cell’ which is logically defined as ‘any cell that is part of a kidney’. This branch of CL recapitulates anatomy at the upper levels.
kidney cell hierarchy in CL, recapitulating anatomical classification
so far, perfectly coherent. However, the resulting structure can be confusing to someone now used to thinking in OWL and the OWA. I have seen many instances where a user will go to a branch of CL such as ‘kidney cell‘ and start looking for a class such as ‘mast cell‘. It’s perfectly reasonable for them to look here, as mast cells are found in most organs. However, CL does not place ‘mast cell’ as a subclass of ‘kidney cell’ as this would entail that all mast cells are found in the kidney. And, CL has not populated the cross-product of all the main immune cell types with the anatomical structures in which they can be found. The fleshing out of the lattice is inconsistent, leading to confusion caused by violation of an assumed contract (provision of a class “kidney cell” and incomplete cell types underneath).
This is even more apparent if we introduce different axes of classification, such as the organism taxon in which the cell type is found, e.g. “mouse lymphocyte”, “human lymphocyte”:
inferred hierarchy when we add classes following a taxon design pattern, e.g. mouse cell, mouse lymphocyte. Only a small set of classes in the ontology are mouse specific.
Above is a screenshot of what happens when we introduce classes such as ‘mouse cell’ or ‘mouse lymphocyte’. We see very few classes underneath. Many people indoctrinated/experienced with OWL will not have a problem with this, they understand that these groupings are just for mouse-specific classes, and that the OWA holds, and absence of a particular compositional class, e.g. “mouse neuron” does not entail that mice do not have neurons.
One ontology in which the taxon pattern does work is the protein ontology, which includes groupings like “mouse protein”. PRO includes all known mouse proteins under this, so the classification is not ragged in the same way as the examples above.
There is no perfect solution here. Enforcing single inheritance does not work. Compositional class groupings are useful. However, ontology developers should try and avoid ragged lattices, and where possible populate lattices consistently. We need better tools here, e.g. ways to quantitative measure the raggedness of our ontologies.
Ontologies and databases should better document biases and assumptions
As providers of information, we need to do a better job of making all assumptions explicit and well-documented. This applies particularly to any curated corpus of knowledge, but in particular to ontologies. Even though hiding behind the OWA is logically defensible, we need to make things more intuitive for users.
It’s not uncommon for an ontology to have excellent very complete coverage of one aspect of the domain, and to be highly incomplete in another (reflecting either the biases/interests of the developers, or of the broader field). In fact I have been guilty of this in ontologies I have built or contributed to. I have seen users become confused when a class they expected to find was not present, or they have been perplexed by the ragged lattice problem, or an edge they expected to find was not present.
Few knowledge bases can ever be complete, but we can do better at documenting known unknowns or incompletenesses. We can imagine a range of formal computable ways of doing this, but a good start would be some simple standard annotation properties that can be used as inline documentation in the ontology. Branches of the ontology could be tagged in this way, e.g. to indicate that ‘kidney cell’ doesn’t include all cells found in the kidney, only kidney specific ones; or that developmental processes in GO are biased towards human and model organisms. This system could also be used for Knowledge Graphs and annotation databases too, to indicate that particular genes may be under-studied or under-annotated, an extension of the ND evidence type used in GO.
In addition we could do a better job at providing consistent levels of coverage of annotations or classes. There are tradeoffs here, as we naturally do not want to omit anything, but we can do a better job at balancing things out. Better tools are required here for detecting imbalances and helping populate information in a more balanced consistent fashion. Some of these may already exist and I’m not aware of them – please respond in the comments if you are aware of any!
There is a lot we still have to learn about SARS-CoV-2 and the disease it causes in humans. One aspect of the virus that we do know a lot about is its underlying molecular blueprint. We have the core viral genome, and broadly speaking we know the ‘parts list’ of proteins that are translated and spliced from the genome. There is a lot that we still don’t know about the proteins themselves – how variations affect the ability of the virus to infect a host, which molecules bind to these proteins and how that binding impacts their function. But at least we know the basic parts list.Or we think we do. There is the Spike protein (S), which adorns the surface of this virus like a crown, hence the name ‘coronavirus’. There are the 16 ‘non-structural proteins’ formed by cleavage of a viral polyprotein; one such protein is nsp5 which functions as a protease that performs this same cleavage. And there are the accessory proteins, such as the mysterious ORF8. The genomic blueprint and the translated and cleaved products can be illustrated visually:
Each of these proteins has a variety of different names, for example, nsp3 is also known as PLpro. The genome is small enough that most scientists working on it have memorized the core aliases such that human-to-human communication is relatively unproblematic.
Of course, as we all know, relying on gene and protein symbols for unique identification in a database for machine-machine communication is a recipe for disaster. Symbols are inherently ambiguous, so we assign identifiers to entities in order to disambiguate them. These identifiers can then be adorned with metadata such as symbols, names, aliases, descriptions, functional descriptions and so on.
As everyone working in bioinformatics knows, different databases assign different identifiers for the same entity (by varying definitions of ‘same’), creating the ubiquitous identifier mapping problem and a cottage industry in mapping solutions.
This is a perennial problem for all omics entities such as genes and proteins, regardless of the organism or system being studied. But when it comes to SARS-CoV-2, things are considerably worse.
It turns out that many problems arise from the relatively simple biological phenomena of cleavage of viral polyproteins. While the molecular biology is not so difficult (one parent protein as a source for many derivative proteins), many bioinformatics databases are not designed with this phenomena in mind. This is fine for scenarios where we can afford to gloss over differences between the immediate products of translation and downstream cleavage products. While cleavage certainly happens in the human genome (e.g POMC), it’s rare enough to effectively ignore in some contexts (although arguably this causes a lot of problems too). However, the phenomena of a single translation product producing multiple functionally discrete units is much more common in viruses, which creates issues for many databases when creating a useful ‘canonical parts list’.
The roll-up problem
The first problem is that many databases either ignore the cleavage products or don’t assign them identifiers in the same category as other proteins. This has the effect of ‘rolling up’ all data to the polyprotein. This undercounts the number of true proteins, and does not provision identifiers for distinct functional entities.
For example, NCBI Gene does a fantastic job of assembling the genetic parts lists for genes across all cellular organisms and viruses. Most of the time, the gene is an appropriate unit of analysis, and we can use gene identifiers as proxies for the product transcribed and translated from that gene. In the case of SARS-CoV-2, NCBI mints a gene ID for the polyprotein (e.g. 1ab), but lacks distinct gene IDs for individual cleavage products ,even though each arguably fulfill the definition of discrete genes, and each is a discrete non-overlapping unit with a distinct function. Referring to the figure above, nsp1-10 are all ‘rolled up’ into the 1ab or 1a polyprotein entity.
Now this is perhaps understandable given that the NCBI Gene database is centered on genes (they do provide distinct protein IDs for the products, see later), and the case can be made that we should only have gene IDs for the immediate protein products (e.g polyproteins and structural proteins and accessory ORFs).
But the roll-up problem also exists for dedicated protein databases such as UniProt. UniProt mint IDs for polyproteins such as 1ab, but there is no UniProt accession for nsp1-16. These are ‘rolled up’ into the 1ab entry, as shown in the screenshot:
UniProt entry for viral polyprotein 1ab. Function summaries for distinct proteins (nsp1-3 shown, others below the fold) are rolled up to the polyprotein level
However, UniProt do provide identifiers for the various cleavage products, these are called ‘chain IDs’, and are of the form PRO_nnnnn. For example, an identifier for the nsp3 product is PRO_0000449621). Due to the structure of these IDs they are sometimes called ‘PRO IDs’ (However, they should not be confused with IDs from the Protein Ontology, which are also called ‘PRO IDs’. Confusing, huh?).
UniProt ‘chain’ IDs, with nsp3 highlighted. These do not get distinct accessions and do not get treated in the same way as a ‘full’ accessioned protein entry
Unfortunately these chain IDs are not quite first-class citizens in the protein database world. For example, the fantastic InterproScan pipeline is executed on the polyproteins, not the chain IDs. This means that domain and GO function calls are made at the level of the polyprotein, so it looks to a machine like there is one super-multifunctional protein that acts as a protease, ADP-ribose binding protein, autophagosome induction, etc. In one sense this is sort of true, but I don’t think it’s a very useful way of looking at protein function. It is more meaningful to assign the functions at the level of the individual cleavage products. It is possible to propagate the interproscan-assigned annotations down to the NSPs using the supplied coordinates, but it should not fall on consumers to do this extra processing step.
The not-quite-first-class status of these identifiers also causes additional issues. For example different ways to write the same ID (P0DTD1-PRO_0000449621 vs P0DTD1:PRO_0000449621 vs P0DTD1#PRO_0000449621 vs simply PRO_0000449621), and no standard URL (although UniProt is working on these issues).
The identical twin identifier problem
An additional major problem is the existence of two distinct identifiers for each of the initial non-structural proteins. Of course, we live with multiple identifiers in bioinformatics all the time, but we generally expect a single database to assign a single identifier for a single entity. Not so!
The problem here is the fact there is a ribosomal frameshift in the translation of the polyprotein in SARS-CoV-2 (again, the biology here is fairly basic), which necessitates two distinct database entries; here: each (called 1ab; aka P0DTD1 and 1a; aka P0DTC1). So far so good. However, while these are truly distinct polyproteins, the non-structural proteins cleaved from them are identical up until the frameshift. However, due to an assumption in databases that each cleavage product must have one ‘parent’, IDs are duplicated. This is shown in the following diagram:
Two polyproteins 1ab and 1a (this is shown an 1a and 1b here, in fact the 1ab pp covers both ORFs). Each nsp 1-10 gets two distinct IDs depending on the ‘parent’ despite sequence identity, and as far as we know, the same function. Diagram courtesy of ViralZone/SIB
While on the surface this may seem like a trivial problem with some easy workarounds, in fact this representation breaks a number of things. First it artificially inflates the proteome making it seems there are more proteins than they actually are. A parts list is less useful if it has to be post-processed in ad-hoc ways to get the ‘true’ parts list.
It can make it difficult when trying to promote the use of standard database identifiers over protein symbols because an arbitrary decision must be made, and if I make a different arbitrary decision from you, then our data does not automatically integrate. Ironically, using standard protein symbols like ‘nsp3’ may actually be better for database integration than identifiers designed for that purpose!
And when curating something like a protein interaction database or a pathway database an orthology database or assembling a COVID Knowledge Graph that deals with pairwise interactions, we must either choose arbitrarily or fully populate the cross-product of all pair combos. E.g. if nsp3 in SARS is orthologous to nsp3 in SARS-CoV-2, then we have to make four statements instead of one.
While I focused on UniProt IDs here, other major resources such as NCBI also have these duplicates in their protein database for the sam reason.
Kudos to Wikidata and the Protein Ontology
Two resources I have seen that gets this right are the Protein Ontology and Wikidata.
The Protein Ontology (aka PR, sometimes known as PRO; NOT to be confused with ‘PRO’ entries in UniProt) includes a single first-class identifier/PURL for each nsp, for example nsp3 has CURIE PR:000050272 (http://purl.obolibrary.org/obo/PR_000050272). It has mappings to each of the two sequence-identical PRO chain IDs in UniProt. It also has distinct entries for the parent polyprotein, and it has meaningful ontologically encoded edges linking the two (SARS-CoV-2 protein ontology available from https://proconsortium.org/download/development/pro_sars2.obo)
Protein Ontology entry for nsp3Protein Ontology entry for SARS-CoV-2 nsp3, shown in obo format syntax (obo format is a human-readable concrete syntax for OWL).
Wikidata also does a good job of providing a single canonical identifier that is 1:1 with distinct proteins encoded by the SARS-CoV-2 genome (for example, the entry for nsp3 https://www.wikidata.org/wiki/Q87917581). However, it is not as complete. Sadly it does not have mappings to either the protein ontology or the UniProt PRO chain IDs (remember: these are different!).
The big disadvantage of Wikidata and the Protein Ontology over the big major sequence databases is that they are not the big major sequence databases. They suffer a curation lag (one employing crowdsourcing, the other manual curation) whereas the main protein databases automate more albeit at the expense of quirks such as non-first-class IDs and duplicate IDs. Depending on the use case, this may not be a problem. Due to the importance of the SARS-CoV-2 proteome, sufficient resources were able to be marshalled on this effort. But will this scale if we want unique non-dupplicate IDs for all proteins in all coronaviruses – including all the different ones infecting bats and other hosts?
A compromise solution
When building KG-COVID-19 we needed to decide which IDs to use as canonical for SARS-CoV-2 genes and proteins. While our system is capable of working with alternate IDs (either normalizing during the KG build stage, or post build as part of a clique-merge step), it is best to avoid these. Mapping IDs can lead to either unintentional roll-ups (information about the cleavage product propagating up to the polyprotein) or worse, fanning-out (rolled up information then spreading to ‘sibling’ proteins); or if 1:1 is enforced the overall system is fragile.
We liked the curation work done by the Protein Ontology, but we knew (1) we needed a system that we could instantly get IDs for proteins in any other viral genome (2) we wanted to be aligned with sources we were ingesting, such as the IntAct curation of the Gordon et al paper, and Reactome plus GO-CAM curation of viral-host pathways. This necessitating the choice of a major database.
Working with the very helpful UniProt team in concert with IntAct curators we were able to ascertain that of the duplicate entries, by convention we should take the ID that comes from the longer polyprotein as the ‘reference’. For example, nsp3 has the following chain IDs:
P0DTC1-PRO_0000449637 (from the shorter pp: 1a) [NO]
P0DTD1-PRO_0000449621 (from the longer pp: 1ab) [YES]
(Remember, these are sequence-identical and as far as we know functionally identical).
In this case, we take PRO_0000449621 as the canonical/reference entry. This is also the entry IntAct use to curate interactions. We pretend that PRO_0000449637 does not exist.
This is very far from perfect. Biologically speaking, it’s actually the shorter pp that is more commonly expressed, so the choice of the longer one is potentially confusing. These is also the question of how UniProt should propagate annotations. It is valid to propagate from one chain ID to its ‘identical twin’. But what about when these annotations reference other cleavage products (e.g pairwise functional annotation from a GO-CAM, or an ortholog). Do we populate the cross-product? This could get confusing (my interest in this was both from the point of view of our COVID KG, but also wearing my GO hat)
Nevertheless this was the best compromise we could find, and we decided to follow this convention.
Working with the UniProt and IntAct teams we also came up with a standard way to write IDs and PURLs for the chain IDs (CURIEs are of the form UniProtKB:ACCESSION-PRO_NNNNNNN). While this is not the most thrilling or groundbreaking development in the battle against coronaviruses, it excites me as it means we have to do far less time consuming and error prone identifier post-processing just to make data link up.
As part of the KG-COVID-19 project, Marcin Joachimiak coordinated the curation of a canonical UniProt-centric protein file (available in our GitHub repository), leveraging work that had been done by UniProt, the protein ontology curators, and the SciBite ontology team. We use UniProt IDs (either standard accessions, for polyproteins, structural proteins, and accessory ORFs; or chain IDs for NSPs) This file differs from the files obtained directly from UniProt, as we include only reference members of nsp ‘twins’, and we exclude less meaningful cleavage products.
This file lives in GitHub (we accept Pull Requests) and serves as one source for building our KG. The information is also available in KGX property graph format, or as RDF, or can be queried from our SPARQL endpoint.
We are also coordinating with different groups such as COVIDScholar to use this as a canonical vocabulary for text mining. Previously groups performing concept recognition on the CORD-19 corpus using protein databases as dictionaries missed the non-structural proteins, which is a major drawback.
Imagine a world
In an ideal world posts like this would never need to be written. There is no panacea; however, systems such as the Protein Ontology and Wikidata which employ an ontologically grounded flexible graph make it easier to work around legacy assumptions about relationships between types of molecular parts (see also the feature graph concept from Chado). The ontology-oriented basis makes it easier to render implicit assumptions explicit, and to encode things such as the relationship between molecular parts in a computable way. Also embracing OBO principles and reusing identifiers/PURLs rather than minting new ones for each database could go some way towards avoiding potential confusion and duplication of effort.
I know this is difficult to conceive of in the current landscape of bioinformatics databases, but paraphrasing John Lennon, I invite you to:
Imagine no proliferating identifiers I wonder if you can No need for mappings or normalization A federation of ann(otations) Imagine all the people (and machines) sharing all the data, you
You may say I’m a dreamer But I’m not the only one I hope some day you’ll join us And the knowledge will be as one
In the OBO world, most of our ontologies are developed by and for English-speaking audiences. We would love to have translations of labels, synonyms, definitions, and so on in other languages. However, we lack the resources to do this (an exception is the HPO, which includes high quality curated translations for many languages).
Wikipedia/Wikidata is an excellent source of crowdsourced language translations. Wikidata provides language-neutral concept IDs that link multiple language-specific Wikipedia pages. Wikidata also includes mappings to ontology class IDs, and provides a SPARQL endpoint. All this can be leveraged for a first pass at language translations.
For example, the Wikidata entity for badlands is mapped to the equivalent ENVO class PURL. This entity in Wikidata also has multiple rdfs:label annotations (maximum one per language).
We can query Wikidata for all rdfs:label translations for all classes in ENVO. I will use the sparqlprog_wikidata framework to demonstrate this:
pq-wikidata ‘envo_id(C,P),label(C,N),Lang is lang(N)’
This compiles down to the following SPARQL which is then executed against the Wikidata endpoint:
Of course, different ontologies will differ in how their coverage maps to Wikidata. In some cases, ontologies will include many more concepts; or the corresponding Wikidata entities will have fewer or no non-English labels. But this will likely decrease over time.
There may be other ways to increase coverage. Many ontology classes are compositional in nature, so a combination of language translations of base classes plus language specific encodings of grammatical patterns could yield many more. The natural place to add these would be in the manually curated .yaml files used to specify ontology design patterns, through frameworks like DOSDP. And of course, there is a lot of research in Deep Learning methods for language translation. A combination of these methods could yield high coverage with hopefully good accuracy.
As far as I am aware, these methods have not been formally evaluated. Doing an evaluation will be challenging as it will require high-quality gold standards. Ontology developers spend a lot of time coming up with the best primary label for classes, balancing ontological correctness, elimination of ambiguity, understanding of usage of terms by domain specialists, and (for many ontologies, but not all) avoiding overly abstruse labels. Manually curated efforts such as the HPO translations would be an excellent start.
This is one post in a series of tips on ontology development, see the parent post for more details.
A common mistake is to over-specify an OWL definition (another post will be on under-specification). While not technically wrong, over-specification loses you reasoning power, limiting your ability to auto-classify your ontology. Formally, what I mean by over-specifying here is: stating more conditions than is required for correct entailments.
One manifestation of this anti-pattern is the over-specified genus. (this is where I disagree with Seppala et al on S3.1.1, use the genus proximus, see previous post). I will use a contrived example here, although there are many real examples. GO contains a class ‘Schwann cell differentiation’, with an OWL definition referencing ‘Schwann cell’ from the cell ontology (CL). I consider the logical definition to be neither over- nor under- specified:
‘Schwann cell differentiation’ EquivalentTo ‘cell differentiation’ and results-in-acquisition-of-features-of some ‘Schwann cell’
We also have a corresponding logical definition for the parent:
‘glial cell differentiation’ EquivalentTo ‘cell differentiation’ and results-in-acquisition-of-features-of some ‘glial cell’
The Cell Ontology (CL) contains the knowledge that Schwann cells are subtypes of glial cells, which allows us to infer that ‘Schwann cell differentiation’ is a subtype of ‘glial cell differentiation’. So far, so good (if you read the post on Normalization you should be nodding along). This definition does real work for us in the ontology: we infer the GO hierarchy based on the definition and classification of cells in CL.
Now, imagine that in fact GO had an alternate OWL definition:
‘Schwann cell differentiation’ EquivalentTo ‘glial cell differentiation’ and results-in-acquisition-of-features-of some ‘Schwann cell’
This is not wrong, but is far less useful. We want to be able to infer the glial cell parentage, rather than assert it. Asserting it violates DRY (the Don’t Repeat Yourself principle) as we implicitly repeat the assertion about Schwann cells being glial cells in GO (when in fact the primary assertion belongs in CL). If one day the community decides that in fact that Schwann cells are not glial but in fact neurons (OK, so this example is not so realistic…), then we have to change this in two places. Having to change things in two places is definitely a bad thing.
I have seen this kind of genus-overspecification in a number of different ontologies; this can be a side-effect of the harmful misapplication of the single-inheritance principle (see ‘Single inheritance considered harmful’, a previous post). This can also arise from tooling limitations: the NCIT neoplasm hierarchy has a number of examples of this due to the tool they originally used for authoring definitions.
Another related over-specification is too many differentiae, which drastically limits the work a reasoner and your logical axioms can do for you. As a hypothetical example, imagine that we have a named cell type ‘hippocampal interneuron’, conventionally defined and used in the (trivial) sense of any interneuron whose soma is located in a hippocampus. Now let’s imagine that single-cell transcriptomics has shown that these cells always express genes A, B and C (OK, there are may nuances with integrating ontologies with single-cell data but let’s make some simplifying assumptions for now)/
It may be tempting to write a logical definition:
‘hippocampal interneuron’ EquivalentTo
interneuron AND
has-soma-location SOME hippocampus AND
expresses some A AND
expresses some B AND
expresses some C
This is not wrong per se (at least in our hypothetical world where hippocampal neurons always express these), but the definition does less work for us. In particular, if we later include a cell type ‘hippocampus CA1 interneuron’ defined as any interneuron in the CA1 region of the hippocampus, we would like this to be classified under hippocampal neuron. However, this will not happen unless we redundantly state gene expression criteria for every class, violating DRY.
The correct thing to do here is to use what is sometimes called a ‘hidden General Class Inclusion (GCI) axiom’ which is just a fancy way of saying that SubClassOf (necessary conditions) can be mixed in with an equivalence axiom / logical definition:
‘hippocampal interneuron’ EquivalentTo interneuron AND has-soma-location SOME hippocampus
‘hippocampal interneuron’ SubClassOf expresses some A
‘hippocampal interneuron’ SubClassOf expresses some B
‘hippocampal interneuron’ SubClassOf expresses some C
In a later post, I will return to the concept of an axiom doing ‘work’, and provide a more formal definition that can be used to evaluate logical definitions. However, even without a formal metric, the concept of ‘work’ is intuitive to people who have experience using OWL logical definitions to derive hierarchies. These people usually intuitively test things in the reasoner as they go along, rather than simply writing an OWL definition and hoping it will work.
Another sign that you may be overstating logical definitions is if they are for groups of similar classes, yet they do not fit into any design pattern template.
For example, in the above examples, the cell differentiation branch of GO fits into a standard pattern
cell differentiation andresults-in-acquisition-of-features-of some C
where C is any cell type. The over-specified definition does not fit this pattern.
Graph databases such as Neo4J are gaining in popularity. These are in many ways comparable to RDF databases (triplestores), but I will highlight three differences:
The underlying datamodel in most graph databases is a Property Graph (PG). This means that information can be directly attached to edges. In RDF this can only be done indirectly via reification, or reification-like models, or named graphs.
RDF is based on open standards, and comes with a standard query language (SPARQL), whereas a unified set of standards have yet to arrive for PGs.
RDF has a formal semantics, and languages such as OWL can be layered on providing more expressive semantics.
RDF* (and its accompanying query language SPARQL*) is an attempt to bring PGs into RDF, thus providing an answer for points 1-2. More info can be found in this post by Olaf Hartig.
You can find more info in that post and in related docs, but briefly RDF* adds syntax to add property directly onto edges, e.g
<<:bob foaf:friendOf :alice>> ex:certainty 0.9 .
This has a natural visual cognate:
We can easily imagine building this out into a large graph of friend-of connections, or connecting other kinds of nodes, and keeping additional useful information on the edges.
But what about the 3rd item, semantics?
What about semantics?
For many in both linked data/RDF and in graph database/PG camps, this is perceived as a minor concern. In fact you can often find RDF people whinging about OWL being too complex or some such. The “semantic web” has even been rebranded as “linked data”. But in fact, in the life sciences many of us have found OWL to be incredibly useful, and being able to clearly state what your graphs mean has clear advantages.
OK, but then why not just use what we have already? OWL-DL already has a mapping to RDF, and any document in RDF is automatically an RDF* document, so problem solved?
Not quite. There are two issues with continuing he status quo in the world of RDF* and PGs:
The mapping of OWL to RDF can be incredibly verbose and leads to unintuitive graphs that inhibit effective computation.
OWL is not the only fruit. It is great for the use cases it was designed for, but there are other modes of inference and other frameworks beyond first-order logic that people care about.
Issues with existing OWL to RDF mapping
Let’s face it, the existing mapping is pretty ugly. This is especially true for life-science ontologies that are typically construed of as relational graphs, where edges are formally SubClassOf-SomeValuesFrom axioms. See the post on obo json for more discussion of this. The basic idea here is that in OWL, object properties connect individuals (e.g. my left thumb is connected to my left hand via part-of). In contrast, classes are not connected directly via object properties, rather they are related via subClassOf and class expressions. It is not meaningful in OWL to say “finger (class) part_of hand (class)”. Instead we seek to say “all instances of finger are part_of some x, where x is an instance of a hand”. In Manchester Syntax this has compact form
Finger SubClassOf Part_of some Hand
This is translated to RDF as
Finger owl:subClassOf [
a owl:Restriction ;
owl:onProperty :part_of
owl:someValuesFrom :Hand
]
As an example, consider 3 classes in an anatomy ontology, finger, hand, and forelimb, all connected via part-ofs (i.e. every finger is part of some hand, and ever hand is part of some finger). This looks sensible when we use a native OWL syntax, but when we encode as RDF we get a monstrosity:
Fig2 (A) two axioms written in Manchester Syntax describing anatomical relationship between three structures (B) corresponding RDF following official OWL to RDF mapping, with 4 triples per existential axiom, and the introduction of two blank nodes (C) How the axioms are conceptualized by ontology developers, domain experts and how most browsers render them. The disconnect between B and C is an enduring source of confusion among many.
This ugliness was not the result of some kind of perverse decision by the designers of the OWL specs, it’s a necessary consequence of the existing stack which bottoms out at triples as the atomic semantic unit.
In fact, in practice many people employ some kind of simplification and bypass the official mapping and store the edges as simple triples, even though this is semantically invalid. We can see this for example in how Wikidata loads OBOs into its triplestore. This can cause confusion, for example, WD storing reciprocal inverse axioms (e.g. part-of, has-part) even though this is meaningless when collapsed to simple triples.
I would argue there is an implicit contract when we say we are using a graph-based formalism that the structures in our model correspond to the kinds of graphs we draw on whiteboards when representing an ontology or knowledge graph, and the kinds of graphs that are useful for computation; the current mapping violates that implicit contract, usually causing a lot of confusion.
It also has pragmatic implications too. Writing a SPARQL query that traverses a graph like the one in (B), following certain edge types but not others (one of the most common uses of ontologies in bioinformatics) is a horrendous task!
OWL is not the only knowledge representation language
The other reason not to stick with the status quo for semantics for RDF* and PGs is that we may want to go beyond OWL.
OWL is fantastic for the things it was designed for. In the life sciences, it is vital for automatic classification and semantic validation of large ontologies (see half of the posts in this blog site). It is incredibly useful for checking the biological validity of complex instance graphs against our encoded knowledge of the world.
However, not everything we want to say in a Knowledge Graph (KG) can be stated directly in OWL. OWL-DL is based on a fragment of first order logic (FOL); there are certainly things not in that fragment that are useful, but often we have to go outside strict FOL altogether. Much of biological knowledge is contextual and probabilistic. A lot of what we want to say is quantitative rather than qualitative.
For example, when relating a disease to a phenotype (both of which are conventionally modeled as classes, and thus not directly linkable via a property in OWL), it is usually false to say “every person with this disease has this phenotype“. We can invent all kinds of fudges for this – BFO has the concept of a disposition, but this is just a hack for not being able to state probabilistic or quantitative knowledge more directly.
A proposed path forward for semantics in Property Graphs and RDF*
RDF* provides us with an astoundingly obvious way to encode at least some fragment of OWL in a more intuitive way that preserves the graph-like natural encoding of knowledges. Rather than introduce additional blank nodes as in the current OWL to RDF mapping, we simply push the semantics onto the edge label!
Here is example of how this might look for the axioms in the figure above in RDF*
I am assuming the existing of a vocabulary called owlstar here – more on that in a moment.
In any native visualization of RDF* this will end up looking like Fig1C, with the semantics adorning the edges where they belong. For example:
proposed owlstar mapping of an OWL subclass restriction. This is clearly simpler than the corresponding graph fragment in 2B. While the edge properties (in square brackets) may be too abstract to show an end user (or even a bioinformatician performing graph-theoretiic operations), the core edge is meaningful and corresponds to how an anatomist or ordinary person might think of the relationship.
Maybe this is all pretty obvious, and many people loading bio-ontologies into either Neo4j or RDF end up treating edges as edges anyway. You can see the mapping we use in our SciGraph Neo4J OWL Loader, which is used by both Monarch Initiative and NIF Standard projects. The OLS Neo4J representation is similar. Pretty much anyone who has loaded the GO into a graph database has done the same thing, ignoring the OWL to RDF mapping. The same goes for the current wave of Knowledge Graph embedding based machine learning approaches, which typically embed a simpler graphical representation.
So problem solved? Unfortunately, everyone is doing this differently, and are essentially throwing out OWL altogether. We lack a standard way to map OWL into Property Graphs, so everyone invents their own. This is also true for people using RDF stores, people often have their own custom OWL mapping that is less verbose. In some cases this is semantically dubious, as is the case for the Wikipedia mapping.
The simple thing is for everyone to get around a common standard mapping, and RDF* seems a good foundation. Even if you are using plain RDF, you could follow this standard and choose to map edge properties to reified nodes, or to named graphs, or to the Wikidata model. And if you are using a graph database like Neo4J, there is a straightforward mapping to edge properties.
I will call this mapping OWL*, and it may look something like this:
Note that the code of each of these mappings is a single edge/triple between class c, class d, and an edge label p. The first row is a standard existential restriction common to many ontologies. The second row is for statements such as ‘hand has part 5 fingers’, which is still essentially a link between a hand concept and a finger concept. The 3rd is for a GCI, an advanced OWL concept which turns out to be quite intuitive and useful at the graph level, where we are essentially contextualizing the statement. E.g. in developmentally normal adult humans (context), hand has-part 5 finger.
When it comes to a complete encoding of all of OWL there may be decisions to be made as to when to introduce blank nodes vs cramming as much into edge properties (e.g. for logical definitions), but even having a standard way of encoding subclass plus quantified restrictions would be a huge boon.
Bonus: Explicit deferral of semantics where required
Many biological relationships expressed in natural language in forms such as “Lmo-2 binds to Elf-2” or “crocodiles eat wildebeest” can cause formal logical modelers a great deal of trouble. See for example “Lmo-2 interacts with Elf-2”On the Meaning of Common Statements in Biomedical Literature (also slides) which lays out the different ways these seemingly straightforward statements about classes can be modeled. This is a very impressive and rigorous work (I will have more to say on how this aligns with GO-CAM in a future post), and ends with an impressive Wall of Logic:
Dense logical axioms proposed by Schulz & Jansen for representing biological interactions
this is all well and good, but when it comes to storing the biological knowledge in a database, the majority of developers are going to expect to see this:
protein interaction represented as a single edge connecting two nodes, as represented in every protein interaction database
And this is not due to some kind of semantic laziness on their part: representing biological interactions using this graphical formalism (whether we are representing molecular interactions or ecological interactions) allows us to take advantage of powerful graph-theoretic algorithms to analyze data that are frankly much more useful than what we can do with a dense FOL representation.
I am sure this fact is not lost on the authors of the paper who might even regard this as somewhat trivial, but the point is that right now we don’t have a standard way of serializing more complex semantic expressions into the right graphs. Instead we have two siloed groups, one from a formal perspective producing complex graphs with precise semantics, and the other producing useful graphs with no encoding of semantics.
RDF* gives us the perfect foundation for being able to directly represent the intuitive biological statement in a way that is immediately computationally useful, and to adorn the edges with additional triples that more precisely state the desired semantics, whether it is using the Schulz FOL or something simpler (for example, a simple some-some statement is logically valid, if inferentially weak here).
Beyond FOL
There is no reason to have a single standard for specifying semantics for RDF* and PGs. As hinted in the initial example, there could be a vocabulary or series of vocabularies for making probabilistic assertions, either as simple assignments of probabilities or frequencies, e.g.
or more complex statements involving conditional probabilities between multiple nodes (e.g. probability of symptom given disease and age of patient), allowing encoding of ontological Bayesian networks and Markov networks.
We could also represent contextual knowledge, using a ‘that’ construct borrowed from ILK:
<<:clark_kent owl:sameAs :superman>> a ikl:that ; :believed-by :lois_lane .
which could be visually represented as:
Lois Lane believes Clark Kent is Superman. Here an edge has a link to another node rather than simply literals. Note that while possible in RDF*, in some graph databases such as Neo4j, edge properties cannot point directly to nodes, only indirectly through key properties. In other hypergraph-based graph DBs a direct link is possible.
Proposed Approach
What I propose is a series of lightweight vocabularies such as my proposed OWL*, accompanied by mapping tables such as the one above. I am not sure if W3C is the correct approach, or something more bottom-up. These would work directly in concert with RDF*, and extensions could easily be provided to work with various ways to PG-ify RDF, e.g. reification, Wikidata model, NGs.
The same standard could work for any PG database such as Neo4J. Of course, here we have the challenge of how to best to encode IRIs in a framework that does not natively support these, but this is an orthogonal problem.
All of this would be non-invasive and unobtrusive to people already working with these, as the underlying structures used to encode knowledge would likely not change, beyond an additional adornments of edges. A perfect stealth standard!
It would help to have some basic tooling around this. I think the following would be straightforward and potentially very useful:
Implementation of the OWL* mapping of existing OWL documents to RDF* in tooling – maybe the OWLAPI, although we are increasingly looking to Python for our tooling (stay tuned to hear more on funowl).
This could also directly bypass RDF* and go directly to some PG representation, e.g. networkx in Python, or stored directly into Neo4J
Some kind of RDF* to Neo4J and SPARQL* to OpenCypher [which I assume will happen independently of anything proposed here]
And OWL-RL* reasoner that could demonstrate simple yet powerful and useful queries, e.g. property chaining in Wikidata
A rough sketch of this approach was posted on public-owl-dev to not much fanfare, but, umm, this may not be the right forum for this.
Glossing over the details
For a post about semantics, I am glossing over the semantics a bit, at least from a formal computer science perspective. Yes of course, there are some difficult details to be worked out regarding the extent to which existing RDF semantics can be layered on, and how to make these proposed layers compatible. I’m omitting details here to try and give as simple an overview as possible. And it also has to be said, one has to be pragmatic here. People are already making property graphs and RDF graphs conforming to the simple structures I’m describing here. Just look at Wikidata and how it handles (or rather, ignores) OWL. I’m just the messenger here, not some semantic anarchist trying to blow things up. Rather than worrying about whether such and such a fragment of FOL is decidable (which lets face it is not that useful a property in practice) let’s instead focus on coming up with pragmatic standards that are compatible with the way people are already using technology!
obographviz converts ontologies (in OBO Graph JSON) to dot/graphviz with powerful JSON stylesheet configuration. The JSON style sheets allow for configurable color and sizing of nodes and edges. While graphviz can look a bit static and clunky compared to these newfangled javascript UI libraries, I still find it the most useful way to visualize complex dense ontology graphs.
One of the most challenging things to visualize is the parallel structure of multiple ontologies covering a similar area (e.g. multiple anatomy ontologies, or phenotype ontologies, each covering a different species).
This new release allows for the nesting of equivalence cliques from either xrefs or equivalence axioms. This can be used to visualize the output of algorithms such as our kBOOM (Bayesian OWL Ontology Merging) tool. Starting with a defined set of predicates/properties (for example, the obo “xref” property, or an owl equivalentClasses declaration between two classes), it will transitively expand these until a maximal clique is found for each node. All such mutually connected nodes with then be visualized inside one of these cliques.
An example is shown below (Uberon (yellow) aligned with ZFA (black) and two Allen Brain Atlas ontologies (grey and pink); each ontology has its own color; any set of species-equivalent nodes are clustered together with a box drawn around them. isa=black, part_of=blue). Classes that are unique to a single ontology are shown without a bounding box.