Biological Knowledge Graph Modeling Design Patterns

 

This document provides an overview of two modeling strategies/patterns used for building knowledge graphs and triplestores of core biological ‘knowledge’ (e.g. relations between genes, chemicals, diseases, environments, phenotypes, diseases, variants). I call these patterns Knowledge Graph Modeling and OWL (aka logical) modeling. These are complementary and can work together, but I have found it to be useful to always be aware of the ‘mode’ one if working in.

 

I don’t have a formal definition of ‘knowledge graph’. I realize it is in part a marketing term, but I think there are some key features that are commonly associated with KGs that may distinguish them from the way I have modeled things in RDF/OWL. In particular KGs are more commonly associated with property graphs and technologies such as Neo4J, and naturally accommodate information on edges (not just provenance, but things that have a semantic impact). In contrast, RDF/OWL modeling will more commonly introduce nodes for these, and place these nodes in the context of an ontology.

 

I found this slide to be a pretty useful definition of the salient features of a KG (slide from Uber’s Joshua Shinavier from this week’s US2TS meeting):

 

https://twitter.com/mrvaidya/status/1105154093180432386

uber

type and identity of each vertex and edge meaningful to both humans and software; emphasize human understanding; success of graph data models has much to do with psychology; sub-symbolic data sets e.g. ML models are not KGs. KGs can be thought of as a useful medium of exchange between what machines are generating and what humanity would like to consume (Paul Groth)

 

Some other thoughts on KG from members of the semantic web community:

 

Here, rather than focusing on a definition I attempt to identify two clusters of modeling patterns. I have found this to be useful for some of the work we have done on different biological data integration, curation, and ontology projects. In particular, for the NCATS Translator project, one of the things we are working on is merging multiple KGs from multiple different teams, where different teams use different technologies (e.g. Neo4J and Triplestores) and where each team builds KGs with different purposes.

I am curious how well these translate to different domains (if at all). The life sciences may be unusual in having so many named entities such as genes and drugs that are in a quantum superposition of being instance-like, named, countable things while at the same time being class-like, repeated things that vary in their instantiation according to context. This ends up having a big impact on data modeling.

cat

Genes have Schrodinger’s cat qualities, with class-like characteristics and instance-like characteristics, depending on how you look at it

Knowledge Graph Modeling Features and Patterns

Rather than start with a definition, I give as an illustrative example a graphic of a schema from a Neo4J database of biological entities (from this tweet from Daniel Himmelstein)

C0E8_9xWIAEXeKr

Simple Rules of KGM

  1. Is knowledge represented as a graph in some meaningful way. Any old conversion of data to a neo4j database or RDF does not count. It should be meaningfully connected, with traversals allowing us to broadly see the connectedness of some piece of biology. It should be more than just an ontology, and should include connections between the named entities in the domain. This is not a formal definition: like art, I know it when I see it.
  2. Each node in the graph should correspond to some named thing in the domain; name here applies to either a human-friendly name or recognized databased entity. for example, ‘human Shh’, ‘Fanconi anemia’, ‘metformin’, ‘glial cell’, ‘patient123’, rs34778348
  3. Edges connecting nodes must have a relationship type. (e.g. ‘treats’, ‘has phenotype’, ‘located in’)
  4. Edges should form sentences that are meaningful to a domain scientist or clinician (e.g. ‘ibuprofen treats headache’, ‘Parkinson disease has-feature Tremor’, ‘nucleus part-of cell’)
  5. Inference framework neutral. Inference frameworks include logical deductive reasoning, probabilistic inference, ad-hoc rules. A KG may include edges with probabilities attached with the intent of calculating the probability of subgraphs using the chain rule; or it may include logical quantifiers; or none of the above, and may instead be intended to loosely specify a piece of knowledge (e.g. a classic semantic network)
  6. Commitment to precise logical semantics are not important at this level. This is partially a restatement of the previous rule. Specifically: we do not necessarily care whether ‘ibuprofen’ or ‘human Shh’ is an owl class or instance (it’s just a node), and we do not require ontological commitment about logical quantification on edges.
  7. Edges can have additional information attached. This includes both generic metadata (provenance, evidence) and also biologically important information. E.g. penetrance for a gene-phenotype edge; positional info for a gene-chromosome association. It can also include logical qualifiers and additional semantics, probabilities, etc. There may be different mechanisms for attaching this information (for neo4j, property graphs; for RDF, named graphs or reification), the mechanism is not so important here.
  8. Graph theoretic operations do useful work. E.g. shortest path between nodes. Spreading activation, random walks. Also knowledge graph machine learning techniques, such as those based off of node embeddings, e.g. Knowledge Graph Completion.
  9. Modeling should follow standard documented design patterns. Relationship types should be mapped to an ontology such as RO or SIO. In the NCATS Translator project, we specify that Entity types and Association types should be catalogued in biolink-model
  10. Ontology relationships modeled as single edges. KGMs frequently include ontologies to assist traversal. Some OWL axioms (e.g. Nucleus SubClassOf part-of some Cell) are encoded as multiple RDF triples – these must be converted to single edges in a KG. Optionally, the specific semantics (i.e OWL quantifier) can be added as an edge property if a property graph is used. See the Neo4J mapping for OWL we developed in SciGraph.
  11. A slim ontology of high level upper ontology classes is used for primary classification. Due to de-emphasis on reasoning it is useful to have a primary classification to a small set of classes like gene, protein, disease, etc. In Neo4j these often form the ‘labels’. See the  biolink-model node types. The forthcoming OBO-Core project is also a good candidate. Detailed typing information can also be added.

Examples of KGM

 

Advantages/Disadvantages of KGM

  • Advantage: simplicity and intuitiveness
  • Advantage: visualization
  • Advantage: direct utilization of generic graph algorithms for useful purposes (e.g. shortest path)
  • Advantage: lack of ontological commitment makes agreement on standards like biolink-model easier
  • Disadvantage: less power obtained from OWL deductive reasoning (but transforms are possible, see below)
  • Disadvantage: becomes awkward to model contextual statements and more complex scenarios (e.g. GO-CAMs)

 

OWL (Logical) Modeling Features and Patterns

Note the assumption here is that we are modeling the full connections between entities in a domain (what are sometimes called annotations). For developing ontologies, I assume that direct modeling as an OWL TBox using OWL axioms is always best.

Principles of logical modeling

  1. Classes and instances assumed distinct. Punning is valid in OWL2, and is sometimes unavoidable when following a KG pattern layered on RDF/OWL) but I consider it’s use in a logical modeling context a bad smell.
  2. Many named bio-entities modeled as classes. Example: ‘human Shh gene’, ‘Fanconi anemia’, ‘metformin’, ‘nucleus’; even potentially rs34778348, But not: ‘patient123’.
  3. Classes and Class-level knowledge typically encoded in ontologies within OBO library or equivalent. Example: PD SubClassOf neurodegenerative disease; every nucleus is part of a cell; every digit is part-of some autopod; nothing is part of both a nucleus and a cytoplasm. There are multiple generally agreed upon modeling principles, and general upper ontology agreement here.
  4. Instances and instance-level knowledge typically encoded OUTSIDE ontologies. Example: data about a patient, or a particular tissue sample (although this is borderline, see for example our FANTOM5 ontology)
  5. OWL semantics hold. E.g. if an ontology says chemical A disjoint-with chemical B, and we have a drug class that is a subclass of both, the overall model is incoherent. We are compelled to model things differently (e.g. using has-part)
  6. ‘Standard Annotations’ typically modeled as some-some. The concept of ‘ontology annotation’ in biocuration is typically something like assigning ontology terms to entities in the domain (genes, variants, etc). In the default case ‘annotations’ are assumed to not hold in an all-some fashion. E.g. if we have a GO annotation of protein P to compartment C, we do not interpret as every instance of P being part of some instance of C. A safe default modeling assumption is some-some, but it is also possible to model in terms of dispositions (which is essentially how the NCIT ontology connects genes to processes and diseases). Note that when all-some is used for modeling we get into odd situations such as interaction relationships needing to be stated reciprocally. See Lmn-2 interacts with Elf-2. On the meaning of common statements in biomedical literature by Stefan Shulz and Ludger Jansen for an extended treatment. Note that in KG modeling, this entire issue is irrelevant.
  7. Reification/NGs typically reserved for metadata/provenance. Reification (e.g. using either rdf or owl vocabs) reserved for information about the axiom. The same holds of annotating named graphs. In either case, the reified node or the NG is typically not used for biological information (since it would be invisible to the reasoner). Reification-like n-ary patterns may be used to introduce new biological entities for more granular modeling.
  8. Instances typically introduced to ensure logical correctness. A corollary of the above is that we frequently introduce additional instances to avoid incorrect statements. For example, to represent a GO cell component annotation we may introduce an instance p1 of class P and an instance p2 of class C, and directly connect p1 and p2 (implicitly introducing a some-some relationship between P and C). See below for examples.
  9. Instances provide mechanism for stating context. As per previous rule, if we have introduced context-specific instances, we can arbitrarily add more properties to these. E.g. that p1 is phosphorylated, or p1 is located in tissue1 which is an epithelium.
  10. Introduced instances should have IRIs minted. Blank nodes may be formally correct, but providing URIs has advantages in querying and information management. IRIs may be hashed skolem terms or UUIDs depending on the scenario. We informally designate these as ‘pseudo-anonymous’, in that they are not blank nodes, but share some properties (e.g. typically not assigned a rdfs:label, their URIs does not correspond 1:1 to a named entity in the literature). Note 1: we use the term ‘introduced instance’ to indicate an instance created by the modeler, we assume for example ‘patient123’ already has an IRI. Note 2: OWL axioms may translate to blank nodes as mandated by the OWL spec.
  11. Deductive reasoning performs useful work. This is a consequence of OWL-semantics holding. Deductive (OWL) reasoning should ‘do work’ in the sense of providing useful inferences, either in the form of model checking (e.g. in QC) or in the ability to query for implicit relationships. If reasoning is not performing useful work, it is a sign of ‘pseudo-precision’ or overmodeling, and that precise OWL level modeling may not be called for and a simpler KGM may be sufficient (or that the OWL modeling needs changed).

 

Advantages/Disadvantages of Logical Modeling

  • Advantage: formal correctness and coherency
  • Advantage: reasoning performs useful work
  • Advantage: representing contextual statements naturally
  • Advantage: changing requirements resulting in additional granularity or introduction of context can be handled gracefully by additing to existing structures
  • Disadvantage: Additional nodes and edges in underlying RDF graph
  • Disadvantage: Impedance mismatch when using neo4j or assuming the underlying graph has properties of KGM (e.g. hopping from one named entity to another)

Example: phenotypes

Consider a simple KG for connecting patients to phenotypes. We can make edges:

  • Patient123 rdf:type Human (NCBITaxon class)
  • Patient123 has-phenotype ‘neuron degeneration’ (HP class)
  • Patient123 has-phenotype ‘tremor’ (HP class)
  • Etc

(OWL experts will immediately point out that this induces punning in the OWL model; the Neo4j modeler does not know or care what this is).

Now consider the scenario where we want to produce additional contextual info about the particular kind of neuron degeneration, or temporal information about the tremors; and the ontology does not pre-coordinate the terms we need.

One approach is to add additional properties to the edge. E.g. location, onset. This is often sufficient for simple use cases, clients can choose to ask for additional specificity when required. However, there are advantages to putting the context on the node. Note of course that it is not correct to add an edge

  • ‘neuron degeneration’ has-location ‘striatum’

Since we want to talk about the particular ‘neuron degeneration’ happening in the context of patient123. This is where we might want to employ instance-oriented OWL modeling. The pattern would be

  • Patient123 rdf:type Human
  • Patient123 has-phenotype :p1
  • :p1 rdf:type ‘neuron degeneration’
  • :p1 located-in :l1
  • :l1 rdf:type ‘striatum’
  • Patient123 has-phenotype …

This introduces more nodes and edges, but gives a coherent OWL model that can do useful work with reasoning. For example, if a class ‘striatal neuron degeneration’ is later introduced and given an OWL definition, we infer that Patient123 has this phenotype. Additionally, queries for example for ‘striatal phenotypes’ will yield the correct answer.

Hybrid Modeling

It is possible to mix these two modes. We can treat the KG layer as being ‘shortcuts’ that optionally compile down to more granular representations. Also, the KG layer can be inferred via reasoning from the more granular layer. Stay tuned for more posts on these patterns…

Advertisements

OntoTip: Lift/Borrow/Steal Software Engineering Principles

This is one post in a series of tips on ontology development, see the parent post for more details.

The main premise of this piece is that ontology developers can learn from the experience of software engineers. Ontologists are fond of deriving principles based on abstract concepts or philosophical traditions, whereas more engineering-oriented principles such as those found in software engineering have been neglected, which is to our detriment.

Screen Shot 2019-03-09 at 1.31.30 PM

Figure: An appreciation of engineering practice and in particular software development principles is often overlooked by ontologists.

 

In its decades-long history, software development has matured, encompassing practices such as modular design, version control, design patterns, unit testing, continuous integration, and a variety of methodologies , from waterfall top-down design through to extreme and agile development.  Many of these are relevant to ontology development, more than you might think. Even if a particular practice is not directly applicable, knowledge of it can help; thinking like a software engineer can be useful. For example, most good software engineers have internalized the DRY principle (Don’t Repeat Yourself), and will internally curse themselves if they end up duplicating chunks of code or logic as expedient hacks. They know that they are accumulating technical debt (for example, necessitating parallel updates in multiple places). The DRY principle and the DRY way of thinking should also permeate ontology development. Similarly, software developers cultivate a sense of ‘code smell’,  and will tell you if a piece of code has a ‘bad smell’ (naturally, this is always code written by someone else).

Screen Shot 2019-03-09 at 1.33.40 PM

Don’t worry if you don’t have any experience programming, as the equivalent ontology engineering practices and intuitions can be learned through training, experience, the use of appropriate tools, and sharing experience with others. Unfortunately, tools are not yet as mature for ontology engineering as they are for software, but we are trying to address this with the Ontology Development Kit, an ongoing project to provide a framework for ordinary ontology developers to apply standard engineering principles and practice.

Computer scientists and software engineers are also fortunate in having a large body of literature covering their discipline in a holistic fashion; this includes classics such as The Mythical Man Month, The “Gang of Four” Design Patterns book, Martin Fowler’s blog and his book Refactoring. While there are good textbooks on ontologies these tend to be less engineering-focused, at best analogs of the (excellent, but sometimes theoretical) Structure and Interpretation of Computer Programs. Exceptions include the excellent, practical engineering-oriented ontogenesis blog.

An incomplete list of transferrable software concept, principles, and practice includes:

 

I hold that all of the above are either directly transferrable or have strong analogies with ontology development. I hope to expand on many of these on this blog and other forums, and encourage others to do so.

Also, I can’t emphasize strongly enough that I’m not saying that engineering principles are more important than other attributes such as an understanding of a domain. Obviously an ontology constructed with either insufficient knowledge of the domain or inattention to users of the ontology will be rubbish. My point is just that a little time spent honing the skills and sensibilities described here can potentially go a very long way to improving sustainability and maintenance of ontologies.

OntoTips: A series of assorted ontology development guidelines

I am planning a series of blog posts describing some general tips I have found useful in working with groups developing biological ontologies. These are not intended to be rigid rules enforced by ontological high priests; they are intended to provide empirically backed  assistance to ontology developers of different levels of abilities, based on lessons learned in the trenches. I’m going to attempt to stay away from both hand-waving abstraction and platitudinous obvious truths, and focus on concrete practical examples of utility to ontology developers. I also hope to generate constructive and interesting discussion around some of the more controversial recommendations.

Screen Shot 2019-03-09 at 1.28.47 PM

The following is a list of tips I intended to cover. I will add hyperlinks as I write each individual tip. I’m not sure when I will finish writing all articles so apologies for any teasers.

Checking ontologies using ROBOT report (with an example from the cephalopod ontology)

ROBOT is an OBO Tool for use in ontology development pipelines, supported by the Open Bio Ontologies services project. If you set up your ontology repository using the Ontology Development Kit then you will automatically get a workflow that uses ROBOT to perform reasoning and release processes on your ontology. I have previously covered use of ROBOT for reasoner-based validation.

Starting with v1.20 of ROBOT, we now include a new report command. You can call this command to get a report card, giving a health check on multiple aspects of your ontology.

ROBOT vs Cephalopod

You can read the online documentation here, let’s take a look at it in action. We’ll try a report on the cephalopod ontology.

Like most robot commands, the input can either be a local file (specified with –input or –i) or a URL (specified with –input-iri or -I).
robot report -I http://purl.obolibrary.org/obo/ceph.owl

This will return a textual report, broken down into a summary and detail section. Here the summary section starts:


Violations: 236
-----------------
ERROR: 8
WARN: 211
INFO: 17

236 violations, bad cephalopod! The violations are broken into 3 categories: ERROR, WARN and INFO. The most serious ones (ERROR) are listed first:


Level Rule Name Subject Property Value
ERROR duplicate_definition head-mantle fusion [CEPH:0000129] definition [IAO:0000115] .
ERROR duplicate_definition tentacle thickness [CEPH:0000261] definition [IAO:0000115] .
ERROR missing_label anatomical entity [UBERON:0001062] label [rdfs:label]
ERROR missing_ontology_description ceph.owl dc11:description
ERROR duplicate_label leucophore [CEPH:0000284] label [rdfs:label] leucophore
ERROR duplicate_label leucophore [CEPH:0001077] label [rdfs:label] leucophore
ERROR missing_ontology_license ceph.owl dc:license
ERROR missing_ontology_title ceph.owl dc11:title

Duplicate definition is considered a serious problem, since every class in an OBO ontology should be unique, and two classes with the same definition indicates the meaning is the same and the classes should be merged.

Similarly, each class should have a label, otherwise a human has no idea what it is. Note in this case the missing label is on an imported class rather than one native to the ontology itself. Duplicate labels are considered serious, since each class in an ontology should have a unique label, so humans can tell them apart.

Some of the violations are due to missing metadata. Here there is no description of the ontology, no license declared and no title. In OBO we strive to have complete metadata for everyone ontology.

After the ERRORs are the WARNings, for example:


WARN annotation_whitespace accessory gland complex [CEPH:0000004] definition [IAO:0000115] The glandular complex in cirrates that forms sperm packets and is a counterpart of the spermatophore-forming complex of other cephalopods.
WARN annotation_whitespace armature of the arms [CEPH:0000018] definition [IAO:0000115] The grappling structures on the oral surfaces the arms and tentacles, including both suckers and hooks.
WARN annotation_whitespace iteroparous [CEPH:0001067] comment [rdfs:comment] Nautiluses are the only cephalopods to present iteroparity

Annotation whitespace means that there is either trailing, leading, or additional internal whitespace (sorry, fans of two spaces after a period) in the value used in an annotation assertion (remember, ROBOT uses OWL terminology, and here “annotation” means a non-logical axiom, typically a piece of textual metadata used by humans). The annotation property is also reported. It helps to memorize a few IAO properties, e.g IAO:0000115 is the property for (text) definition.

These kinds of issues are not usually harmful, but they can confound some lexical operations.

Currently WARN includes a few false positives, e.g.


WARN multiple_asserted_superclasses anal flap [CEPH:0000009] rdfs:subClassOf valve [UBERON:0003978]
WARN multiple_asserted_superclasses anal flap [CEPH:0000009] rdfs:subClassOf ectoderm-derived structure [UBERON:0004121]

It should be stressed that there is NOTHING WRONG with multiple inheritance in an ontology. It is good engineering practice to assert at most one in the editors version and infer the rest, but as far as what the end-user sees it is all the same. In future this check should only apply to the editors version and not to the released or post-reasoned version.


WARN missing_definition transverse row of suckers [CEPH:0000306] definition [IAO:0000115]
WARN missing_definition cephalopod sperm receptacle [CEPH:0001017] definition [IAO:0000115]
WARN missing_definition inner sucker ring [CEPH:0001020] definition [IAO:0000115]
WARN missing_definition circumoral appendage bud [CEPH:0000003] definition [IAO:0000115]

This one is a more serious problem. One of the OBO principles is that classes should have definitions for human users. Without definitions, we must rely on the label or placement in the graph to intuit meaning, and this is often unreliable.

While text definitions are incredibly important, we only count this as a WARNing because currently even well curated ontologies frequently miss some definitions, having every class defined is a high bar.

Following the warnings are the INFOs, for example


INFO lowercase_definition anal photophore [CEPH:0000304] definition [IAO:0000115] photophore at side of anus

This is warning us that the class for anal photophore has a definition that is all lowercase. This is not particularly serious. However, having inconsistent lexical rules in an ontology can look unprofessional, and in OBO we have the convention that all text definitions have the first word capitalized. Many ontologies such as GO end with a period.

Use in ontology pipelines

The report command can be used in ontology pipelines to “fail fast” if the report produces ERRORs. The default behavior is to exit with a non-zero return code (standard in unix for a command that fails).

This can be configured – for example, if you wish to create a report for informative purposes, but you are not yet up to the standard that ROBOT imposes. You can also set ROBOT to be more strict by failing if a single warning is found, like this:

robot report --input cephalopod.owl \
  --fail-on WARN \
  --output report.tsv

Specify “fail-on none” if you never want the report command to fail, no matter what it finds.

As the output is TSV it’s easy to load into Excel or Google sheets (if you like that sort of thing), to transform to a Wiki table, etc. You can also get the report as YAML.

How does this work?

Each type of check in the report is implemented as a SPARQL query. This makes the framework very declarative, transparent, and easy to extend. Many ontology editors who do not code can read SPARQL queries, which makes it easier to figure out which each check is doing.

You can see the queries in the ROBOT repository.

Who decides on the rules?

Hold on, you may be thinking. Who gave these ROBOT people the right to decide what counts as an error? Maybe you have duplicate definitions and you think that is just fine.

This should not prevent you from using ROBOT, as everything is customizable. You can create your own profile where duplicate definitions are not errors.

However, our overall goal is to come up with a set of checks that are agreed across the OBO community. We have made initial decisions based on our combined experience with diverse ontologies, but we want community feedback. Please give us feedback on the ROBOT GitHub issue tracker!

Ultimately we plan to use the default profile to automatically evaluate ontologies within the OBO library, and provide assistance to reviewers of ontologies.

Acknowledgments

The report command and accompanying queries were written by Becky Tauber. Thanks also to James Overton, HyeongSik Kim, and members of the OBO community who provided specifications, directions, and feedback. In particular many checks came from the set developed originally for the Gene Ontology. ROBOT is supported by the OBO services grant to Bjorn Peters and myself.