Biological Knowledge Graph Modeling Design Patterns


This document provides an overview of two modeling strategies/patterns used for building knowledge graphs and triplestores of core biological ‘knowledge’ (e.g. relations between genes, chemicals, diseases, environments, phenotypes, variants). I call these patterns Knowledge Graph Modeling and OWL (aka logical) modeling. These are complementary and can work together, but I have found it useful to always be aware of the ‘mode’ one is working in.


I don’t have a formal definition of ‘knowledge graph’. I realize it is in part a marketing term, but I think there are some key features that are commonly associated with KGs that may distinguish them from the way I have modeled things in RDF/OWL. In particular KGs are more commonly associated with property graphs and technologies such as Neo4J, and naturally accommodate information on edges (not just provenance, but things that have a semantic impact). In contrast, RDF/OWL modeling will more commonly introduce nodes for these, and place these nodes in the context of an ontology.


I found this slide to be a pretty useful definition of the salient features of a KG (slide from Uber’s Joshua Shinavier from this week’s US2TS meeting):


type and identity of each vertex and edge meaningful to both humans and software; emphasize human understanding; success of graph data models has much to do with psychology; sub-symbolic data sets e.g. ML models are not KGs. KGs can be thought of as a useful medium of exchange between what machines are generating and what humanity would like to consume (Paul Groth)


Some other thoughts on KG from members of the semantic web community:


Here, rather than focusing on a definition I attempt to identify two clusters of modeling patterns. I have found this to be useful for some of the work we have done on different biological data integration, curation, and ontology projects. In particular, for the NCATS Translator project, one of the things we are working on is merging multiple KGs from multiple different teams, where different teams use different technologies (e.g. Neo4J and Triplestores) and where each team builds KGs with different purposes.

I am curious how well these translate to different domains (if at all). The life sciences may be unusual in having so many named entities such as genes and drugs that are in a quantum superposition of being instance-like, named, countable things while at the same time being class-like, repeated things that vary in their instantiation according to context. This ends up having a big impact on data modeling.


Genes have Schrodinger’s cat qualities, with class-like characteristics and instance-like characteristics, depending on how you look at it

Knowledge Graph Modeling Features and Patterns

Rather than start with a definition, I give as an illustrative example a graphic of a schema from a Neo4J database of biological entities (from this tweet from Daniel Himmelstein)


Simple Rules of KGM

  1. Knowledge is represented as a graph in some meaningful way. Any old conversion of data to a neo4j database or RDF does not count. It should be meaningfully connected, with traversals allowing us to broadly see the connectedness of some piece of biology. It should be more than just an ontology, and should include connections between the named entities in the domain. This is not a formal definition: like art, I know it when I see it.
  2. Each node in the graph should correspond to some named thing in the domain; name here applies to either a human-friendly name or a recognized database entity. For example, ‘human Shh’, ‘Fanconi anemia’, ‘metformin’, ‘glial cell’, ‘patient123’, rs34778348
  3. Edges connecting nodes must have a relationship type. (e.g. ‘treats’, ‘has phenotype’, ‘located in’)
  4. Edges should form sentences that are meaningful to a domain scientist or clinician (e.g. ‘ibuprofen treats headache’, ‘Parkinson disease has-feature Tremor’, ‘nucleus part-of cell’)
  5. Inference framework neutral. Inference frameworks include logical deductive reasoning, probabilistic inference, ad-hoc rules. A KG may include edges with probabilities attached with the intent of calculating the probability of subgraphs using the chain rule; or it may include logical quantifiers; or none of the above, and may instead be intended to loosely specify a piece of knowledge (e.g. a classic semantic network)
  6. Commitment to precise logical semantics is not important at this level. This is partially a restatement of the previous rule. Specifically: we do not necessarily care whether ‘ibuprofen’ or ‘human Shh’ is an OWL class or instance (it’s just a node), and we do not require ontological commitment about logical quantification on edges.
  7. Edges can have additional information attached. This includes both generic metadata (provenance, evidence) and also biologically important information. E.g. penetrance for a gene-phenotype edge; positional info for a gene-chromosome association. It can also include logical qualifiers and additional semantics, probabilities, etc. There may be different mechanisms for attaching this information (for neo4j, property graphs; for RDF, named graphs or reification); the mechanism is not so important here.
  8. Graph theoretic operations do useful work. E.g. shortest path between nodes. Spreading activation, random walks. Also knowledge graph machine learning techniques, such as those based off of node embeddings, e.g. Knowledge Graph Completion.
  9. Modeling should follow standard documented design patterns. Relationship types should be mapped to an ontology such as RO or SIO. In the NCATS Translator project, we specify that Entity types and Association types should be catalogued in biolink-model
  10. Ontology relationships modeled as single edges. KGMs frequently include ontologies to assist traversal. Some OWL axioms (e.g. Nucleus SubClassOf part-of some Cell) are encoded as multiple RDF triples – these must be converted to single edges in a KG. Optionally, the specific semantics (i.e. OWL quantifier) can be added as an edge property if a property graph is used. See the Neo4J mapping for OWL we developed in SciGraph.
  11. A slim ontology of high level upper ontology classes is used for primary classification. Due to the de-emphasis on reasoning it is useful to have a primary classification to a small set of classes like gene, protein, disease, etc. In Neo4j these often form the ‘labels’. See the biolink-model node types. The forthcoming OBO-Core project is also a good candidate. Detailed typing information can also be added.
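
As a rough illustration of rules 3, 7, 8 and 10, here is a minimal sketch of a property graph held in plain Python, with typed edges, edge properties, and a breadth-first shortest path. The node ‘geneX’, the relationship type ‘contributes-to’, and the property keys are made up for illustration; this does not reflect any particular Neo4j or biolink-model schema.

```python
from collections import deque

# Each edge: (subject, relationship type, object, edge properties).
# 'geneX', 'contributes-to' and all property keys are hypothetical.
EDGES = [
    ("geneX", "contributes-to", "Parkinson disease", {"evidence": "hypothetical"}),
    ("Parkinson disease", "has-feature", "tremor", {}),
    ("ibuprofen", "treats", "headache", {}),
    # Rule 10: an OWL axiom such as 'Nucleus SubClassOf part-of some Cell'
    # collapsed to a single edge, keeping the quantifier as an edge property.
    ("nucleus", "part-of", "cell", {"owl_quantifier": "some"}),
]


def build_adjacency(edges):
    """Index edges in both directions so traversal ignores edge direction."""
    adjacency = {}
    for subject, relation, obj, _props in edges:
        adjacency.setdefault(subject, []).append((relation, obj))
        adjacency.setdefault(obj, []).append((f"inverse({relation})", subject))
    return adjacency


def shortest_path(edges, start, goal):
    """Breadth-first search; returns [node, relation, node, ...] or None."""
    adjacency = build_adjacency(edges)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for relation, neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [relation, neighbour])
    return None


path = shortest_path(EDGES, "geneX", "tremor")
```

The returned path alternates nodes and relationship types, so it can be read back as the kind of sentence rule 4 asks for. Nodes with no connecting edges (here, ‘geneX’ and ‘headache’) yield `None`.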

Examples of KGM


Advantages/Disadvantages of KGM

  • Advantage: simplicity and intuitiveness
  • Advantage: visualization
  • Advantage: direct utilization of generic graph algorithms for useful purposes (e.g. shortest path)
  • Advantage: lack of ontological commitment makes agreement on standards like biolink-model easier
  • Disadvantage: less power obtained from OWL deductive reasoning (but transforms are possible, see below)
  • Disadvantage: becomes awkward to model contextual statements and more complex scenarios (e.g. GO-CAMs)


OWL (Logical) Modeling Features and Patterns

Note the assumption here is that we are modeling the full connections between entities in a domain (what are sometimes called annotations). For developing ontologies, I assume that direct modeling as an OWL TBox using OWL axioms is always best.

Principles of logical modeling

  1. Classes and instances assumed distinct. Punning is valid in OWL2 (and is sometimes unavoidable when following a KG pattern layered on RDF/OWL) but I consider its use in a logical modeling context a bad smell.
  2. Many named bio-entities modeled as classes. Example: ‘human Shh gene’, ‘Fanconi anemia’, ‘metformin’, ‘nucleus’; even potentially rs34778348. But not: ‘patient123’.
  3. Classes and Class-level knowledge typically encoded in ontologies within OBO library or equivalent. Example: PD SubClassOf neurodegenerative disease; every nucleus is part of a cell; every digit is part-of some autopod; nothing is part of both a nucleus and a cytoplasm. There are multiple generally agreed upon modeling principles, and general upper ontology agreement here.
  4. Instances and instance-level knowledge typically encoded OUTSIDE ontologies. Example: data about a patient, or a particular tissue sample (although this is borderline, see for example our FANTOM5 ontology)
  5. OWL semantics hold. E.g. if an ontology says chemical A disjoint-with chemical B, and we have a drug class that is a subclass of both, the overall model is incoherent. We are compelled to model things differently (e.g. using has-part)
  6. ‘Standard Annotations’ typically modeled as some-some. The concept of ‘ontology annotation’ in biocuration is typically something like assigning ontology terms to entities in the domain (genes, variants, etc). In the default case ‘annotations’ are assumed to not hold in an all-some fashion. E.g. if we have a GO annotation of protein P to compartment C, we do not interpret this as every instance of P being part of some instance of C. A safe default modeling assumption is some-some, but it is also possible to model in terms of dispositions (which is essentially how the NCIT ontology connects genes to processes and diseases). Note that when all-some is used for modeling we get into odd situations such as interaction relationships needing to be stated reciprocally. See ‘Lmo-2 interacts with Elf-2’: On the meaning of common statements in biomedical literature by Stefan Schulz and Ludger Jansen for an extended treatment. Note that in KG modeling, this entire issue is irrelevant.
  7. Reification/NGs typically reserved for metadata/provenance. Reification (e.g. using either rdf or owl vocabs) is reserved for information about the axiom. The same holds for annotating named graphs. In either case, the reified node or the NG is typically not used for biological information (since it would be invisible to the reasoner). Reification-like n-ary patterns may be used to introduce new biological entities for more granular modeling.
  8. Instances typically introduced to ensure logical correctness. A corollary of the above is that we frequently introduce additional instances to avoid incorrect statements. For example, to represent a GO cell component annotation we may introduce an instance p1 of class P and an instance p2 of class C, and directly connect p1 and p2 (implicitly introducing a some-some relationship between P and C). See below for examples.
  9. Instances provide mechanism for stating context. As per previous rule, if we have introduced context-specific instances, we can arbitrarily add more properties to these. E.g. that p1 is phosphorylated, or p1 is located in tissue1 which is an epithelium.
  10. Introduced instances should have IRIs minted. Blank nodes may be formally correct, but providing URIs has advantages in querying and information management. IRIs may be hashed skolem terms or UUIDs depending on the scenario. We informally designate these as ‘pseudo-anonymous’, in that they are not blank nodes, but share some properties (e.g. typically not assigned an rdfs:label, their URIs do not correspond 1:1 to a named entity in the literature). Note 1: we use the term ‘introduced instance’ to indicate an instance created by the modeler; we assume for example ‘patient123’ already has an IRI. Note 2: OWL axioms may translate to blank nodes as mandated by the OWL spec.
  11. Deductive reasoning performs useful work. This is a consequence of OWL-semantics holding. Deductive (OWL) reasoning should ‘do work’ in the sense of providing useful inferences, either in the form of model checking (e.g. in QC) or in the ability to query for implicit relationships. If reasoning is not performing useful work, it is a sign of ‘pseudo-precision’ or overmodeling, and that precise OWL level modeling may not be called for and a simpler KGM may be sufficient (or that the OWL modeling needs to be changed).
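
To make principles 8 and 10 a little more concrete, here is a minimal sketch of expanding one class-level annotation (P part-of C) into a some-some pattern over introduced instances, with IRIs minted as hashed skolem terms. The namespace, the annotation identifier, and the function names are assumptions for illustration, not an established convention.

```python
import hashlib

# Hypothetical namespace for minted ('pseudo-anonymous') instance IRIs.
SKOLEM_BASE = "http://example.org/skolem/"


def mint_iri(*parts):
    """Deterministically mint an IRI by hashing the parts that identify it."""
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]
    return SKOLEM_BASE + digest


def expand_annotation(annotation_id, subject_class, relation, object_class):
    """Expand a class-to-class annotation into instance-level triples.

    Returns three triples: an instance of the subject class, an instance of
    the object class, and the relation asserted between the two instances.
    """
    subject_instance = mint_iri(annotation_id, "subject", subject_class)
    object_instance = mint_iri(annotation_id, "object", object_class)
    return [
        (subject_instance, "rdf:type", subject_class),
        (object_instance, "rdf:type", object_class),
        (subject_instance, relation, object_instance),
    ]


# 'annotation-1', 'P' and 'C' are placeholders for a real annotation record.
triples = expand_annotation("annotation-1", "P", "part-of", "C")
```

Because the IRIs are derived from the annotation identifier, re-running the conversion yields the same instances rather than a fresh set each time; a UUID-based variant would trade that stability for guaranteed uniqueness.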


Advantages/Disadvantages of Logical Modeling

  • Advantage: formal correctness and coherency
  • Advantage: reasoning performs useful work
  • Advantage: representing contextual statements naturally
  • Advantage: changing requirements resulting in additional granularity or introduction of context can be handled gracefully by adding to existing structures
  • Disadvantage: Additional nodes and edges in underlying RDF graph
  • Disadvantage: Impedance mismatch when using neo4j or assuming the underlying graph has properties of KGM (e.g. hopping from one named entity to another)

Example: phenotypes

Consider a simple KG for connecting patients to phenotypes. We can make edges:

  • Patient123 rdf:type Human (NCBITaxon class)
  • Patient123 has-phenotype ‘neuron degeneration’ (HP class)
  • Patient123 has-phenotype ‘tremor’ (HP class)
  • Etc

(OWL experts will immediately point out that this induces punning in the OWL model; the Neo4j modeler does not know or care what this is).

Now consider the scenario where we want to produce additional contextual info about the particular kind of neuron degeneration, or temporal information about the tremors; and the ontology does not pre-coordinate the terms we need.

One approach is to add additional properties to the edge. E.g. location, onset. This is often sufficient for simple use cases; clients can choose to ask for additional specificity when required. However, there are advantages to putting the context on the node. Note of course that it is not correct to add an edge

  • ‘neuron degeneration’ has-location ‘striatum’

This is because we want to talk about the particular ‘neuron degeneration’ happening in the context of patient123. This is where we might want to employ instance-oriented OWL modeling. The pattern would be

  • Patient123 rdf:type Human
  • Patient123 has-phenotype :p1
  • :p1 rdf:type ‘neuron degeneration’
  • :p1 located-in :l1
  • :l1 rdf:type ‘striatum’
  • Patient123 has-phenotype …

This introduces more nodes and edges, but gives a coherent OWL model that can do useful work with reasoning. For example, if a class ‘striatal neuron degeneration’ is later introduced and given an OWL definition, we infer that Patient123 has this phenotype. Additionally, queries for (say) ‘striatal phenotypes’ will yield the correct answer.
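
The inference described above can be sketched with a toy classifier. This is not an OWL reasoner: it hand-rolls a single pattern (a genus class plus one existential restriction), ignores subclass hierarchies, and assumes a simplified definition format invented here for illustration.

```python
# Instance-level triples from the pattern above.
TRIPLES = {
    ("Patient123", "rdf:type", "Human"),
    ("Patient123", "has-phenotype", ":p1"),
    (":p1", "rdf:type", "neuron degeneration"),
    (":p1", "located-in", ":l1"),
    (":l1", "rdf:type", "striatum"),
}

# Hypothetical definition format: class name -> (genus, relation, filler),
# read as "genus and (relation some filler)".
DEFINITIONS = {
    "striatal neuron degeneration": ("neuron degeneration", "located-in", "striatum"),
}


def types_of(node, triples):
    """Asserted classes of a node."""
    return {o for s, p, o in triples if s == node and p == "rdf:type"}


def infer_types(triples, definitions):
    """Return new rdf:type triples for instances matching a definition."""
    inferred = set()
    for name, (genus, relation, filler) in definitions.items():
        for s, p, o in triples:
            if (p == relation
                    and genus in types_of(s, triples)
                    and filler in types_of(o, triples)):
                inferred.add((s, "rdf:type", name))
    return inferred


def phenotypes_of(patient, triples):
    """Classes of every phenotype instance attached to the patient."""
    instances = {o for s, p, o in triples
                 if s == patient and p == "has-phenotype"}
    return {cls for inst in instances for cls in types_of(inst, triples)}


combined = TRIPLES | infer_types(TRIPLES, DEFINITIONS)
patient_phenotypes = phenotypes_of("Patient123", combined)
```

After inference, `patient_phenotypes` contains both the asserted ‘neuron degeneration’ and the newly defined ‘striatal neuron degeneration’, even though the latter class did not exist when the patient data was recorded.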

Hybrid Modeling

It is possible to mix these two modes. We can treat the KG layer as being ‘shortcuts’ that optionally compile down to more granular representations. Also, the KG layer can be inferred via reasoning from the more granular layer. Stay tuned for more posts on these patterns…
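
As a minimal sketch of deriving the KG layer from the granular layer, the function below collapses each edge from a named entity to an introduced instance into a ‘shortcut’ edge pointing at that instance’s class. The convention that introduced instances start with ‘:’ is an assumption made for this example only; a real pipeline would more likely use a reasoner or explicit rules.

```python
# Granular, instance-oriented triples (same shape as the phenotype example).
GRANULAR = [
    ("Patient123", "rdf:type", "Human"),
    ("Patient123", "has-phenotype", ":p1"),
    (":p1", "rdf:type", "neuron degeneration"),
    (":p1", "located-in", ":l1"),
    (":l1", "rdf:type", "striatum"),
]


def is_introduced(node):
    """Assumed convention: introduced instances are written with a ':' prefix."""
    return node.startswith(":")


def collapse_to_shortcuts(triples, introduced):
    """Derive named-entity-to-class shortcut edges from granular triples."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    shortcuts = set()
    for s, p, o in triples:
        # Keep only edges that leave a named entity and land on an
        # introduced instance; replace the instance with each of its classes.
        if p != "rdf:type" and introduced(o) and not introduced(s):
            for cls in types.get(o, ()):
                shortcuts.add((s, p, cls))
    return shortcuts


kg_edges = collapse_to_shortcuts(GRANULAR, is_introduced)
```

The result is the single edge Patient123 has-phenotype ‘neuron degeneration’. The striatum context is deliberately lost in this direction, which is the trade-off of working at the KG layer.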