Edge properties, part 1: Reification

This is the first part of what will be a multi-part series. See also part 2 on the singleton property pattern.

One of the ways in which RDF differs from Labeled Property Graph (LPG) models such as the data model in Neo4J is that there is no first-class mechanism for making statements about statements. For example, given a triple :P1 :interacts-with :P2, how do we say that triple is supported by a particular publication?

With an LPG, an edge can have properties associated with it in addition to the main edge label. In Neo4J documentation, this is often depicted as tag-values underneath the edge label. So if the assertion that P1 interacts with P2 is supported by a publication such as PMID:123 we might write this as:

(Note that some data models such as Neo4J don’t directly support hypergraphs, and if we wanted to represent pmid:123 as a distinct node with its own properties, then the association between the edge property and the node would be implicit rather than explicit.)

In RDF, properties cannot be directly associated with edges. How would we represent something like the above in RDF? In fact there are multiple ways of modeling this.

A common approach is reification. Here we would create an extra node representing the statement, associate this with the original triple via three new triples, and then the statement node can be described as any other node. E.g.

This can be depicted visually as follows (note that while the first triple directly connecting P1 and P2 may seem redundant, it is not formally entailed by RDF semantics and should also be stated):

This is obviously quite verbose, so there are different visual conventions and syntactic shortcuts to reduce bloat.

RDF* provides a more convenient compact syntax for writing edge properties:

  • <<:P1 :interacts_with :P2>>  :supported_by :pmid123 .

Here the <<…>> acts as syntactic sugar, with the single line of RDF* above expanding to the 6 reification triples shown earlier.
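The expansion is mechanical enough to sketch in a few lines of plain Python, using tuples for triples. This is a hypothetical helper for illustration, not part of any standard RDF library:

```python
# Sketch of the RDF* -> reification expansion using plain tuples for
# triples; a hypothetical helper, not part of any standard RDF library.

RDF = "rdf:"

def expand_to_reification(s, p, o, ann_p, ann_o, stmt_node="_:stmt1"):
    """Return the triples the single RDF* line expands to: the asserted
    triple itself, a reification node describing it, and the annotation
    attached to that node."""
    return [
        (s, p, o),                                   # the asserted triple
        (stmt_node, RDF + "type", RDF + "Statement"),
        (stmt_node, RDF + "subject", s),
        (stmt_node, RDF + "predicate", p),
        (stmt_node, RDF + "object", o),
        (stmt_node, ann_p, ann_o),                   # the edge property
    ]

triples = expand_to_reification(":P1", ":interacts_with", ":P2",
                                ":supported_by", ":pmid123")
assert len(triples) == 6
```

Note that the asserted triple is emitted alongside the reification triples, since (as discussed above) it is not entailed by them.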

RDF* is not yet a W3C standard, but a large number of tools support it. It is accompanied by SPARQL* for queries.

There is a lot more to be said about the topic of edge properties in LPGs and RDF, I will try to cover these in future posts. This includes:

  • Alternatives to RDF reification, of which there are many
    • Named Graphs, which capitalize on the fact that triplestores are actually quad stores, and use the graph with which a triple is associated as a site of attachment for edge properties.
    • The Singleton Property Pattern (SPP). This has some adherents, but is not compatible with OWL-DL modeling. This is addressed in part two of this series
    • Alternative Reification Vocabularies. This includes the OWL reification vocabulary. It’s immensely depressing and confusing and under-appreciated that OWL did not adopt the RDF reification vocabulary, and the OWL stack fails horribly when we try and use the two together. Additionally OWL reification comes with annoying limitations (see my answer on stack overflow about RDF vs OWL reification).
    • RDF* can be seen as an alternative or it can be seen as syntactic sugar and/or a layer of abstraction over existing RDF reification
    • various other design patterns such as those in https://www.w3.org/TR/swbp-n-aryRelations/
  • Semantics of reification. RDF has monotonic semantics. This means that adding new triples (including reification triples) cannot retract the meaning of any existing triples (including the reified triples). So broadly speaking, it’s fine to annotate a triple with metadata (e.g. who said it), but not with something that alters its meaning (e.g. a negation qualifier, or probabilistic semantics). This has implications on how we represent knowledge graphs in RDF, and on proposals for simpler OWL layering on RDF. It also has implications for inference with KGs, both classic deductive boolean inference as well as modern KG embedding and associated ML approaches (e.g. node2vec, embiggen).
  • Alternate syntaxes and tooling that is compatible with RDF and employs higher level abstractions above the verbose bloated reification syntax/model above. This includes RDF*/SPARQL* as well as KGX.

Next: Edge properties, part 2: singleton property pattern (and why it doesn’t work)

Building a COVID-19 Knowledge Graph

With COVID-19 cases continuing to grow in number across the globe, scientists are forming new collaborations in order to better understand all aspects of SARS-CoV-2 together with its impact on human health. One aspect of this is organizing existing and emerging information about viral and host cell molecular biology, disease epidemiology, phenotypic progression, and effect of drugs and other treatments in individuals.

Knowledge Graphs (KGs) provide a way to organize complex heterogeneous information connecting different biological and clinical entities such as genes, drugs, diseases, exposures, phenotypes, and pathways.

For example, the following image shows a graphical (network) representation of SARS-CoV-2 proteins and host human proteins they are hypothesized to interact with, together with existing known human-human protein interactions, annotated with GO terms and drug target information:


SARS-CoV-2 host interaction map; taken from https://www.biorxiv.org/content/10.1101/2020.03.22.002386v1

Graphs such as this can be further extended with other information about the human and viral genes as it becomes available. Mechanisms such as endocytosis can also be included as nodes in the graph, as well as expression states of relevant human cells, etc. Existing ontologies like GO, HPO, Mondo, and CHEBI, together with their annotations, can be conceived of as KGs.


Portion of a KG formed from GO, Mondo, HPO, Genes, and their inter-relationships

These KGs can be used as data warehouses for querying data integrated in a single place. They can also be used as sources in Machine Learning, for tasks such as link prediction. For example: which compounds might be likely to treat a particular disease, based on properties of both the compound and the disease.

The KG-COVID-19 Knowledge Graph Hub

As part of a collaboration between the Monarch Initiative, the Illuminating the Druggable Genome KG project, and PheKnowLater we have been collaboratively building a KG for COVID-19. All of the source is in GitHub, in the Knowledge-Graph-Hub/kg-covid-19 repository.

The project is built around the concept of a KG “Hub”, a lightweight way to build a KG from multiple upstream sources. Any developer can follow the instructions to ingest a new source, and make a Pull Request on the repo. So far we have a number of different sources ingested (detailed in the yaml file), and more on the way. The output is a simple biolink-model compliant KG in a simple TSV format that is compatible with Property Graphs (e.g. Neo4J) as well as RDF graphs. In all cases we use CURIEs that are equivalent to standard URIs, such as OBO Class PURLs.
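To give a feel for the shape of the edge TSV described above, here is a minimal sketch using only the Python standard library. The column names and the example row (a SARS-CoV-2 protein and a human gene, written as CURIEs) are illustrative assumptions, not the exact KGX specification:

```python
import csv
import io

# Illustrative edge rows; identifiers and column names are assumptions
# for this sketch, not the exact KGX TSV specification.
edges = [
    {"subject": "UniProtKB:P0DTC2",
     "edge_label": "biolink:interacts_with",
     "object": "HGNC:13557",
     "provided_by": "example-ingest"},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["subject", "edge_label", "object", "provided_by"],
    delimiter="\t")
writer.writeheader()
writer.writerows(edges)
print(buf.getvalue())
```

Because each edge is one row, the same file can be bulk-loaded into Neo4J or trivially converted to triples, which is the point of the format.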

One of the goals is to use this alongside our N2V framework to discover new links (for example, identifying existing drugs that could be repurposed to treat COVID-19) and generate actionable knowledge.



Knowledge Graphs at the Virtual Biohackathon

The COVID-19 Biohackathon is a virtual event starting today (April 5 2020), lasting for a week, with the goal to “create a cohesive effort and work on tooling for COVID-19 analysis. The biohackathon will lead to more readily accessible data, protocols, detection kits, protein predictions etc.“. The Biohackathon was spearheaded by many of the same people behind the yearly Biohackathon which I have previously reported on.

One of the subgroups at the hackathon is the KnowledgeGraph group. This includes the kg-covid-19 contributors and other luminaries from the life sciences linked data / KG world, including neXtProt, UniProt, KnetMiner, Monarch, HPO, IDG-KG, GO.

I’m excited to see all these people working together as part of a dynamic group to produce tools that aim to help elucidate some of the biology underlying this critical threat. Of course, this is just one very small part of a massive global effort (really what we need to tackle COVID-19 is better public health infrastructure, widespread testing, ventilators, PPE for medical staff and workers on the front line, etc, see How the Pandemic Will End by Ed Jong). But I also think that this is an opportunity for collaborating on some of the crucial knowledge-based tools that have wide applications in biomedicine.

If you want to know more, the details of the biohackathon can be found on its GitHub page, and the kg-covid-19 repository can be found here, with contributor guidelines here.



Proposed strategy for semantics in RDF* and Property Graphs

Update 2020-09-12: I created a GitHub repo that concretizes part of the proposal here https://github.com/cmungall/owlstar

Graph databases such as Neo4J are gaining in popularity. These are in many ways comparable to RDF databases (triplestores), but I will highlight three differences:

  1. The underlying datamodel in most graph databases is a Property Graph (PG). This means that information can be directly attached to edges. In RDF this can only be done indirectly via reification, or reification-like models, or named graphs.
  2. RDF is based on open standards, and comes with a standard query language (SPARQL), whereas a unified set of standards have yet to arrive for PGs.
  3. RDF has a formal semantics, and languages such as OWL can be layered on providing more expressive semantics.

RDF* (and its accompanying query language SPARQL*) is an attempt to bring PGs into RDF, thus providing an answer for points 1-2. More info can be found in this post by Olaf Hartig.

You can find more info in that post and in related docs, but briefly, RDF* adds syntax to attach properties directly to edges, e.g.

<<:bob foaf:friendOf :alice>> ex:certainty 0.9 .

This has a natural visual cognate:


We can easily imagine building this out into a large graph of friend-of connections, or connecting other kinds of nodes, and keeping additional useful information on the edges.

But what about the 3rd item, semantics?

What about semantics?

For many in both linked data/RDF and in graph database/PG camps, this is perceived as a minor concern. In fact you can often find RDF people whinging about OWL being too complex or some such. The “semantic web” has even been rebranded as “linked data”. But in fact, in the life sciences many of us have found OWL to be incredibly useful, and being able to clearly state what your graphs mean has clear advantages.

OK, but then why not just use what we have already? OWL-DL already has a mapping to RDF, and any document in RDF is automatically an RDF* document, so problem solved?

Not quite. There are two issues with continuing the status quo in the world of RDF* and PGs:

  1. The mapping of OWL to RDF can be incredibly verbose and leads to unintuitive graphs that inhibit effective computation.
  2. OWL is not the only fruit. It is great for the use cases it was designed for, but there are other modes of inference and other frameworks beyond first-order logic that people care about.

Issues with existing OWL to RDF mapping

Let’s face it, the existing mapping is pretty ugly. This is especially true for life-science ontologies that are typically construed as relational graphs, where edges are formally SubClassOf-SomeValuesFrom axioms. See the post on obo json for more discussion of this. The basic idea here is that in OWL, object properties connect individuals (e.g. my left thumb is connected to my left hand via part-of). In contrast, classes are not connected directly via object properties; rather they are related via subClassOf and class expressions. It is not meaningful in OWL to say “finger (class) part_of hand (class)”. Instead we seek to say “all instances of finger are part_of some x, where x is an instance of a hand”. In Manchester Syntax this has the compact form

Finger SubClassOf Part_of some Hand

This is translated to RDF as

Finger owl:subClassOf [
    a owl:Restriction ;
    owl:onProperty :part_of ;
    owl:someValuesFrom :Hand
] .

As an example, consider 3 classes in an anatomy ontology, finger, hand, and forelimb, all connected via part-ofs (i.e. every finger is part of some hand, and every hand is part of some forelimb). This looks sensible when we use a native OWL syntax, but when we encode it as RDF we get a monstrosity:


Fig2 (A) two axioms written in Manchester Syntax describing anatomical relationship between three structures (B) corresponding RDF following official OWL to RDF mapping, with 4 triples per existential axiom, and the introduction of two blank nodes (C) How the axioms are conceptualized by ontology developers, domain experts and how most browsers render them. The disconnect between B and C is an enduring source of confusion among many.

This ugliness was not the result of some kind of perverse decision by the designers of the OWL specs, it’s a necessary consequence of the existing stack which bottoms out at triples as the atomic semantic unit.

In fact, in practice many people employ some kind of simplification, bypassing the official mapping and storing the edges as simple triples, even though this is semantically invalid. We can see this for example in how Wikidata loads OBOs into its triplestore. This can cause confusion: for example, WD stores reciprocal inverse axioms (e.g. part-of, has-part) even though this is meaningless when collapsed to simple triples.

I would argue there is an implicit contract, when we say we are using a graph-based formalism, that the structures in our model correspond to the kinds of graphs we draw on whiteboards when representing an ontology or knowledge graph, and to the kinds of graphs that are useful for computation; the current mapping violates that implicit contract, usually causing a lot of confusion.

It has pragmatic implications too. Writing a SPARQL query that traverses a graph like the one in (B), following certain edge types but not others (one of the most common uses of ontologies in bioinformatics), is a horrendous task!
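To make the contrast concrete, here is a minimal sketch in plain Python (invented node names, edges as simple tuples) of how trivial the traversal becomes once each axiom is a single edge, as in (C). With the blank-node encoding in (B), every hop would need extra owl:onProperty / owl:someValuesFrom lookups:

```python
# With the single-edge encoding, following only part-of edges is a
# straightforward BFS over (subject, predicate, object) tuples.

edges = [
    ("finger", "part-of", "hand"),
    ("hand", "part-of", "forelimb"),
]

def ancestors(node, pred, edges):
    """All nodes reachable from `node` via edges labeled `pred`."""
    out, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for s, p, o in edges:
            if s == n and p == pred and o not in out:
                out.add(o)
                frontier.append(o)
    return out

assert ancestors("finger", "part-of", edges) == {"hand", "forelimb"}
```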

OWL is not the only knowledge representation language

The other reason not to stick with the status quo for semantics for RDF* and PGs is that we may want to go beyond OWL.

OWL is fantastic for the things it was designed for. In the life sciences, it is vital for automatic classification and semantic validation of large ontologies (see half of the posts in this blog site). It is incredibly useful for checking the biological validity of complex instance graphs against our encoded knowledge of the world.

However, not everything we want to say in a Knowledge Graph (KG) can be stated directly in OWL. OWL-DL is based on a fragment of first order logic (FOL); there are certainly things not in that fragment that are useful, but often we have to go outside strict FOL altogether. Much of biological knowledge is contextual and probabilistic. A lot of what we want to say is quantitative rather than qualitative.

For example, when relating a disease to a phenotype (both of which are conventionally modeled as classes, and thus not directly linkable via a property in OWL), it is usually false to say “every person with this disease has this phenotype“. We can invent all kinds of fudges for this – BFO has the concept of a disposition, but this is just a hack for not being able to state probabilistic or quantitative knowledge more directly.

A proposed path forward for semantics in Property Graphs and RDF*

RDF* provides us with an astoundingly obvious way to encode at least some fragment of OWL in a more intuitive way that preserves the graph-like natural encoding of knowledge. Rather than introduce additional blank nodes as in the current OWL to RDF mapping, we simply push the semantics onto the edge label!

Here is an example of how this might look for the axioms in the figure above, in RDF*:

<<:finger :part-of :hand>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
<<:hand :part-of :forelimb>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .

I am assuming the existence of a vocabulary called owlstar here – more on that in a moment.

In any native visualization of RDF* this will end up looking like Fig1C, with the semantics adorning the edges where they belong. For example:


Proposed owlstar mapping of an OWL subclass restriction. This is clearly simpler than the corresponding graph fragment in 2B. While the edge properties (in square brackets) may be too abstract to show an end user (or even a bioinformatician performing graph-theoretic operations), the core edge is meaningful and corresponds to how an anatomist or ordinary person might think of the relationship.

Maybe this is all pretty obvious, and many people loading bio-ontologies into either Neo4j or RDF end up treating edges as edges anyway. You can see the mapping we use in our SciGraph Neo4J OWL Loader, which is used by both Monarch Initiative and NIF Standard projects. The OLS Neo4J representation is similar. Pretty much anyone who has loaded the GO into a graph database has done the same thing, ignoring the OWL to RDF mapping. The same goes for the current wave of Knowledge Graph embedding based machine learning approaches, which typically embed a simpler graphical representation.

So problem solved? Unfortunately, everyone is doing this differently, essentially throwing out OWL altogether. We lack a standard way to map OWL into Property Graphs, so everyone invents their own. This is also true for people using RDF stores: people often have their own custom OWL mapping that is less verbose. In some cases this is semantically dubious, as is the case for the Wikidata mapping.

The simple thing is for everyone to get around a common standard mapping, and RDF* seems a good foundation. Even if you are using plain RDF, you could follow this standard and choose to map edge properties to reified nodes, or to named graphs, or to the Wikidata model. And if you are using a graph database like Neo4J, there is a straightforward mapping to edge properties.

I will call this mapping OWL*, and it may look something like this:

RDF* → OWL interpretation:

  • <<?c ?p ?d>> owlstar:interpretation owlstar:subClassOfSomeValuesFrom .
    → ?c SubClassOf ?p some ?d
  • <<?c ?p ?d>> owlstar:interpretation owlstar:subClassOfQCR ; owlstar:cardinality ?n .
    → ?c SubClassOf ?p exactly ?n ?d
  • <<?c ?p ?d>> owlstar:subjectContextProperty ?cp ; owlstar:subjectContextFiller ?cf ; owlstar:interpretation owlstar:subClassOfSomeValuesFrom .
    → (?c and ?cp some ?cf) SubClassOf ?p some ?d

Note that the core of each of these mappings is a single edge/triple between class c, class d, and an edge label p. The first row is a standard existential restriction common to many ontologies. The second row is for statements such as ‘hand has part 5 fingers’, which is still essentially a link between a hand concept and a finger concept. The third is for a GCI (General Class Inclusion axiom), an advanced OWL concept which turns out to be quite intuitive and useful at the graph level, where we are essentially contextualizing the statement. E.g. in developmentally normal adult humans (context), hand has-part 5 fingers.
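A sketch of how such a mapping table might be applied in code. The owlstar terms here are the vocabulary proposed in this post, not an existing standard, and the function itself is hypothetical; only the first two rows of the table are covered, for brevity:

```python
# Hypothetical sketch: translate one annotated edge plus its owlstar
# edge properties into the Manchester syntax axiom it stands for.
# owlstar:* terms are this post's proposal, not a standard vocabulary.

def edge_to_manchester(s, p, o, props):
    """Translate an annotated edge into a Manchester syntax string."""
    interp = props.get("owlstar:interpretation")
    if interp == "owlstar:SubClassOfSomeValuesFrom":
        return f"{s} SubClassOf {p} some {o}"
    if interp == "owlstar:SubClassOfQCR":
        n = props["owlstar:cardinality"]
        return f"{s} SubClassOf {p} exactly {n} {o}"
    raise ValueError(f"unhandled interpretation: {interp}")

axiom = edge_to_manchester(
    ":finger", ":part-of", ":hand",
    {"owlstar:interpretation": "owlstar:SubClassOfSomeValuesFrom"})
assert axiom == ":finger SubClassOf :part-of some :hand"
```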

When it comes to a complete encoding of all of OWL there may be decisions to be made as to when to introduce blank nodes vs cramming as much into edge properties (e.g. for logical definitions), but even having a standard way of encoding subclass plus quantified restrictions would be a huge boon.

Bonus: Explicit deferral of semantics where required

Many biological relationships expressed in natural language in forms such as “Lmo-2 binds to Elf-2” or “crocodiles eat wildebeest” can cause formal logical modelers a great deal of trouble. See for example “Lmo-2 interacts with Elf-2”: On the Meaning of Common Statements in Biomedical Literature (also slides), which lays out the different ways these seemingly straightforward statements about classes can be modeled. This is a very impressive and rigorous work (I will have more to say on how this aligns with GO-CAM in a future post), and ends with an impressive Wall of Logic:


Dense logical axioms proposed by Schulz & Jansen for representing biological interactions

This is all well and good, but when it comes to storing the biological knowledge in a database, the majority of developers are going to expect to see this:


protein interaction represented as a single edge connecting two nodes, as represented in every protein interaction database

And this is not due to some kind of semantic laziness on their part: representing biological interactions using this graphical formalism (whether we are representing molecular interactions or ecological interactions) allows us to take advantage of powerful graph-theoretic algorithms to analyze data that are frankly much more useful than what we can do with a dense FOL representation.

I am sure this fact is not lost on the authors of the paper who might even regard this as somewhat trivial, but the point is that right now we don’t have a standard way of serializing more complex semantic expressions into the right graphs. Instead we have two siloed groups, one from a formal perspective producing complex graphs with precise semantics, and the other producing useful graphs with no encoding of semantics.

RDF* gives us the perfect foundation for being able to directly represent the intuitive biological statement in a way that is immediately computationally useful, and to adorn the edges with additional triples that more precisely state the desired semantics, whether it is using the Schulz FOL or something simpler (for example, a simple some-some statement is logically valid, if inferentially weak here).

Beyond FOL

There is no reason to have a single standard for specifying semantics for RDF* and PGs. As hinted in the initial example, there could be a vocabulary or series of vocabularies for making probabilistic assertions, either as simple assignments of probabilities or frequencies, e.g.

<<:RhinovirusInfection :has-symptom :RunnyNose>> probstar:hasFrequency 0.75 .

or more complex statements involving conditional probabilities between multiple nodes (e.g. probability of symptom given disease and age of patient), allowing encoding of ontological Bayesian networks and Markov networks.
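Even the simple frequency case is enough to do useful work. Here is a toy sketch in plain Python (all numbers and the :treated-by edge are invented) of scoring a path by the chain rule, under a naive assumption of independence between edges; a real ontological Bayesian network would condition properly:

```python
# Toy sketch: edges carry frequencies; a path is scored by multiplying
# edge weights (naive independence assumption, invented numbers).

freq = {
    (":RhinovirusInfection", ":has-symptom", ":RunnyNose"): 0.75,
    (":RunnyNose", ":treated-by", ":Decongestant"): 0.9,
}

def path_probability(path):
    """Multiply the frequencies of the edges along a path."""
    p = 1.0
    for edge in path:
        p *= freq[edge]
    return p

assert abs(path_probability(list(freq)) - 0.675) < 1e-9
```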

We could also represent contextual knowledge, using a ‘that’ construct borrowed from IKL:

<<:clark_kent owl:sameAs :superman>> a ikl:that ; :believed-by :lois_lane .

which could be visually represented as:


Lois Lane believes Clark Kent is Superman. Here an edge has a link to another node rather than simply literals. Note that while possible in RDF*, in some graph databases such as Neo4j, edge properties cannot point directly to nodes, only indirectly through key properties. In other hypergraph-based graph DBs a direct link is possible.

Proposed Approach

What I propose is a series of lightweight vocabularies such as my proposed OWL*, accompanied by mapping tables such as the one above. I am not sure if W3C is the correct approach, or something more bottom-up. These would work directly in concert with RDF*, and extensions could easily be provided to work with various ways to PG-ify RDF, e.g. reification, Wikidata model, NGs.

The same standard could work for any PG database such as Neo4J. Of course, here we have the challenge of how best to encode IRIs in a framework that does not natively support them, but this is an orthogonal problem.

All of this would be non-invasive and unobtrusive to people already working with these technologies, as the underlying structures used to encode knowledge would likely not change, beyond additional adornment of edges. A perfect stealth standard!

It would help to have some basic tooling around this. I think the following would be straightforward and potentially very useful:

  • Implementation of the OWL* mapping of existing OWL documents to RDF* in tooling – maybe the OWLAPI, although we are increasingly looking to Python for our tooling (stay tuned to hear more on funowl).
  • This could also directly bypass RDF* and go directly to some PG representation, e.g. networkx in Python, or stored directly into Neo4J
  • Some kind of RDF* to Neo4J and SPARQL* to OpenCypher [which I assume will happen independently of anything proposed here]
  • An OWL-RL* reasoner that could demonstrate simple yet powerful and useful queries, e.g. property chaining in Wikidata

A rough sketch of this approach was posted on public-owl-dev to not much fanfare, but, umm, this may not be the right forum for this.

Glossing over the details

For a post about semantics, I am glossing over the semantics a bit, at least from a formal computer science perspective. Yes, of course there are some difficult details to be worked out regarding the extent to which existing RDF semantics can be layered on, and how to make these proposed layers compatible. I’m omitting details here to try and give as simple an overview as possible. It also has to be said that one has to be pragmatic here. People are already making property graphs and RDF graphs conforming to the simple structures I’m describing here. Just look at Wikidata and how it handles (or rather, ignores) OWL. I’m just the messenger here, not some semantic anarchist trying to blow things up. Rather than worrying about whether such and such a fragment of FOL is decidable (which, let’s face it, is not that useful a property in practice), let’s instead focus on coming up with pragmatic standards that are compatible with the way people are already using technology!

Biological Knowledge Graph Modeling Design Patterns

This document provides an overview of two modeling strategies/patterns used for building knowledge graphs and triplestores of core biological ‘knowledge’ (e.g. relations between genes, chemicals, diseases, environments, phenotypes, variants). I call these patterns Knowledge Graph Modeling and OWL (aka logical) modeling. These are complementary and can work together, but I have found it useful to always be aware of the ‘mode’ one is working in.

I don’t have a formal definition of ‘knowledge graph’. I realize it is in part a marketing term, but I think there are some key features that are commonly associated with KGs that may distinguish them from the way I have modeled things in RDF/OWL. In particular KGs are more commonly associated with property graphs and technologies such as Neo4J, and naturally accommodate information on edges (not just provenance, but things that have a semantic impact). In contrast, RDF/OWL modeling will more commonly introduce nodes for these, and place these nodes in the context of an ontology.

I found this slide to be a pretty useful definition of the salient features of a KG (slide from Uber’s Joshua Shinavier from this week’s US2TS meeting):



type and identity of each vertex and edge meaningful to both humans and software; emphasize human understanding; success of graph data models has much to do with psychology; sub-symbolic data sets e.g. ML models are not KGs. KGs can be thought of as a useful medium of exchange between what machines are generating and what humanity would like to consume (Paul Groth)

Some other thoughts on KG from members of the semantic web community:

Here, rather than focusing on a definition I attempt to identify two clusters of modeling patterns. I have found this to be useful for some of the work we have done on different biological data integration, curation, and ontology projects. In particular, for the NCATS Translator project, one of the things we are working on is merging multiple KGs from multiple different teams, where different teams use different technologies (e.g. Neo4J and Triplestores) and where each team builds KGs with different purposes.

I am curious how well these translate to different domains (if at all). The life sciences may be unusual in having so many named entities such as genes and drugs that are in a quantum superposition of being instance-like, named, countable things while at the same time being class-like, repeated things that vary in their instantiation according to context. This ends up having a big impact on data modeling.


Genes have Schrodinger’s cat qualities, with class-like characteristics and instance-like characteristics, depending on how you look at it

Knowledge Graph Modeling Features and Patterns

Rather than start with a definition, I give as an illustrative example a graphic of a schema from a Neo4J database of biological entities (from this tweet from Daniel Himmelstein)


Simple Rules of KGM

  1. Knowledge is represented as a graph in some meaningful way. Any old conversion of data to a neo4j database or RDF does not count. It should be meaningfully connected, with traversals allowing us to broadly see the connectedness of some piece of biology. It should be more than just an ontology, and should include connections between the named entities in the domain. This is not a formal definition: like art, I know it when I see it.
  2. Each node in the graph should correspond to some named thing in the domain; ‘name’ here means either a human-friendly name or a recognized database entity. For example: ‘human Shh’, ‘Fanconi anemia’, ‘metformin’, ‘glial cell’, ‘patient123’, rs34778348
  3. Edges connecting nodes must have a relationship type. (e.g. ‘treats’, ‘has phenotype’, ‘located in’)
  4. Edges should form sentences that are meaningful to a domain scientist or clinician (e.g. ‘ibuprofen treats headache’, ‘Parkinson disease has-feature Tremor’, ‘nucleus part-of cell’)
  5. Inference framework neutral. Inference frameworks include logical deductive reasoning, probabilistic inference, ad-hoc rules. A KG may include edges with probabilities attached with the intent of calculating the probability of subgraphs using the chain rule; or it may include logical quantifiers; or none of the above, and may instead be intended to loosely specify a piece of knowledge (e.g. a classic semantic network)
  6. Commitment to precise logical semantics is not important at this level. This is partially a restatement of the previous rule. Specifically: we do not necessarily care whether ‘ibuprofen’ or ‘human Shh’ is an OWL class or instance (it’s just a node), and we do not require ontological commitment about logical quantification on edges.
  7. Edges can have additional information attached. This includes both generic metadata (provenance, evidence) and also biologically important information. E.g. penetrance for a gene-phenotype edge; positional info for a gene-chromosome association. It can also include logical qualifiers and additional semantics, probabilities, etc. There may be different mechanisms for attaching this information (for neo4j, property graphs; for RDF, named graphs or reification), the mechanism is not so important here.
  8. Graph theoretic operations do useful work. E.g. shortest path between nodes. Spreading activation, random walks. Also knowledge graph machine learning techniques, such as those based off of node embeddings, e.g. Knowledge Graph Completion.
  9. Modeling should follow standard documented design patterns. Relationship types should be mapped to an ontology such as RO or SIO. In the NCATS Translator project, we specify that Entity types and Association types should be catalogued in biolink-model
  10. Ontology relationships modeled as single edges. KGMs frequently include ontologies to assist traversal. Some OWL axioms (e.g. Nucleus SubClassOf part-of some Cell) are encoded as multiple RDF triples – these must be converted to single edges in a KG. Optionally, the specific semantics (i.e OWL quantifier) can be added as an edge property if a property graph is used. See the Proposed strategy for semantics in RDF* and Property Graphs.
  10. A slim ontology of high level upper ontology classes is used for primary classification. Due to de-emphasis on reasoning it is useful to have a primary classification to a small set of classes like gene, protein, disease, etc. In Neo4j these often form the ‘labels’. See the biolink-model node types. The forthcoming OBO-Core project is also a good candidate. Detailed typing information can also be added.

Examples of KGM

Advantages/Disadvantages of KGM

  • Advantage: simplicity and intuitiveness
  • Advantage: visualization
  • Advantage: direct utilization of generic graph algorithms for useful purposes (e.g. shortest path)
  • Advantage: lack of ontological commitment makes agreement on standards like biolink-model easier
  • Disadvantage: less power obtained from OWL deductive reasoning (but transforms are possible, see below)
  • Disadvantage: becomes awkward to model contextual statements and more complex scenarios (e.g. GO-CAMs)

OWL (Logical) Modeling Features and Patterns

Note the assumption here is that we are modeling connections between entities in a domain (what are sometimes called annotations). For developing ontologies themselves, I assume that direct modeling as an OWL TBox using OWL axioms is always best.

Principles of logical modeling

  1. Classes and instances assumed distinct. Punning is valid in OWL2, and is sometimes unavoidable when following a KG pattern layered on RDF/OWL, but I consider its use in a logical modeling context a bad smell.
  2. Many named bio-entities modeled as classes. Example: ‘human Shh gene’, ‘Fanconi anemia’, ‘metformin’, ‘nucleus’; even potentially rs34778348. But not: ‘patient123’.
  3. Classes and Class-level knowledge typically encoded in ontologies within OBO library or equivalent. Example: PD SubClassOf neurodegenerative disease; every nucleus is part of a cell; every digit is part-of some autopod; nothing is part of both a nucleus and a cytoplasm. There are multiple generally agreed upon modeling principles, and general upper ontology agreement here.
  4. Instances and instance-level knowledge typically encoded OUTSIDE ontologies. Example: data about a patient, or a particular tissue sample (although this is borderline, see for example our FANTOM5 ontology)
  5. OWL semantics hold. E.g. if an ontology says chemical A disjoint-with chemical B, and we have a drug class that is a subclass of both, the overall model is incoherent. We are compelled to model things differently (e.g. using has-part)
  6. ‘Standard Annotations’ typically modeled as some-some. The concept of ‘ontology annotation’ in biocuration is typically something like assigning ontology terms to entities in the domain (genes, variants, etc). In the default case ‘annotations’ are assumed to not hold in an all-some fashion. E.g. if we have a GO annotation of protein P to compartment C, we do not interpret this as every instance of P being part of some instance of C. A safe default modeling assumption is some-some, but it is also possible to model in terms of dispositions (which is essentially how the NCIT ontology connects genes to processes and diseases). Note that when all-some is used for modeling we get into odd situations such as interaction relationships needing to be stated reciprocally. See Lmn-2 interacts with Elf-2. On the meaning of common statements in biomedical literature by Stefan Shulz and Ludger Jansen for an extended treatment. Note that in KG modeling, this entire issue is irrelevant.
  7. Reification/NGs typically reserved for metadata/provenance. Reification (e.g. using either rdf or owl vocabs) is reserved for information about the axiom. The same holds when annotating named graphs. In either case, the reified node or the NG is typically not used for biological information (since it would be invisible to the reasoner). Reification-like n-ary patterns may be used to introduce new biological entities for more granular modeling.
  8. Instances typically introduced to ensure logical correctness. A corollary of the above is that we frequently introduce additional instances to avoid incorrect statements. For example, to represent a GO cell component annotation we may introduce an instance p1 of class P and an instance p2 of class C, and directly connect p1 and p2 (implicitly introducing a some-some relationship between P and C). See below for examples.
  9. Instances provide mechanism for stating context. As per previous rule, if we have introduced context-specific instances, we can arbitrarily add more properties to these. E.g. that p1 is phosphorylated, or p1 is located in tissue1 which is an epithelium.
  10. Introduced instances should have IRIs minted. Blank nodes may be formally correct, but providing IRIs has advantages in querying and information management. IRIs may be hashed skolem terms or UUIDs depending on the scenario. We informally designate these as ‘pseudo-anonymous’, in that they are not blank nodes, but share some properties (e.g. typically not assigned a rdfs:label, their IRIs do not correspond 1:1 to a named entity in the literature). Note 1: we use the term ‘introduced instance’ to indicate an instance created by the modeler; we assume for example ‘patient123’ already has an IRI. Note 2: OWL axioms may translate to blank nodes as mandated by the OWL spec.
  11. Deductive reasoning performs useful work. This is a consequence of OWL semantics holding. Deductive (OWL) reasoning should ‘do work’ in the sense of providing useful inferences, either in the form of model checking (e.g. in QC) or in the ability to query for implicit relationships. If reasoning is not performing useful work, it is a sign of ‘pseudo-precision’ or overmodeling, and that precise OWL level modeling may not be called for and a simpler KGM may be sufficient (or that the OWL modeling needs to be changed).

Advantages/Disadvantages of Logical Modeling

  • Advantage: formal correctness and coherency
  • Advantage: reasoning performs useful work
  • Advantage: representing contextual statements naturally
  • Advantage: changing requirements resulting in additional granularity or introduction of context can be handled gracefully by adding to existing structures
  • Disadvantage: Additional nodes and edges in underlying RDF graph
  • Disadvantage: Impedance mismatch when using neo4j or assuming the underlying graph has properties of KGM (e.g. hopping from one named entity to another)

Example: phenotypes

Consider a simple KG for connecting patients to phenotypes. We can make edges:

  • Patient123 rdf:type Human (NCBITaxon class)
  • Patient123 has-phenotype ‘neuron degeneration’ (HP class)
  • Patient123 has-phenotype ‘tremor’ (HP class)
  • Etc

(OWL experts will immediately point out that this induces punning in the OWL model; the Neo4j modeler does not know or care what this is).

Now consider the scenario where we want to produce additional contextual info about the particular kind of neuron degeneration, or temporal information about the tremors; and the ontology does not pre-coordinate the terms we need.

One approach is to add additional properties to the edge. E.g. location, onset. This is often sufficient for simple use cases; clients can choose to ask for additional specificity when required. However, there are advantages to putting the context on the node. Note of course that it is not correct to add an edge

  • ‘neuron degeneration’ has-location ‘striatum’

since we want to talk about the particular ‘neuron degeneration’ happening in the context of patient123, not the class in general. This is where we might want to employ instance-oriented OWL modeling. The pattern would be

  • Patient123 rdf:type Human
  • Patient123 has-phenotype :p1
  • :p1 rdf:type ‘neuron degeneration’
  • :p1 located-in :l1
  • :l1 rdf:type ‘striatum’
  • Patient123 has-phenotype …

This introduces more nodes and edges, but gives a coherent OWL model that can do useful work with reasoning. For example, if a class ‘striatal neuron degeneration’ is later introduced and given an OWL definition, we infer that Patient123 has this phenotype. Additionally, queries (for example, for ‘striatal phenotypes’) will yield the correct answer.

Hybrid Modeling

It is possible to mix these two modes. We can treat the KG layer as being ‘shortcuts’ that optionally compile down to more granular representations. Also, the KG layer can be inferred via reasoning from the more granular layer. Stay tuned for more posts on these patterns…