Announcements · linkml · Ontologies · Standards

Using ontologies within data models and standards

Ontologies are ways of organizing collections of concepts that can be used to annotate data. Ontologies are complemented by schemas, data models, reporting formats and standards, which describe how data should be structured.

As an example of how ontologies are used, the following figure illustrates the schema used for the Data Coordination Center for the ENCODE project (taken from Malladi et al, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4360730/).

Figure 2 from Ontology application and use at the ENCODE DCC (PMC4360730)

You can get a sense for which ontologies are used for which standards on the FAIRsharing site – for example for Uberon:

FAIRsharing usage graph for Uberon (link)

Despite this ubiquity, most ontologies are bound to standards and schemas in a very loose way. Usually there is accompanying free text instructing humans to do something like “use branch B of ontology O for filling in this field”, but these instructions are open to interpretation and don’t lend themselves to automated computational checking, or to driving data entry forms.

Furthermore, many of these ontologies are huge, and selecting the right term can be a challenge for whoever is providing the metadata – a scientist submitting the data, a data wrangler providing first-pass curation, or a biocurator at a knowledge base. There is a missed opportunity to provide selected subsets of the ontology that are relevant to a particular use case. For example, when providing environmental context for a terrestrial soil sample it’s not necessary to use all three thousand terms in ENVO; terms for marine biomes or terms that describe the insides of buildings or humans are not relevant.

And in fact, many schema languages like JSON-Schema lack the ability to bind ontologies to field values. The need for this is demonstrated by this discussion on the JSON-Schema discussion group. Many groups (Human Cell Atlas, AIRR) have developed their own bespoke extensions to JSON-Schema or related formalisms like OpenAPI that allow the schema designer to specify a query for ontology terms that is then executed against a service like the Ontology Lookup Service. Here is an example from AIRR:

        species:
            $ref: '#/Ontology'
            nullable: false
            description: Binomial designation of subject's species
            title: Organism
            example:
                id: NCBITAXON:9606
                label: Homo sapiens
            x-airr:
                miairr: essential
                adc-query-support: true
                set: 1
                subset: subject
                name: Organism
                format: ontology
                ontology:
                    draft: false
                    top_node:
                        id: NCBITAXON:7776
                        label: Gnathostomata

However, none of these JSON-Schema extensions are official, and from the discussion it seems this is unlikely to happen soon. JSON-Schema does offer enums, a common construct in data modeling, which allows field values to be constrained to a simple flat dropdown of strings. Standard enums are not FAIR, because they offer no way to map these strings to standard terms from ontologies.

The clinical informatics community has dealt with the problem of combining ontologies and terminologies with data models for some time, and after various iterations of HL7, the FHIR (Fast Healthcare Interoperability Resources) provides a way to do this via Value Sets. However, the FHIR solution is tightly bound with the FHIR data model which is not always appropriate for modeling non-healthcare data.

Using Ontologies with LinkML

LinkML (Linked Data Modeling Language) is a polymorphic pluralistic data modeling language designed to work in concert with and yet extend and add semantics to existing data modeling solutions such as JSON-Schema and SQL DDL. LinkML provides a simple standard way to describe schemas, data models, reporting standards, data dictionaries, and to optionally adorn these with semantics via IRIs from standard vocabularies and ontologies, ranging from schema.org through to rich OBO ontologies.

From the outset, LinkML has supported the ability to provide “annotated enums” (also known as Value Sets), extending the semantics of enums of JSON-Schema, SQL, and object oriented languages.

For example, the following enum illustrates a set of hardcoded Permissible Values mapped to terms from the GA4GH pedigree standard kinship ontology:

enums:
  FamilialRelationshipType:
    permissible_values:
      SIBLING_OF:
        description: A family relationship where the two members have a parent in common
        meaning: kin:KIN_007
      PARENT_OF:
        description: A family relationship between offspring and their parent
        meaning: kin:KIN_003
      CHILD_OF:
        description: inverse of the PARENT_OF relationship
        meaning: kin:KIN_002

Each permissible value is optionally annotated with a “meaning” tag, which has a CURIE denoting a term from a vocabulary or external resource. Each permissible value can also be adorned with a description, as well as other metadata.

The use of ontology mappings here gives us an interoperability hook – two different schemas can interoperate via the use of shared standard terms, even if they want to present strings that are familiar to a local community.
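As a toy illustration of that hook (the permissible-value maps and records below are made up for the example, not taken from any real schema), two datasets using different local strings can still be joined on the shared kin: terms:

# each schema exposes its own local strings, but both map them to kin: terms via 'meaning'
schema_a_meanings = {"SIBLING_OF": "kin:KIN_007", "PARENT_OF": "kin:KIN_003"}
schema_b_meanings = {"sibling": "kin:KIN_007", "parent": "kin:KIN_003"}

record_a = {"relationship": "SIBLING_OF"}
record_b = {"relationship": "sibling"}

# the local strings differ, but both resolve to the same standard term
assert schema_a_meanings[record_a["relationship"]] == schema_b_meanings[record_b["relationship"]]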

This all works well for relatively small value sets. But what happens if I want to have a field in my schema whose value can be populated by any term from the Eukaryote branch of NCBI Taxonomy? It’s not really practical to include a ginormous list of IDs and terms in a schema – especially if it’s liable to change.

Dynamic Enums

Starting with the new LinkML 1.3 model release today, it is possible to describe dynamic value sets in your schema. Rather than hardcoding a list of terms, you instead provide a declarative statement saying how to construct the value set, allowing binding to terms to happen later.

Let’s say you want a field in your standard to be populated with any of the subtypes of “neuron” in the Cell Ontology. You could do it like this:

enums:
  NeuronTypeEnum:
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000540 ## neuron
      include_self: false
      relationship_types:
        - rdfs:subClassOf

This value set is defined in terms of an ontology graph query: any term that is reachable from the node representing the base neuron class, walking down subClassOf edges.

Dynamic enums can also be composed together, either by nesting boolean expressions to arbitrary depth, or by naming sub-patterns and reusing them.

A more complex example involves the use of two enums: the first is a general one for any kind of disease, and the second both extends and restricts it, extending it to include cancer terms from a different vocabulary and restricting it to exclude non-human animal diseases:

enums:
  Disease:
    reachable_from:
      source_ontology: obo:mondo
      source_nodes:
        - MONDO:0000001 ## disease or disorder
      is_direct: false
      relationship_types:
        - rdfs:subClassOf

  HumanDisease:
    description: Extends the Disease value set, including NCIT neoplasms, excluding non-human diseases
    inherits:
      - Disease
    include:
      - reachable_from:
          source_ontology: obo:ncit
          source_nodes:
            - NCIT:C3262
    minus:
      - reachable_from:
          source_ontology: obo:mondo
          source_nodes:
            - MONDO:0005583 ## non-human animal disease
          relationship_types:
            - rdfs:subClassOf

Tooling support

There is already tooling support for LinkML static enums. When LinkML is used to generate JSON-Schema or SQL DDL, these are mapped to the corresponding constructs (but with loss of metadata).

LinkML-native tools such as DataHarmonizer support enums as hierarchical drop-downs in data entry.

For the newer dynamic enums, the current focus is on standardizing how these are specified, with tooling support expected to follow soon.

This is an area where it is difficult to have a one-size-fits-all solution, due to the variety of use cases. Consider the task of validating data against a schema with dynamic enums. In some cases, you may want to do this at run time, with a dynamic query against a remote ontology server. In other cases you might prefer to avoid a network dependency at validation time, instead opting for either local ontology lookup or pre-materialization of the value set. One way to do the latter is to expand the value set via ontology lookups at the time of compiling to JSON-Schema.

For the ontology lookups there are a variety of options. BioPortal includes almost a thousand ontologies covering most terminological uses in the life sciences, with ontologies in the OntoPortal Alliance covering other domains like agriculture, ecology, and materials science. OLS also hosts a large number of ontologies in the life sciences. But these ontology browsers don’t cover all entities, such as all proteins in UniProt. And for some use cases it may be necessary to use bespoke vocabularies. Tooling support for dynamic enums should cover all these scenarios.

This is something that could be supplied using the new Ontology Access Kit (OAK) library, which provides a unifying layer over multiple different ontology sources: hosted ontology portals like BioPortal, OLS, and Ontobee; local files in a variety of formats; and local and remote SPARQL endpoints, including biological databases as well as Wikidata and Linked Open Vocabularies.
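As a concrete sketch of the pre-materialization route, the NeuronTypeEnum above could be expanded into a static list of permissible values using OAK (a hypothetical usage example based on the OAK interfaces as currently documented; the sqlite:obo:cl selector assumes a prebuilt SQLite build of the Cell Ontology is available for download):

from oaklib import get_adapter

# connect to the Cell Ontology via a prebuilt SQLite build (downloaded and cached on first use)
adapter = get_adapter("sqlite:obo:cl")

# walk down subClassOf edges from 'neuron' (CL:0000540), mirroring reachable_from in the schema
neuron_terms = adapter.descendants(["CL:0000540"], predicates=["rdfs:subClassOf"])

# turn the expanded term list into permissible values that could be written back into a static enum
permissible_values = {adapter.label(t): {"meaning": t} for t in neuron_terms}

The same expansion could instead be run at validation time against a remote endpoint; the schema itself stays the same either way.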

Trying it out

Head over to the LinkML documentation for information and tutorials on how to use LinkML and LinkML enums. Also check out schemasheets, an easy way to author schemas/standards, including use of enums, all using google sheets (or Excel, if you must).

Whether you opt to use LinkML or not, if you are involved in the creation of standards or data models, or if you are involved in the provision of ontologies that are used in these standards, I hope this post is useful for you in thinking about how to structure things in a way that makes these standards more rigorous, easier to use by scientists, and easier to systematically validate and automatically drive interfaces!

Design Patterns · Ontologies · Reasoning · ROBOT

Avoid mixing parthood with cardinality constraints

We frequently have situations where we want to make an assertion combining parthood and cardinality. Here, cardinality pertains to the number of instances.

Section 5.3 of the OWL2 primer has an example of a cardinality restriction:

Individual: John
   Types: hasChild max 4 Parent

This is saying that John has at most 4 children that are parents.

A more natural biological example of a cardinality restriction is stating something like:

Class: Hand
   SubClassOf: hasPart exactly 5 Finger

i.e. every hand has exactly 5 fingers. Remember we are following the open world assumption here – we are not saying that our ontology has to list all 5 fingers. We are stating that in our model of the world, any instance of a hand h entails the existence of 5 distinct finger instances f1…f5, each of which is related to that hand, i.e. fn part-of h. Furthermore there are no other finger instances not in the set f1..5 that are part of h.
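Spelled out informally in first-order terms (a sketch of the intended reading only, not the official notation), the axiom says:

\forall h\, \Big( \mathrm{Hand}(h) \rightarrow \exists f_1 \ldots f_5 \; \big( \bigwedge_{i=1}^{5} \big(\mathrm{Finger}(f_i) \wedge \mathrm{partOf}(f_i, h)\big) \;\wedge\; \bigwedge_{i<j} f_i \neq f_j \;\wedge\; \forall g\, \big(\mathrm{Finger}(g) \wedge \mathrm{partOf}(g, h) \rightarrow \bigvee_{i=1}^{5} g = f_i \big) \big) \Big)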

The precise set-theoretic semantics are provided in the OWL2 Direct Semantics specification.

This cardinality axiom seems like a perfectly natural, reasonable, and useful thing to add to an ontology (avoiding for now discussions about “canonical hands” and “variant hands”, for example in cases like polydactyly; let’s just assume we are describing “canonical anatomy”, whatever that is).

A canonical hand, with 5 fingers

And in fact there are various ontologies in OBO that do this kind of thing.

However, there is a trap here. If you try to run a reasoner such as HermiT you will get a frankly perplexing and unhelpful error such as this:

An error occurred during reasoning: Non-simple property 'BFO_0000051' or its inverse appears in the cardinality restriction 'BFO_0000051 exactly 5 UBERON_1234567'

If you have a background in computer science and you have some time to spare, you can go and look at section 11 of the OWL specification (global restrictions on axioms in OWL2 DL) to see the magical special laws you must adhere to when writing OWL ontologies that conform to the DL profile, but it’s not particularly helpful for understanding:

  • why you aren’t allowed to write this kind of thing, and
  • what the solution is.

Why can’t I say that?

A full explanation is outside the scope of this article, but the basic problem arises when combining transitive properties such as part-of with cardinality constraints. Doing so makes the ontology fall outside the “DL” profile, which means that reasoners such as HermiT can’t use it; rather than ignore the offending axiom, HermiT will complain and refuse to reason.

Well I want to say that anyway, what happens if I do?

You may choose to assert the axiom anyway – after all, it may feel useful for documentation purposes, and people can ignore it if they don’t want it, right? That seems OK to me, but I don’t make the rules.

Even if you don’t intend to stray outside DL, an insidious problem arises here: many OBO ontologies use Elk as their reasoner, and Elk will happily ignore these DL violations (as well as anything it can’t reason over, outside its variant of the EL++ profile). This in itself is totally fine – its inferences are sound/correct, they just might not be complete. However, we have a problem if an ontology includes these DL violations, and the relevant portion of that ontology is extracted and then used as part of another ontology with a DL reasoner such as HermiT that fails fast when presented with these axioms. In most pipelines, if an ontology can’t be reasoned over, it can’t be released, and everything is gummed up until an OWL expert can be summoned to diagnose and fix the problem. Things get worse if an ontology that is an N-th order import has a DL violation, as it may require waiting for all imports in the chain to be re-released. Not good!

Every now and then this happens with an OBO ontology and things get gummed up, and people naturally ask the usual questions: why did this happen, why can’t I say this perfectly reasonable thing, how do I fix it? Hence this article.

How do we stop people from saying that?

Previously we didn’t have a good system for stopping people from making these assertions in their ontologies, and the assertions would leak via imports and imports of imports, and gum things up.

Now we have the wonderful robot tool and the relatively new validate-profile command, which can be run like this:

robot validate-profile --profile DL \
  --input my-ontology.owl \
  --output validation.txt

This will check that the ontology is in the DL profile. If it isn’t, the command will fail, so you can add it to your Makefile in order to fail fast and avoid poisoning downstream ontologies.
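If your release pipeline is scripted in Python rather than driven directly from a Makefile, a thin wrapper gives you the same fail-fast behavior (a sketch; it assumes robot is on your PATH and reuses the flags shown above):

import subprocess

# run the DL profile check; check=True raises CalledProcessError (failing the build) on violations
subprocess.run(
    ["robot", "validate-profile", "--profile", "DL",
     "--input", "my-ontology.owl", "--output", "validation.txt"],
    check=True,
)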

This check will soon be integrated into the standard ODK setup.

OK, so how do I fix these?

So you have inadvertently strayed outside the DL profile and your ontology is full of has-parts with cardinality constraints – you didn’t mean it! You were only alerted when a poor downstream ontology imported your axioms and tried to use a DL reasoner. So what do you do?

In all honesty, my advice to you is ditch that axiom. Toss it in the bin. Can it. Flush that axiom down the toilet. Go on. It won’t be missed. It was already being ignored. I guarantee it wasn’t doing any real work for you (here work == entailments). And I guarantee your users won’t miss it.

A piece of advice often given to aspiring writers is to kill your darlings, i.e. get rid of your most precious and especially self-indulgent passages for the greater good of your literary work. The same applies here. Most complex OWL axioms are self-indulgences.

Even if you think you have a really good use case for having these axioms, such as representing stoichiometry of protein complexes or reaction participants, the chances are that OWL is actually a bad framework for the kind of inference you need, and you should be using some kind of closed-world reasoning system, e.g. one based on datalog.

OK, so maybe you don’t believe me, and you really want to say something that approximates your parts-with-numbers. Well, you can certainly weaken your cardinality restriction to an existential restriction (provided the minimum cardinality is above zero; for maxCardinality of zero you can use a ComplementOf). So in the anatomical example we could say

Class: Hand
   SubClassOf: hasPart some Finger

This is still perfectly sound – it is not as complete as your previous statement, but does that matter? What entailments were you expecting from the cardinality axiom? If this is intended more for humans, you can simply annotate your axiom with a comment indicating that humans typically have 5 fingers.

OK, so you still find this unsatisfactory. You really want to include a cardinality assertion, dammit! Fine, you can have one, but you won’t like it. We reluctantly added a sub-property of has-part to RO called ‘has component’.

In all honesty the definition for this relation is not so great. Picking holes in it is not so hard. It exists purely to get around this somewhat ridiculous constraint, and for you to be able to express your precious cardinality constraint, like this:

Class: Hand
   SubClassOf: hasComponent exactly 5 Finger

So how does this get around the constraint? Simple: hasComponent is not declared transitive (recall that transitivity is not inferred down a property hierarchy). Also, it is a different relation (a subproperty) from has-part, so you might not get the inferences you expect. For example, this does NOT prevent me from making an instance of hand that has as parts 6 distinct fingers – it only applies to the more specific relation, which we have not asserted, and which is not inferred.

You can make an argument that this is worse than useless – it gives no useful entailments, and it confuses people to boot. I am responsible for creating this relation, and I have used it in ontologies like Uberon, but I now think this was a mistake.

Other approaches

There are other approaches. For a fixed set of cardinalities you could create subproperties, e.g. has-1-part-of, has-2-parts-of, etc. But these would still be less expressive than you would like, and would confuse people.

A pattern that does work in certain cases, such as providing logical definitions for things like cells classified by number of nuclei, is to use the EL-shunt pattern (to be covered in a future article) and make characteristic classes in an ontology like PATO.

While this still isn’t as expressive, it allows you to use proxies for cardinality in logical definitions (which do actual work for you), and shunt off the cardinality reasoning to a smaller ontology — where really it’s actually fine to just assert the hierarchy.

But this doesn’t work in all cases. There is no point making characteristics/qualities if they are not reused. It would be silly to do this with the hand example (e.g. making a has5fingers quality).

Isn’t this all a bit annoying?

Yes, it is. In my mind we should be free to state whatever axioms we need for our own use cases, expressivity be damned. I should be able to go outside DL, in fact to use anything from FOL or even beyond. We should not be condemned to live in fear of stepping outside of decidability (which sounds like something you want, but in practice is not useful). There are plenty of good strategies for employing a hybrid reasoning strategy, and in any case, we never use all of DL for most large OBO ontologies anyway.

But we have the technology we do, and we have to live with it for now.

TL;DR

  • don’t mix parthood and cardinality
  • you probably won’t miss that cardinality restriction anyway
  • no really, you won’t
  • use robot to check your profiles

Design Patterns · Ontologies · Ontology Development Kit · Standards

Aligning Design Patterns across Multiple Ontologies in the Life Sciences

I was delighted to give the keynote at the ISWC 2020 Workshop on Ontology Design and Patterns today. You can see a video or my slides; I’m including a brief summary here.

Opening slide: Aligning Design Patterns Across Multiple Ontologies in the Life Sciences

As this was a CS audience that may be unfamiliar with some of the issues we are tackling in OBO, I framed this in terms of the massive number of named entities in the life sciences, all of which have to be categorized if we are to be able to find, integrate, and analyze data:

there are many named things in the life sciences

We created OBO to help organize and integrate the ontologies used to categorize these things:

OBO: social and technological framework to categorize all the things

When we started, many ontologies were quite ‘SKOS-like’ in their nature, with simple axiomatization, and a lack of integration:

GO and other ontologies were originally SKOS-like

OWL gives us a framework for more modular development, leveraging other ontologies, and using reasoning to automate classification:

OWL reasoning can be used to maximize leverage in a modular approach: here re-using CHEBI’s chemical classification in GO

This is all great, but when I look at many ontologies I often see two problems, often in the same ontology – under- and over-axiomatization:

Finding the balance between under and over axiomatization

In some ontologies I see what I sometimes call ‘Rococo OWL’: over-axiomatization in an elaborate, dense set of axioms that look impressive but don’t deliver much functional utility (I plan to write about this in a future post).

Rococo: an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe l’oeil frescoes to create surprise and the illusion of motion and drama. The style was highly theatrical, designed to impress and awe at first sight. a movement that extolled frivolity, luxury and dilettantism, patronised by a corrupt and decadent ancien régime. Rococo ended in the revolution of 1789, with the bloody end of a political and economic system

We developed Dead Simple OWL Design Patterns (DOSDPs) to make it easier to write down and reuse common OWL patterns of writing definitions, primarily for compositional classes following the Rector Normalization pattern.

Example DOSDP yaml file

I gave an example of how we used DOSDPs to align logical definitions across multiple phenotype ontologies (the uPheno reconciliation project). I would like to expand on this in a future post.

multiple phenotype databases for humans and model organisms, and their respective vocabularies/ontologies

I finished with some open-ended questions about where we are going and whether we can try and unify different modeling frameworks that tackle things from different perspectives (closed-world shape-based or object-oriented modeling, template/pattern frameworks, lower level logical modeling frameworks).

Unifying multiple frameworks – is it possible or advisable?

Unfortunately due to lack of time I didn’t go into either ROBOT templates or OTTR templates.

And in closing, to emphasize that the community and social aspects are as important or more important than the technology:

Take homes

And some useful links:

  • http://obofoundry.org/resources
  • https://github.com/INCATools/
    ○ ontology-development-kit
    ○ dead_simple_owl_design_patterns (dos-dps)
    ○ dosdp-tools

Special thanks to everyone in OBO and the uPheno reconciliation effort, especially David Osumi-Sutherland, Jim Balhoff, and Nico Matentzoglu.

And thanks to Pascal Hitzler and Cogan Shimizu for putting together such a great workshop.

Knowledge Graphs · Ontologies · Reasoning · Standards · Uncategorized

Proposed strategy for semantics in RDF* and Property Graphs

Update 2020-09-12: I created a GitHub repo that concretizes part of the proposal here https://github.com/cmungall/owlstar

Graph databases such as Neo4J are gaining in popularity. These are in many ways comparable to RDF databases (triplestores), but I will highlight three differences:

  1. The underlying datamodel in most graph databases is a Property Graph (PG). This means that information can be directly attached to edges. In RDF this can only be done indirectly via reification, or reification-like models, or named graphs.
  2. RDF is based on open standards, and comes with a standard query language (SPARQL), whereas a unified set of standards has yet to arrive for PGs.
  3. RDF has a formal semantics, and languages such as OWL can be layered on providing more expressive semantics.

RDF* (and its accompanying query language SPARQL*) is an attempt to bring PGs into RDF, thus providing an answer for points 1-2. More info can be found in this post by Olaf Hartig.

You can find more info in that post and in related docs, but briefly, RDF* adds syntax for attaching properties directly onto edges, e.g.:

<<:bob foaf:friendOf :alice>> ex:certainty 0.9 .

This has a natural visual cognate:

The friend-of edge between :bob and :alice drawn as a graph, with the certainty value attached directly to the edge

We can easily imagine building this out into a large graph of friend-of connections, or connecting other kinds of nodes, and keeping additional useful information on the edges.

But what about the 3rd item, semantics?

What about semantics?

For many in both linked data/RDF and in graph database/PG camps, this is perceived as a minor concern. In fact you can often find RDF people whinging about OWL being too complex or some such. The “semantic web” has even been rebranded as “linked data”. But in fact, in the life sciences many of us have found OWL to be incredibly useful, and being able to clearly state what your graphs mean has clear advantages.

OK, but then why not just use what we have already? OWL-DL already has a mapping to RDF, and any document in RDF is automatically an RDF* document, so problem solved?

Not quite. There are two issues with continuing the status quo in the world of RDF* and PGs:

  1. The mapping of OWL to RDF can be incredibly verbose and leads to unintuitive graphs that inhibit effective computation.
  2. OWL is not the only fruit. It is great for the use cases it was designed for, but there are other modes of inference and other frameworks beyond first-order logic that people care about.

Issues with existing OWL to RDF mapping

Let’s face it, the existing mapping is pretty ugly. This is especially true for life-science ontologies that are typically conceived of as relational graphs, where edges are formally SubClassOf-SomeValuesFrom axioms. See the post on obo json for more discussion of this. The basic idea here is that in OWL, object properties connect individuals (e.g. my left thumb is connected to my left hand via part-of). In contrast, classes are not connected directly via object properties; rather they are related via subClassOf and class expressions. It is not meaningful in OWL to say “finger (class) part_of hand (class)”. Instead we seek to say “all instances of finger are part_of some x, where x is an instance of a hand”. In Manchester Syntax this has the compact form:

Finger SubClassOf Part_of some Hand

This is translated to RDF as:

:Finger rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty :part_of ;
    owl:someValuesFrom :Hand
] .

As an example, consider 3 classes in an anatomy ontology, finger, hand, and forelimb, all connected via part-ofs (i.e. every finger is part of some hand, and every hand is part of some forelimb). This looks sensible when we use a native OWL syntax, but when we encode it as RDF we get a monstrosity:

Fig2 (A) two axioms written in Manchester Syntax describing anatomical relationship between three structures (B) corresponding RDF following official OWL to RDF mapping, with 4 triples per existential axiom, and the introduction of two blank nodes (C) How the axioms are conceptualized by ontology developers, domain experts and how most browsers render them. The disconnect between B and C is an enduring source of confusion among many.

This ugliness was not the result of some kind of perverse decision by the designers of the OWL specs, it’s a necessary consequence of the existing stack which bottoms out at triples as the atomic semantic unit.

In fact, in practice many people employ some kind of simplification and bypass the official mapping, storing the edges as simple triples, even though this is semantically invalid. We can see this for example in how Wikidata loads OBOs into its triplestore. This can cause confusion: for example, WD stores reciprocal inverse axioms (e.g. part-of, has-part) even though these are meaningless when collapsed to simple triples.

I would argue there is an implicit contract when we say we are using a graph-based formalism that the structures in our model correspond to the kinds of graphs we draw on whiteboards when representing an ontology or knowledge graph, and the kinds of graphs that are useful for computation; the current mapping violates that implicit contract, usually causing a lot of confusion.

It also has pragmatic implications. Writing a SPARQL query that traverses a graph like the one in (B), following certain edge types but not others (one of the most common uses of ontologies in bioinformatics), is a horrendous task!
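To make the awkwardness concrete, here is roughly what it takes just to recover the simple part-of edges from the official mapping using rdflib (a sketch; anatomy.owl is a hypothetical local RDF/XML file containing the axioms above):

from rdflib import Graph, RDF, RDFS, OWL

g = Graph()
g.parse("anatomy.owl", format="xml")  # hypothetical local copy of the ontology

# dig the "C part_of D" edges out from behind the blank-node Restriction pattern
for c, _, r in g.triples((None, RDFS.subClassOf, None)):
    if (r, RDF.type, OWL.Restriction) in g:
        p = g.value(r, OWL.onProperty)
        d = g.value(r, OWL.someValuesFrom)
        if p is not None and d is not None:
            print(c, p, d)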

OWL is not the only knowledge representation language

The other reason not to stick with the status quo for semantics for RDF* and PGs is that we may want to go beyond OWL.

OWL is fantastic for the things it was designed for. In the life sciences, it is vital for automatic classification and semantic validation of large ontologies (see half of the posts in this blog site). It is incredibly useful for checking the biological validity of complex instance graphs against our encoded knowledge of the world.

However, not everything we want to say in a Knowledge Graph (KG) can be stated directly in OWL. OWL-DL is based on a fragment of first order logic (FOL); there are certainly things not in that fragment that are useful, but often we have to go outside strict FOL altogether. Much of biological knowledge is contextual and probabilistic. A lot of what we want to say is quantitative rather than qualitative.

For example, when relating a disease to a phenotype (both of which are conventionally modeled as classes, and thus not directly linkable via a property in OWL), it is usually false to say “every person with this disease has this phenotype“. We can invent all kinds of fudges for this – BFO has the concept of a disposition, but this is just a hack for not being able to state probabilistic or quantitative knowledge more directly.

A proposed path forward for semantics in Property Graphs and RDF*

RDF* provides us with an astoundingly obvious way to encode at least some fragment of OWL in a more intuitive way that preserves the natural graph-like encoding of knowledge. Rather than introduce additional blank nodes as in the current OWL to RDF mapping, we simply push the semantics onto the edge label!

Here is an example of how this might look for the axioms in the figure above, in RDF*:

<<:finger :part-of :hand>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .
<<:hand :part-of :forelimb>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .

I am assuming the existence of a vocabulary called owlstar here – more on that in a moment.

In any native visualization of RDF* this will end up looking like Fig 2C, with the semantics adorning the edges where they belong. For example:

Proposed owlstar mapping of an OWL subclass restriction. This is clearly simpler than the corresponding graph fragment in 2B. While the edge properties (in square brackets) may be too abstract to show an end user (or even a bioinformatician performing graph-theoretic operations), the core edge is meaningful and corresponds to how an anatomist or ordinary person might think of the relationship.

Maybe this is all pretty obvious, and many people loading bio-ontologies into either Neo4j or RDF end up treating edges as edges anyway. You can see the mapping we use in our SciGraph Neo4J OWL Loader, which is used by both Monarch Initiative and NIF Standard projects. The OLS Neo4J representation is similar. Pretty much anyone who has loaded the GO into a graph database has done the same thing, ignoring the OWL to RDF mapping. The same goes for the current wave of Knowledge Graph embedding based machine learning approaches, which typically embed a simpler graphical representation.

So problem solved? Unfortunately, everyone is doing this differently, and is essentially throwing out OWL altogether. We lack a standard way to map OWL into Property Graphs, so everyone invents their own. This is also true for people using RDF stores: people often have their own custom OWL mapping that is less verbose. In some cases this is semantically dubious, as is the case for the Wikidata mapping.

The simple thing is for everyone to rally around a common standard mapping, and RDF* seems a good foundation. Even if you are using plain RDF, you could follow this standard and choose to map edge properties to reified nodes, or to named graphs, or to the Wikidata model. And if you are using a graph database like Neo4J, there is a straightforward mapping to edge properties.

I will call this mapping OWL*, and it may look something like this:

RDF* → OWL interpretation:

  • <<?c ?p ?d>> owlstar:interpretation owlstar:subClassOfSomeValuesFrom
    → ?c SubClassOf ?p some ?d
  • <<?c ?p ?d>> owlstar:interpretation owlstar:subClassOfQCR, owlstar:cardinality ?n
    → ?c SubClassOf ?p exactly ?n ?d
  • <<?c ?p ?d>> owlstar:subjectContextProperty ?cp, owlstar:subjectContextFiller ?cf, owlstar:interpretation owlstar:subClassOfSomeValuesFrom
    → (?c and ?cp some ?cf) SubClassOf ?p some ?d

Note that the core of each of these mappings is a single edge/triple between class c, class d, and an edge label p. The first row is a standard existential restriction common to many ontologies. The second row is for statements such as ‘hand has part 5 fingers’, which is still essentially a link between a hand concept and a finger concept. The third is for a GCI, an advanced OWL construct which turns out to be quite intuitive and useful at the graph level, where we are essentially contextualizing the statement, e.g. in developmentally normal adult humans (context), hand has-part 5 finger.
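To illustrate the first row, producing this serialization from a set of simple class-to-class edges is little more than string templating (a sketch; owlstar is the proposed, not-yet-standard vocabulary, and I follow the owlstar:hasInterpretation spelling used in the earlier example):

# simple existential edges, as an ontology developer would draw them on a whiteboard
edges = [
    (":finger", ":part-of", ":hand"),
    (":hand", ":part-of", ":forelimb"),
]

# emit one RDF* statement per edge, tagging the edge with its OWL interpretation
for s, p, o in edges:
    print(f"<<{s} {p} {o}>> owlstar:hasInterpretation owlstar:SubClassOfSomeValuesFrom .")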

When it comes to a complete encoding of all of OWL there may be decisions to be made as to when to introduce blank nodes vs cramming as much into edge properties (e.g. for logical definitions), but even having a standard way of encoding subclass plus quantified restrictions would be a huge boon.

Bonus: Explicit deferral of semantics where required

Many biological relationships expressed in natural language in forms such as “Lmo-2 binds to Elf-2” or “crocodiles eat wildebeest” can cause formal logical modelers a great deal of trouble. See for example the paper “Lmo-2 interacts with Elf-2” – On the Meaning of Common Statements in Biomedical Literature (also slides), which lays out the different ways these seemingly straightforward statements about classes can be modeled. This is a very impressive and rigorous work (I will have more to say on how this aligns with GO-CAM in a future post), and it ends with an impressive Wall of Logic:

Dense logical axioms proposed by Schulz & Jansen for representing biological interactions

This is all well and good, but when it comes to storing the biological knowledge in a database, the majority of developers are going to expect to see this:

protein interaction represented as a single edge connecting two nodes, as represented in every protein interaction database

And this is not due to some kind of semantic laziness on their part: representing biological interactions using this graphical formalism (whether we are representing molecular interactions or ecological interactions) allows us to take advantage of powerful graph-theoretic algorithms to analyze data that are frankly much more useful than what we can do with a dense FOL representation.

I am sure this fact is not lost on the authors of the paper who might even regard this as somewhat trivial, but the point is that right now we don’t have a standard way of serializing more complex semantic expressions into the right graphs. Instead we have two siloed groups, one from a formal perspective producing complex graphs with precise semantics, and the other producing useful graphs with no encoding of semantics.

RDF* gives us the perfect foundation for being able to directly represent the intuitive biological statement in a way that is immediately computationally useful, and to adorn the edges with additional triples that more precisely state the desired semantics, whether it is using the Schulz FOL or something simpler (for example, a simple some-some statement is logically valid, if inferentially weak here).

Beyond FOL

There is no reason to have a single standard for specifying semantics for RDF* and PGs. As hinted in the initial example, there could be a vocabulary or series of vocabularies for making probabilistic assertions, either as simple assignments of probabilities or frequencies, e.g.

<<:RhinovirusInfection :has-symptom :RunnyNose>> probstar:hasFrequency 0.75 .

or more complex statements involving conditional probabilities between multiple nodes (e.g. probability of symptom given disease and age of patient), allowing encoding of ontological Bayesian networks and Markov networks.

We could also represent contextual knowledge, using a ‘that’ construct borrowed from IKL:

<<:clark_kent owl:sameAs :superman>> a ikl:that ; :believed-by :lois_lane .

which could be visually represented as:

Lois Lane believes Clark Kent is Superman. Here an edge has a link to another node rather than simply literals. Note that while possible in RDF*, in some graph databases such as Neo4j, edge properties cannot point directly to nodes, only indirectly through key properties. In other hypergraph-based graph DBs a direct link is possible.

Proposed Approach

What I propose is a series of lightweight vocabularies such as my proposed OWL*, accompanied by mapping tables such as the one above. I am not sure if W3C is the correct approach, or something more bottom-up. These would work directly in concert with RDF*, and extensions could easily be provided to work with various ways to PG-ify RDF, e.g. reification, Wikidata model, NGs.

The same standard could work for any PG database such as Neo4J. Of course, here we have the challenge of how best to encode IRIs in a framework that does not natively support them, but this is an orthogonal problem.

All of this would be non-invasive and unobtrusive to people already working with these technologies, as the underlying structures used to encode knowledge would likely not change, beyond additional adornment of edges. A perfect stealth standard!

It would help to have some basic tooling around this. I think the following would be straightforward and potentially very useful:

  • Implementation of the OWL* mapping of existing OWL documents to RDF* in tooling – maybe the OWLAPI, although we are increasingly looking to Python for our tooling (stay tuned to hear more on funowl).
  • This could also directly bypass RDF* and go directly to some PG representation, e.g. networkx in Python, or stored directly into Neo4J
  • Some kind of RDF* to Neo4J and SPARQL* to OpenCypher [which I assume will happen independently of anything proposed here]
  • An OWL-RL* reasoner that could demonstrate simple yet powerful and useful queries, e.g. property chaining in Wikidata

A rough sketch of this approach was posted on public-owl-dev to not much fanfare, but, umm, this may not be the right forum for this.

Glossing over the details

For a post about semantics, I am glossing over the semantics a bit, at least from a formal computer science perspective. Yes, of course, there are some difficult details to be worked out regarding the extent to which existing RDF semantics can be layered on, and how to make these proposed layers compatible. I’m omitting details here to try and give as simple an overview as possible. It also has to be said, one has to be pragmatic here. People are already making property graphs and RDF graphs conforming to the simple structures I’m describing here. Just look at Wikidata and how it handles (or rather, ignores) OWL. I’m just the messenger here, not some semantic anarchist trying to blow things up. Rather than worrying about whether such and such a fragment of FOL is decidable (which, let’s face it, is not that useful a property in practice), let’s instead focus on coming up with pragmatic standards that are compatible with the way people are already using technology!

OBO-Format · Ontologies · OWLAPI · Tools · Uncategorized

A developer-friendly JSON exchange format for ontologies

OWL2 ontologies can be rendered using a number of alternate concrete forms / syntaxes:

  • Manchester Syntax
  • Functional Syntax
  • OWL-XML
  • RDF/XML
  • RDF/Turtle

All of the above are official W3C recommendations. If you aren’t that familiar with these formats and the differences between them, the W3C OWL Primer is an excellent starting point. While all of the above are semantically equivalent ways to serialize OWL (with the exception of Manchester, which cannot represent some axiom types), there are big pragmatic differences in the choice of serialization. For most developers, the most important differentiating factor is support for their language of choice.

Currently, the only language I am aware of with complete support for all serializations is Java, in the form of the OWLAPI. This means that most heavy-duty ontology applications use Java or a JVM language (see previous posts for some examples of JVM frameworks that wrap the OWLAPI).

Almost all programming languages have support for RDF parsers, which is one reason why the default serialization for OWL is usually an RDF one. In theory it makes it more accessible. However, RDF can be a very low level way to process ontologies. For certain kinds of operations, such as traversing a simple subClassOf hierarchy, it can be perfectly fine. However, even commonly encountered constructs such as “X SubClassOf part-of some Y” are very awkward to handle, involving blank nodes (see the translation here). When it comes to something like axiom annotations (common in OBO ontologies), things quickly get cumbersome. It must be said though that using an RDF parser is always better than processing an RDF/XML file using an XML parser. This is two levels of abstraction too low, never do this! You will go to OWL hell. At least you will not be in the lowest circle – this is reserved for people who parse RDF/XML using an ad-hoc perl regexp parser.

Even in JVM languages, an OWL-level abstraction can be less than ideal for some of the operations people want to do on a biological ontology. These operations include:

  • construct and traverse a graph constructed from SubClassOf axioms between either pairs of named classes, or named-class to existential restriction pairs
  • create an index of classes based on a subset of lexical properties, such as labels and synonyms
  • Generate a simple term info page for showing in a web application, with common fields like definition prominently shown, with full attribution for all axioms
  • Extract some subset of the ontology

It can be quite involved doing even these simple operations using the OWLAPI. This is not to criticize the OWLAPI – it is an API for OWL, and OWL is in large part a syntax for writing set-theoretic expressions constraining a world of models. This is a bit of a cognitive mismatch for a hierarchy of lexical objects, or a graph-based organization of concepts, which is the standard  abstraction for ontologies in Bioinformatics.

There are some libraries that provide useful convenience abstractions – this was one of the goals of OWLTools, as well as The Brain. I usually recommend a library such as one of these for bioinformaticians wishing to process OWL files, but it’s not ideal for everyone. It introduces yet another layer, and still leaves out non-JVM users.

For cases where we want to query over ontologies already loaded in a database or registry, there are some good abstraction layers – SciGraph provides a bioinformatician-friendly graph level / Neo4J view over OWL ontologies. However, sometimes it’s still necessary to have a library to parse an ontology fresh off the filesystem with no need to start up a webservice and load in an ontology.

What about OBO format?

Of course, many bioinformaticians are blissfully unaware of OWL and just go straight to OBO format, a format devised by and originally for the Gene Ontology. And many of these bioinformaticians seem reasonably content to continue using this – or at least lack the activation energy to switch to OWL (despite plenty of encouragement).

One of the original criticisms of Obof was its lack of formalism, but now Obof has a defined mapping to OWL, and that mapping is implemented in the OWLAPI. Protege can load and save Obof just as if it were any other OWL serialization, which it effectively is (without the W3C blessing). It can only represent a subset of OWL, but that subset is actually a superset of what most consumers need. So what’s the problem in just having Obof as the bioinformaticians’ format, and ontologists using OWL for heavy duty ontology lifting?

There are a few:

  • It’s ridiculously easy to create a hacky parser for some subset of Obof, but it’s surprisingly hard to get it right. Many of the parsers I have seen are implemented based on the usual bioinformatics paradigm of ignoring the specs and inferring a format from a few examples. These have a tendency to proliferate, as it’s easier to write your own than to figure out whether someone else’s fits your needs. Even with the better ones, there are always edge cases that don’t conform to expectations. We often end up having to normalize Obof output in certain ways to avoid breaking crappy parsers.
  • The requirement to support Obof leads to cases of the tail wagging the dog, whereby ontology producers will make some compromise to avoid alienating a certain subset of users
  • Obof will always support the same subset of OWL. This is probably more than what most people need, but there are frequently situations where it would be useful to have support for one extra feature – perhaps blank nodes to support one level of nesting in an expression.
  • The spec is surprisingly complicated for what was intended to be a simple format. This can lead to traps.
  • The mapping between CURIE-like IDs and semantic web URIs is awkwardly specified and leads to no end of confusion when the semantic web world wants to talk to the bio-database world. Really we should have reused something like JSON-LD contexts up front. We live and learn.
  • Really, there should be no need to write a syntax-level parser. Developers expect something layered on XML or JSON these days (more so the latter).

What about JSON-LD?

A few years ago I asked on the public-owl-dev list if there were a standard JSON serialization for OWL. This generated some interesting discussion, including a suggestion to use JSON-LD.

I still think that this is the wrong level of abstraction for many OWL ontologies. JSON-LD is great and we use it for many instance-level representations, but it suffers from the same issues that all RDF layerings of OWL face: they are too low level for certain kinds of OWL axioms. Also, JSON-LD is a bit too open-ended for some developers, as graph labels are mapped directly to JSON keys, making it hard to map.

Another suggestion on the list was to use a relatively straightforward mapping of something like functional/abstract syntax to JSON. This is a great idea and works well if you want to implement something akin to the OWL API for non-JVM languages. I still think that such a format is important for increasing uptake of OWL, and hope to see this standardized.

However, we’re still back at the basic bioinformatics use case, where an OWL-level abstraction doesn’t make so much sense. Even if we get an OWL-JSON, I think there is still a need for an “OBO-JSON”, a JSON that can represent OWL constructs, but with a mapping to structures that correspond more closely to the kinds of operations like traversing a TBox-graph that are common in life sciences applications.

A JSON graph-oriented model for ontologies

After kicking this back and forth for a while we have a proposal for a graph-oriented JSON model for OWL, tentatively called obographs. It’s available at https://github.com/geneontology/obographs

The repository contains the start of documentation on the structural model (which can be serialized as JSON or YAML), plus java code to translate an OWL ontology to obograph JSON or YAML.

Comments are more than welcome, here or in the tracker. But first some words concerning the motivation here.

The overall goal was to make it easy to do the 99% of things that bioinformatics developers usually do, without throwing the 1% under the bus. Although it is not yet a complete representation of OWL, the initial design allows for extension in this direction.

One consequence of this is that the central object is an existential graph (I’ll get to that term in a second). We call this subset Basic OBO Graphs, or BOGs, roughly corresponding to the OBO-Basic subset of OBO Format. The edge model is pretty much identical to every directed graph model out there: a set of nodes and a set of directed labeled edges (more on what can be attached to the edges later). Here is an example of a subset of two connected classes from Uberon:

"nodes" : [
    {
      "id" : "UBERON:0002102",
      "lbl" : "forelimb"
    }, {
      "id" : "UBERON:0002101",
      "lbl" : "limb"
    }
  ],
  "edges" : [
    {
      "subj" : "UBERON:0002102",
      "pred" : "is_a",
      "obj" : "UBERON:0002101"
    }
  ]

So what do I mean by existential graph? This is the graph formed by SubClassOf axioms that connect named classes to either named classes or simple existential restrictions. Here is the mapping (shown using the YAML serialization – if we exclude certain fields like dates then JSON is a straightforward subset, so we can use YAML for illustrative purposes):

Class: C
  SubClassOf: D

==>

edges:
 - subj: C
   pred: is_a
   obj: D

Class: C
  SubClassOf: P some D

==>

edges:
 - subj: C
   pred: P
   obj: D

These two constructs correspond to is_a and relationship tags in Obof. This is generally sufficient as far as logical axioms go for many applications. The assumption here is that these axioms are complete to form a non-redundant existential graph.

What about the other logical axiom and construct types in OWL? Crucially, rather than following the path of a direct RDF mapping and trying to cram all axiom types into a very abstract graph, we introduce new objects for increasingly exotic axiom types – supporting the 1% without making life difficult for the 99%. For example, AllValuesFrom expressions are allowed, but these don’t get placed in the main graph, as typically these do not get operated on in the same way in most applications.

What about non-logical axioms? We use an object called Meta to represent any set of OWL annotations associated with an edge, node or graph. Here is an example (again in YAML):

  - id: "http://purl.obolibrary.org/obo/GO_0044464"
    meta:
      definition:
        val: "Any constituent part of a cell, the basic structural and functional\
          \ unit of all organisms."
        xrefs:
        - "GOC:jl"
      subsets:
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goantislim_grouping"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gosubset_prok"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#goslim_pir"
      - "http://purl.obolibrary.org/obo/go/subsets/nucleus#gocheck_do_not_annotate"
      xrefs:
      - val: "NIF_Subcellular:sao628508602"
      synonyms:
      - pred: "hasExactSynonym"
        val: "cellular subcomponent"
        xrefs:
        - "NIF_Subcellular:sao628508602"
      - pred: "hasRelatedSynonym"
        val: "protoplast"
        xrefs:
        - "GOC:mah"
    type: "CLASS"
    lbl: "cell part"

 

Meta objects can also be attached to edges (corresponding to OWL axiom annotations), or at the level of a graph (corresponding to ontology annotations). Oh, but we avoid the term annotation, as that always trips up people not coming from a deep semweb/OWL background.

As can be seen, commonly used OBO annotation properties get their own top-level tag within a meta object, but other annotations go into a generic object.

BOGs and ExOGs

What about the 1%? Additional fields can be used, turning the BOG into an ExOG (Expressive OBO graph).

Here is an example of a construct that is commonly used in OBOs, primarily used for the purposes of maintaining an ontology, but increasingly used for doing more advanced discovery-based inference:

Class: C
EquivalentTo: G1 and ... and Gn and (P1 some D1) and ... and (Pm some Dm)

Where all variables refer to named entities (C, Gi and Di are classes, Pi are Object Properties)

We translate to:

 nodes: ...
 edges: ...
 logicalDefinitionAxioms:
  - definedClassId: C
    genusIds: [G1, ..., Gn]
    restrictions:
    - propertyId: P1 
      fillerId: D1
    - ...
    - propertyId: Pm 
      fillerId: Dm

Note that the above transform is not expressive enough to capture all equivalence axioms. Again the idea is to have a simple construct for the common case, and fall-through to more generic constructs.

Identifiers and URIs

Currently all the examples in the repo use complete URIs, but this is in progress. The idea is that the IDs commonly used in bioinformatics databases (e.g. GO:0008150) can be supported, but the mapping to URIs can be made formal and unambiguous through the use of an explicit JSON-LD context, and one or more default contexts. See the prefixcommons project for more on this. See also the prefixes section of the ROBOT docs.
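To give a flavor of what an explicit context buys you, CURIE-to-URI expansion then becomes a simple lookup against a prefix map (a sketch; the map below is illustrative, in practice it would be loaded from a shared JSON-LD context):

# illustrative prefix map; a real one would come from a shared JSON-LD context
prefix_map = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

def expand(curie: str) -> str:
    """Expand a CURIE like GO:0008150 into a full URI using the context."""
    prefix, local_id = curie.split(":", 1)
    return prefix_map[prefix] + local_id

print(expand("GO:0008150"))  # http://purl.obolibrary.org/obo/GO_0008150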

Documentation and formal specification

There is as yet no formal specification. We are still exploring possible shapes for the serialization. However, the documentation and examples provided in the repository should be sufficient for developers to grok things fairly quickly, and for OWL folks to get a sense of where we are going.

Tools

The GitHub repo also houses a reference implementation in Java, plus an OWL to JSON converter script (reverse is not yet implemented). The java implementation can be used as an object model in its own right, but the main goal here is to make a serialization that is easy to use from any language.

Even without a dedicated API, operations are easy with most languages. For example, in python to create a mapping of ids to labels:

import json

with open('foo.json') as f:
    gdoc = json.load(f)

# build a map from class IDs to their labels
lmap = {}
for g in gdoc["graphs"]:
    for n in g["nodes"]:
        lmap[n["id"]] = n.get("lbl")

Admittedly this particular operation is relatively easy with rdflib, but other operations become more awkward (not to mention the disappointingly slow performance of rdflib).
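For graph-level operations, the same JSON drops straight into a general-purpose graph library such as networkx (a sketch, reusing the hypothetical foo.json and the Uberon example from earlier):

import json
import networkx as nx

with open('foo.json') as f:
    gdoc = json.load(f)

# build a directed graph from the BOG edges, keeping the predicate as an edge attribute
dg = nx.DiGraph()
for g in gdoc["graphs"]:
    for e in g["edges"]:
        dg.add_edge(e["subj"], e["obj"], pred=e["pred"])

# nodes reachable from 'forelimb' by following child-to-parent edges, i.e. its ontology ancestors
# (networkx calls reachable nodes "descendants")
print(nx.descendants(dg, "UBERON:0002102"))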

There are a number of applications that already accept obographs. The central graph representation (the BOG) corresponds to a bbop-graph. This is the existential graph representation we have been using internally in GO and Monarch. The SciGraph API sends back bbop-graph objects as default.

Some additional new pieces of software supporting obographs:

  • noctua-reasoner – a javascript reasoner supporting a subset of OWL-RL, intended for client-side reasoning in browsers
  • obographviz – generation of dot files (and pngs etc) from obographs, allowing many of the same customizations as blipkit

Status

At this stage I am interested in comments from a wider community, both in the bioinformatics world, and in the semweb world.

Hopefully the former will find it useful, and it will help wean people off of oboformat (to help this, ontology release tools like ROBOT and OWLTools already support or will soon support obograph output, and we can include a json file for every OBO Library ontology as part of the central OBO build).

And hopefully the latter will not be offended too much by the need to add yet another format into the mix. It may even be useful to some parts of the OWL/semweb community outside bioinformatics.

 

infrastructure · Ontologies · Ontology Development Kit · Tools · Uncategorized

Creating an ontology project, an update

In a previous post, I recommended some standard ways of managing the various portions of an ontology project using a version control system like GitHub.

Since writing that post, I’ve written a new utility that makes this task even easier. With the ontology-starter-kit you can generate all your project files and get set up for creating your first release in minutes. This script takes into account some changes since the original post two years ago:

  • Travis-CI has become the de facto standard continuous integration system for performing unit tests on any project managed in GitHub (for more on CI see this post). The starter-kit will give you a default Travis setup (a minimal sketch is shown after this list).
  • Managing your metadata and PURLs on the OBO Library has changed to a GitHub-based system.
  • ROBOT has emerged as a simpler way of managing many aspects of a release process, particularly managing your external imports.
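
To give a concrete idea of the Travis piece, a minimal .travis.yml for an ontology repo might look roughly like this (a sketch only, not necessarily what the starter-kit generates; it assumes the generated Makefile provides a test target):

language: java
script:
  - cd src/ontology && make test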

Getting started

To get started, clone or download cmungall/ontology-starter-kit

Currently, you will need:

  • perl
  • make
  • git (command line client)

For best results, you should also download owltools, oort and robot (in the future we’ll have a more unified system)

You can obtain all these by running the install script:

./INSTALL.sh

This should be run from within the ontology-starter-kit directory

Then, from within that directory, you can seed your ontology:

./seed-my-ontology-repo.pl  -d ro -d uberon -u obophenotype -t cnidaria-ontology cnido

 

This assumes that you are building some kind of extension to uberon, using the relation ontology (OBO Library ontology IDs must be used here), that you will be placing this in the https://github.com/obophenotype/ organization, that the repo name is cnidaria-ontology, and that IDs will be of the form CNIDA:nnnnnnn.

After running, the repository will be created in the target/cnidaria-ontology folder, relative to where you are. You can move this out to somewhere more convenient.

The script is chatty, and it informs you of how it is copying the template files from the template directory into the target directory. It will create your initial source setup, including a makefile, and then it will use that makefile to create an initial release, going so far as to init the git repo and add and commit files (unless overridden). It will not go as far as to create a repo for you on github, but it provides explicit instructions on what you should do next:


EXECUTING: git status
# On branch master
nothing to commit, working directory clean
NEXT STEPS:
0. Examine target/cnidaria-ontology and check it meets your expectations. If not blow it away and start again
1. Go to: https://github.com/new
2. The owner MUST be obophenotype. The Repository name MUST be cnidaria-ontology
3. Do not initialize with a README (you already have one)
4. Click Create
5. See the section under '…or push an existing repository from the command line'
E.g.:
cd target/cnidaria-ontology
git remote add origin git@github.com:obophenotype/cnido.git
git push -u origin master

Note that it also generates a metadata directory for you, with .md and .yml files you can use for your project on obolibrary (of course, you need to request your ontology ID space first, but you can go ahead and make a pull request with these files).

Future development

The overall system may no longer be necessary in the future, if we get a complete turnkey ontology release system with capabilities similar to analogous tools in software development, such as Maven.

For now, the Makefile approach is the most flexible, and is widely understood by many software developers, but a long-standing obstacle has been the difficulty of setting up the Makefile for a new project. The starter kit provides a band-aid here.

If required, it should be possible to set up alternate templates for different styles of project layouts. Pull requests on the starter-kit repository are welcome!


OBO-Format · Ontologies · Phenotype Ontologies · Reasoning · Tutorials

Introduction to Protege and OWL for the Planteome project

As a part of the Planteome project, we develop common reference ontologies and applications for plant biology.

Planteome logo

As an initial phase of this work, we are transitioning from editing standalone ontologies in OBO-Edit to integrated ontologies using increased OWL axiomatization and reasoning. In November we held a workshop that brought together plant trait experts from across the world, and developed a plan for integrating multiple species-specific ontologies with a reference trait ontology.

As part of the workshop, we took a tour of some of the fundamentals of OWL, hybrid obo/owl editing using Protege 5, and using reasoners and template-based systems to automate large portions of ontology development.

I based the material on an earlier tutorial prepared for the Gene Ontology editors; it's available on the Planteome GitHub repository at:

https://github.com/Planteome/protege-tutorial

Announcements · Ontologies · OWLAPI · Tools

owljs – a javascript library for OWL hacking

owljs is a javascript library for doing stuff with OWL. It's available from github:

https://github.com/cmungall/owljs

Whilst it attempts to follow CommonJS, you currently have to use RingoJS (a Rhino-based engine), as it makes use of JVM calls to the OWL API.

owl plus rhino equals fun

 

Why javascript?

Why javascript you may ask? Isn’t that a hacky language run in browsers? In fact javascript is increasingly used on the server side as well as in browsers, as can be seen in the success of node.js. With Java 8 incorporating Nashorn as a scripting engine, it looks like javascript on the server side is here to stay.

Why not just use java? Java can be very verbose and is not the ideal language for writing short ontology processing scripts, especially with the OWL API.

There are a number of other languages better suited to scripting and declarative programming in general, many of which run on the JVM. These include:

  • Groovy – a popular choice for interfacing with the OWL API
  • The Armed Bear flavor of Common Lisp, as used in LSW2.
  • Clojure, a variant of lisp, as used in Phil Lord’s powerful Tawny-OWL framework.
  • Scala, a superbly elegant functional programming language used to great effect in Jim Balhoff’s beautifully elegant scowl.
  • Iron Python – a popular choice for interfacing with the Brain. And of course, Python is the de facto language for bioinformatics these days

There are also offerings in non-JVM languages, such as my own posh – in addition, most languages provide some kind of RDF library, but these can often be too low-level for working with OWL.

I decided to write a javascript library for a number of reasons. Our group already produces a lot of javascript code, most of which can be run on the server. For example, the golr libraries used in the AmiGO 2 codebase are CommonJS, as are those used for the Monarch API. These APIs all access ontologies through services (and can thus be run on a non-JVM javascript engine like node), and we would not want to make them depend on a JVM. However, the ability to go the other way is useful: from a powerful ontology processing environment that offers all the features of the OWL API, we can access all kinds of bioinformatics data through ready-made javascript APIs.

Another reason is that JSON is ubiquitous, and having your data format be a subset of the language has some major advantages.

Plus, after an initial period of ambivalence, I have grown to really like javascript. It’s just functional enough to do some cool things.

What can you do with it?

I hope to provide some specific examples later on this blog. For now, take a look at the docs on github. Major features are:

Stay tuned for more information!


infrastructure · OBO-Format · Ontologies · OWLAPI · Tools · Uberon

The perils of managing OWL in a version control system

Background

Version Control Systems (VCSs) are commonly used for the management
and deployment of biological ontologies. This has many advantages,
just as is the case for software development. Standard VCS
environments and hosting solutions like github provide a wealth of
features including easy access to historic versions, branching, forking, diffs, annotation of changes, etc.

VCSs also integrate well with Continuous Integration systems. For example, a CI system can be configured to run a series of checks, and even publish a release, triggered by a git commit/push.

OBO Format was designed with VCSs in mind. One of the main guiding
principles was that ontologies should be diffable. In order to
guarantee this, the OBO format specifies a recommended tag ordering
ensuring that serialization of an ontology into a file is
deterministic. OBO format was also designed such that ascii-level
diffs were as human readable as possible.
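
For example, an OBO term stanza is short, line-oriented text, so editing a single field typically shows up as a one- or two-line diff (the term shown is just for illustration):

[Term]
id: GO:0008150
name: biological_process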

OBO Format is a deprecated format – I recommend groups switch to using one of the W3C concrete forms of OWL. However, this comes with one caveat: if the source (editors') version of an ontology is switched from obo to any other OWL serialization, then human-readable diffs are lost. Additionally, the non-deterministic serialization of the ontology results in spurious diffs that not only hamper human-readability, but also cause bottlenecks in the VCS. As an example, releasing a version of the Uberon ontology can consume over an hour simply performing SVN operations.

The issue of human-readability is being addressed by a working group
to extend Manchester Syntax (email me for further details). Here I
focus not on readability of diffs, but on the size of diffs, as this
is an important aspect of managing an ontology in a VCS.

Methods

I measured the “diffability” of different OWL formats by taking a mid-size ontology incorporating a wide range of OWL constructs (Uberon) and measuring the size of diffs between two ontology versions in relation to the change in the number of axioms.

Starting with the 2014-03-28 release of Uberon, I iteratively removed
axioms from the ontology, saved the ontology, and measured the size of
the diff. The diff size was simply the number of lines output using
the unix diff command (“wc -l”).

This was done for the following OWL formats: obo, functional
notation (ofn), rdf/xml (owl), turtle (ttl) and Manchester notation
(omn). The number of axioms removed was 1, 2, 4, 8, .. up to
2^16. This was repeated ten times.

The OWL API v3 version 0.2.1-SNAPSHOT was used for all serializations,
except for OBO format, which was performed using the 2013-03-28
version of oboformat.jar. OWLTools was used as the command line
wrapper.
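
The measurement itself boils down to a one-liner per format and per number of removed axioms, along these lines (file names illustrative):

# count changed lines between the original and the axiom-depleted serialization
diff uberon-2014-03-28.ofn uberon-minus-1024-axioms.ofn | wc -l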

Results

The results can be downloaded HERE, and are plotted in the following
figure.

 

Plot showing size of diffs in relation to number of axioms added/removed

As can be seen there is a marked difference between the two RDF
formats (RDF/XML and Turtle) and the dedicated OWL serializations
(Manchester and Functional), which have roughly similar diffability to
OBO format.

In fact the diff size for the RDF formats is both constant and large, regardless of the number of axioms changed. This appears to be due to non-determinism when serializing axiom annotations.

This analysis only considers a single ontology, and a single version of the OWL API.

Discussion and Conclusions

Based on these results, it would appear to be a huge mistake to ever manage an RDF serialization of OWL in a VCS. Using Manchester or Functional syntax gives superior diffability, with the size of the diff proportional to the number of axioms changed. OBO format offers human-readable diffs as well, but this format is limited in expressivity.

These recommendations are consistent with the size of the file in each format.

The following numbers are for Uberon:

  • obo 11M
  • omn 28M
  • ofn 37M
  • owl 53M
  • ttl 58M

However, one issue here is that RDF-level tools may not accept a dedicated OWL serialization such as ofn or omn. Most RDF libraries will, however, accept RDF/XML or Turtle.

The ontology manager is then faced with a quandary – cut themselves off from a segment of the semantic web and have diffs that are manageable (if not readable), or live with enormous spurious diffs for the benefit of SW integration.

The best solution would appear to be to manage source versions in a diffable format, and release in a more voluminous RDF/semweb format. This is not so different from software management – users consume a compiled version of the software (jars, object files, etc) while the software is maintained as diffable source. It's generally considered bad practice to check derived products into a VCS.
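
In concrete terms, this could be as simple as a Makefile rule that keeps the editors' file in a diffable syntax and derives the released RDF/XML from it. For example, a sketch using ROBOT (file names illustrative; any OWL converter would do):

# derive the released RDF/XML from the functional-syntax editors' file
# (the recipe line must start with a tab)
uberon.owl: uberon-edit.ofn
	robot convert --input uberon-edit.ofn --output uberon.owl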

However, this answer is not really satisfactory to maintainers of
ontologies, who lack tools as mature as those in the software
realm. We do not yet have the equivalent of Maven, CPAN, NPM, Debian,
etc for ontologies*. Modern ontologies have dependencies managed using
OWL imports that do not mesh well with simple repositories like
Bioportal that treat each ontology as a monolithic unit.

The approach I would recommend is therefore to adapt the RDF/XML generator of the OWL API such that it is deterministic, or to write an RDF round-tripper that always produces a deterministic serialization. This should be coupled with ongoing efforts to add human-readable class labels as comments to enhance the readability of diffs. Ideally the recommended deterministic serialization order would be formally specified, such that different software (and different versions of the same software) could adhere to it.

At the same time, we need to be working on analogs of maven and
package management systems in the ontology world.

 

Footnote:

Some ongoing efforts to mavenize ontologies:

Updates:


infrastructure · Ontologies · Tools

Creating an ontology project

UPDATE: see the latest post on this subject.

 

This article describes how to manage all the various files created as part of an ontology project. This assumes some familiarity with the unix command line. The article does not describe how to create the actual content for your ontology. For this I recommend the material from the NESCent course on building anatomy ontologies, put together by Melissa Haendel and Matt Yoder, as well as the OBO Foundry tutorial from ICBO 2013.

Some of the material here overlaps with the page on making a google code project for an ontology.

Let’s make a jelly project

For our purposes here, let’s say you’re a Cnidaria biologist and you’re all ready to start building an ontology that describes the anatomy and traits of this phylum. How do you go about doing this?

Portuguese man-of-war (Physalia physalis)

Well, you could just fire up Protege and start building the thing, keeping the owl file on your desktop, periodically copying the file to a website somewhere. But this isn’t ideal. How will you track your edits? How will you manage releases? What about imports from other ontologies (you do intend to import parts of other ontologies, don’t you? If the answer is “no” go back and read the course material above!).

It’s much better to start off on the right foot, keeping all your files organized according to a common directory layout, and making use of simple and sensible practices from software engineering.

OORT

As part of your release process you’ll make use of OWLTools and OORT, which can be obtained from the OWLTools google code repository.

Make sure you have OWLTools-Oort/bin/ in your PATH

The first thing to do is to create a directory on your machine for managing all your files – as an ontology developer you will be managing a collection of files, not just the core ontology file itself. We’ll call this directory the “ontology project”.

To make things easy, owltools comes with a handy script called create-ontology-project  to create a stub ontology project. This script is distributed with OWLTools but is available for download here:

http://owltools.googlecode.com/svn/trunk/OWLTools-Oort/bin/create-ontology-project

The first thing to do is select your ontology ID (namespace). This *must* be the same as the ID space you intend to use. So if your URIs/IDs are to be CNIDO_0000001 and so on, the ontology ID *must* be “cnido”. Note that whilst your IDs will be in SHOUTY CAPITALS, the actual ontology itself is all gentle lowercase, even the first letter. This is actually part of OBO Foundry ID policy.
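
For example, under the standard OBO PURL scheme the class ID CNIDO:0000001 corresponds to the URI http://purl.obolibrary.org/obo/CNIDO_0000001, and (once the ID space is registered) the ontology itself would be published at http://purl.obolibrary.org/obo/cnido.owl – uppercase for term IDs, lowercase for the ontology ID.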

Running the script

Now, type this on the command line:


create-ontology-project cnido

(you will need to add this to your path – it’s in the OWLTools-Oort/bin directory).

You will see the following output:


SUCCESS!
Directory Listing:
-----------------
cnido
cnido/doc
cnido/doc/README.txt
cnido/images
cnido/images/README.txt
cnido/LICENSE.txt
cnido/README.txt
cnido/src
cnido/src/ontology
cnido/src/ontology/catalog-v001.xml
cnido/src/ontology/CHANGES
cnido/src/ontology/diffs
cnido/src/ontology/diffs/Makefile
cnido/src/ontology/imports
cnido/src/ontology/imports/README.txt
cnido/src/ontology/cnido-edit.owl
cnido/src/ontology/cnido-idranges.owl
cnido/src/ontology/Makefile
cnido/tools
cnido/tools/README.txt

followed by:

What now?
* Create a git or svn project. E.g.:
cd cnido
git init
git add -A
git commit -m 'initial commit'
* Now visit github.com and create project cnido-ontology
* Edit ontology and create first release
cd cnido/src/ontology
make initial-build
* Create a jenkins job

What next?

You may not need all the stub files from the outset, but it’s a good idea to have them there anyway, as you may need them in future.

I recommend your first step is to follow the instructions above to (1) initiate a local git repository by typing the 4 commands above, and (2) publish this on github. You will need to go to github.com, create an account, create a project, and select “create a project from an existing repository” (more on this later).

Once this is done, you can start modifying the stub files.

The top-level README provides a high-level overview of your project. You should describe the content and use cases here. You can edit this in a normal text editor — alternatively, if you intend to use github (recommended), you can wait until you commit and push this file and then edit it via the github web interface.

You will probably spend most of your time in the src/ontology directory, editing cnido-edit.owl

id-ranges

If you intend to have multiple people editing, then the cnido-idranges.owl file will be essential. You can edit this directly in Protege (though it may actually be easier to edit the file in a text editor). Assign each editor an ID range here (just follow the existing file as an example; a sketch is shown below). Note that currently Protege does not read this file, so it just serves as formal documentation.
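
To give a flavour, an id-ranges file is a small OWL file that carves the numeric ID space up per editor, along roughly these lines (Manchester syntax; the annotation property and the ranges are purely illustrative, not the exact convention):

Datatype: idrange:1
    Annotations: allocatedto: "Editor One"
    EquivalentTo: xsd:integer[>= 1 , <= 9999]

Datatype: idrange:2
    Annotations: allocatedto: "Editor Two"
    EquivalentTo: xsd:integer[>= 10000 , <= 19999]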

In future, id-ranges may be superseded by urigen servers, but for now they provide a useful way of avoiding collisions.

Documentation

If you use github or another hosting solution like google code, you can use their wiki system. You should keep any files associated with the documentation (word docs, presentations, etc) in the doc/ folder. You can link to them directly from the wiki.

Images

You can ask the OBO admins to help you set up a purl redirect such that URLs of the form

http://purl.obolibrary.org/obo/cnido/images/

will redirect to your images/ directory, which is where you will place any pictures of jellyfish or anemone body parts that you want to be associated with classes in the ontology (assuming you have the rights to do this). I recommend Jim Balhoff’s depictions plugin.

Imports

Managing your imports can be a difficult task and deserves its own article.

For now you can browse the ctenophore-ontology project to see an example of a setup, in particular:

This setup uses the OWLAPI for imports, but others prefer to make use of OntoFox.

Releases

You can use OORT to create your release files. The auto-generated Makefile stub should be sufficient to manage a basic release pipeline. In the src/ontology directory, type this:


make all

This should make both cnido.obo and cnido.owl – these will be the files the rest of the world sees. cnido-edit is primarily seen by you and your fellow cnidarian-obsessed editors.

Caveats

Depending on the specific needs of your project, some of the defaults and stubs provided by the create-ontology-project script may not be ideal for you. Or you may simply prefer to create the directory structure manually – it’s not very hard – and this is of course fine. The script is provided primarily to help you get started; hopefully it will prove useful.

Finally, if you know any cnidarian biologists interested in contributing to an ontology, let me know as we are lacking detailed coverage in existing ontologies!