Introduction to Protege and OWL for the Planteome project

As a part of the Planteome project, we develop common reference ontologies and applications for plant biology.


As an initial phase of this work, we are transitioning from editing standalone ontologies in OBO-Edit to integrated ontologies using increased OWL axiomatization and reasoning. In November we held a workshop that brought together plant trait experts from across the world, and developed a plan for integrating multiple species-specific ontologies with a reference trait ontology.

As part of the workshop, we took a tour of some of the fundamentals of OWL, hybrid obo/owl editing using Protege 5, and using reasoners and template-based systems to automate large portions of ontology development.

I based the material on an earlier tutorial prepared for the Gene Ontology editors. It is available on the Planteome GitHub repository at:

https://github.com/Planteome/protege-tutorial


GO annotation origami: Folding and unfolding class expressions

With the introduction of Gene Association Format (GAF) v2, curators are no longer restricted to pre-composed GO terms – they can use a limited form of anonymous OWL Class Expressions of the form:

GO_Class AND (Rel_1 some V_1) AND (Rel_2 some V_2)

The set of relationships is specified in column 16 of the GAF file.

However, many tools are not capable of using class expressions – they discard the additional information leaving only the pre-composed GO_Class.

Using OWLTools it is possible to translate a GAF-v2 set of associations and an ontology O to an equivalent GAF-v1 set of associations plus an analysis ontology O-ext. The analysis ontology O-ext contains the set of anonymous class expressions folded into named classes, together with equivalence axioms, and pre-reasoned into a hierarchy using Elk.

See http://code.google.com/p/owltools/wiki/AnnotationExtensionFolding

For example, given a GO annotation of a gene ‘geneA’:

gene: geneA
annotation_class:  GO:0006915 ! apoptosis
annotation_extension: occurs_in(CL:0000700) ! dopaminergic neuron

The folding process will generate a class with a non-stable URI, automatic label and equivalence axiom:

Class: GO/TEMP_nnnn
  Annotations: label "apoptosis and occurs_in some dopaminergic neuron"
  EquivalentTo: 'apoptosis' and occurs_in some 'dopaminergic neuron'
  SubClassOf: 'neuron apoptosis'

This class will automatically be placed in the hierarchy using the reasoner (e.g. under ‘neuron apoptosis’). For the reasoning step to achieve optimal results, the go-plus-dev.owl version should be used (see new GO documentation). A variant of this step is to perform folding to find a more specific subclass than the one used for direct annotation.
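The folding step can be illustrated with a small Python sketch. This is not the OWLTools implementation: the function name `fold`, the `labels` lookup table and the `GO/TEMP_nnnn` counter are assumptions made for illustration, and placing the class in the hierarchy is still left to a reasoner.

```python
import itertools
import re

# Matches one annotation extension such as "occurs_in(CL:0000700)".
EXTENSION = re.compile(r"(\w+)\(([^()]+)\)")
_counter = itertools.count(1)

def fold(go_label, extensions, labels):
    """Fold a GO class plus column-16 extensions into a named class frame."""
    parts = [f"'{go_label}'"]
    for rel, filler_id in EXTENSION.findall(extensions):
        parts.append(f"{rel} some '{labels[filler_id]}'")
    expression = " and ".join(parts)
    label = expression.replace("'", "")
    temp_id = f"GO/TEMP_{next(_counter):04d}"  # hypothetical, non-stable ID scheme
    return "\n".join([
        f"Class: {temp_id}",
        f'  Annotations: label "{label}"',
        f"  EquivalentTo: {expression}",
    ])

# Hypothetical label lookup for the filler classes used in column 16.
labels = {"CL:0000700": "dopaminergic neuron"}
frame = fold("apoptosis", "occurs_in(CL:0000700)", labels)
print(frame)
```

The sketch only emits the equivalence axiom. The SubClassOf placement under ‘neuron apoptosis’ is what the reasoner would add afterwards.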

The reverse operation – unfolding – is also possible. For optimal results, this relies on Equivalent Classes axioms declared in the ontology, so make sure to use go-plus-dev.owl. Here an annotation to a pre-composed complex term (e.g. neuron apoptosis) is replaced by an annotation to a simpler GO term (e.g. apoptosis) with column 16 filled in (e.g. occurs_in(neuron)).
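Unfolding can be sketched in the same spirit. The table `EQUIVALENTS` below is a hand-written stand-in for the equivalence axioms that would be read from the ontology, and its single entry simply mirrors the example in the text rather than the actual GO definition.

```python
# Hypothetical equivalence axioms: pre-composed class -> (genus, [(relation, filler)]).
EQUIVALENTS = {
    "neuron apoptosis": ("apoptosis", [("occurs_in", "neuron")]),
}

def unfold(term):
    """Return (simpler_term, column_16) for a pre-composed term, if an axiom is known."""
    if term not in EQUIVALENTS:
        return term, ""  # nothing to unfold; leave the annotation unchanged
    genus, differentia = EQUIVALENTS[term]
    column_16 = ",".join(f"{rel}({filler})" for rel, filler in differentia)
    return genus, column_16

print(unfold("neuron apoptosis"))  # ('apoptosis', 'occurs_in(neuron)')
```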

The folding operation allows legacy tools to take some advantage of GO annotation extensions by generating an ‘analysis ontology’ (care must be taken in how this is presented to the user, if at all). Ideally more tools will use OWL as the underlying ontology model and be able to handle column 16 annotations directly, ultimately requiring less pre-coordination in the GO.

 

Querying for connections between the GO and FMA

Can we query for connections between FMA and GO? This should be
possible by using a combination of

  • GO
  • Uberon
  • FMA
  • Axioms linking GO and Uberon (x-metazoan-anatomy)
  • Axioms linking FMA and Uberon (uberon-to-fma)

This may seem like more components than is necessary. However,
remember that GO is a multi-species ontology, and “heart development”
in GO covers not only vertebrate hearts, but also (perhaps
controversially) drosophila “hearts”. In contrast, the FMA class for
“heart” represents a canonical adult human heart. This is why we have
to go via Uberon, which covers similar taxonomic territory to GO. The
uberon class called “heart” covers all hearts.

GO to metazoan anatomical structures

http://purl.obolibrary.org/obo/go/extensions/x-metazoan-anatomy.owl contains axioms of the form:


'heart morphogenesis' EquivalentTo 'anatomical structure morphogenesis' and
'results in morphogenesis of' some uberon:heart

(note that sub-properties of ‘results in developmental progression of’
are used here)

Generic metazoan anatomy to FMA

http://purl.obolibrary.org/obo/uberon/bridge/uberon-bridge-to-fma.owl contains axioms of the form:


fma:heart EquivalentTo uberon:heart and part_of some 'Homo sapiens'

GO to FMA

Note that there is no existential dependence between GO ‘heart
development’ and fma:heart. This is as it should be – if there were no
human hearts then there would still be heart development
processes. This issue is touched on in Chimezie Ogbuji’s presentation at DILS 2012.

This lack of existential dependence has consequences for querying
connections. An OWL query for:

?p SubClassOf ‘results in developmental progression of’ some ?u

Will return GO-Uberon connections only.

We must perform a join in order to get what we want:

?p SubClassOf ‘results in developmental progression of’ some ?u,
?a SubClassOf ?u,
?a SubClassOf part_of some ‘Homo sapiens’

Actually executing this query is not straightforward. Ideally we would
have a way of using OWL syntax, such as the above. To get complete
results, either EL++ or RL reasoning is required. In the next post I’ll present some possible options for issuing this query.
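To make the shape of the join concrete, here is a minimal Python sketch over toy data. The three lists are hand-written stand-ins for subsumptions that a reasoner would have to entail first, and the identifiers are illustrative labels, not real IRIs.

```python
# Toy, hand-written entailments standing in for reasoner output (illustrative only).
# ?p SubClassOf 'results in developmental progression of' some ?u
develops = [("GO:heart development", "UBERON:heart")]
# ?a SubClassOf ?u
subclass_of = [("FMA:heart", "UBERON:heart")]
# ?a SubClassOf part_of some ?taxon
part_of_taxon = [("FMA:heart", "Homo sapiens")]

def go_to_human_anatomy(develops, subclass_of, part_of_taxon, taxon="Homo sapiens"):
    """Join the three patterns on ?u and ?a and return (?p, ?u, ?a) rows."""
    in_taxon = {a for a, t in part_of_taxon if t == taxon}
    return sorted(
        (p, u, a)
        for p, u in develops
        for a, u2 in subclass_of
        if u == u2 and a in in_taxon
    )

rows = go_to_human_anatomy(develops, subclass_of, part_of_taxon)
print(rows)
```

The hard part in practice is producing the inputs: the second and third lists are only complete after classification with the bridge axioms loaded.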

Elk disjoint hack

Elk is a blindingly fast EL++ reasoner. Unfortunately, it doesn’t yet support the full EL++ profile – in particular it lacks disjointness axioms. This is unfortunate, as these kinds of axioms are incredibly useful for integrity checking. See the methods section of the Uberon paper for some details on how partwise disjointness axioms were created.

However, Elk does support intersection and equivalence. This means we should be able to perform a translation:

DisjointClasses(x1, x2, …, xn) ⇒
EquivalentClasses(owl:Nothing IntersectionOf(xi xj)) for all i<j<=n

I asked about this on the Elk mailing list; see the thread “Satisfiability checking and DisjointClasses axioms”.

The problem is that whilst Elk supports intersection and equivalence, it doesn’t support Nothing. This means that there may be corner cases in which it doesn’t work.
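The translation itself is mechanical. The following Python sketch only illustrates the rewrite rule by generating axiom strings; it is not the OWLTools code, and it spells the constructor as `ObjectIntersectionOf`, the name used in OWL 2 functional syntax.

```python
from itertools import combinations

def disjoints_to_equivalents(classes):
    """Rewrite DisjointClasses(x1..xn) as pairwise intersections equivalent to owl:Nothing."""
    return [
        f"EquivalentClasses(owl:Nothing ObjectIntersectionOf({a} {b}))"
        for a, b in combinations(classes, 2)
    ]

axioms = disjoints_to_equivalents([":x1", ":x2", ":x3"])
for axiom in axioms:
    print(axiom)
```

A DisjointClasses axiom over n classes yields n(n-1)/2 pairwise axioms, so three classes produce three.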

Proper disjointness support may be coming in the next version of Elk, but it’s been a few months so I decided to go ahead and implement the above translation in OWLTools (also available in Oort).

If we have an ontology such as foo.owl:

Ontology: <http://example.org/x.owl>

Class: :reasoner
Class: :animal
  DisjointWith: :reasoner

Class: :elk
  SubClassOf: :reasoner, :animal

We can translate it using owltools:

owltools foo.owl --translate-disjoints-to-equivalents -o file://`pwd`/foo-x.owl

Remember, the ordering of arguments is significant in owltools: make sure you translate *after* the ontology is loaded.

Then load the result into Protege and reason over it using Elk. As expected, “elk” is unsatisfiable.

You can also do the checking directly in owltools:

owltools foo.owl --translate-disjoints-to-equivalents --run-reasoner -r elk -u

The “-u” option will check for unsatisfiable classes and exit with a nonzero code if any are found, allowing this to be used within a CI system like Jenkins (see this previous post).
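A build script can rely on nothing more than that exit code. The sketch below shows the pattern in Python; the helper name `check_passes` is an assumption, and the two commands are stand-ins that simply exit with 0 or 1 so the example runs without owltools installed.

```python
import subprocess
import sys

def check_passes(command):
    """Run a command and report whether it exited with code 0."""
    return subprocess.run(command).returncode == 0

# Stand-ins for a real checker: one command that succeeds and one that fails.
ok = check_passes([sys.executable, "-c", "raise SystemExit(0)"])
failed = check_passes([sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)
```

In a real job the command list would be the owltools invocation shown above, split into its arguments.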

You can also use this transform within Oort (command line version only):

ontology-release-runner --translate-disjoints-to-equivalents --reasoner elk foo.owl

Remember, there are corner cases where this translation will not work. Nevertheless, this can be useful as part of an “early warning” system, backed up by slower guaranteed checks running in the background with HermiT or some other reasoner.

Perhaps the ontologies I work with have a simpler structure, but so far I have found this strategy to be successful, identifying subtle part-disjointness problems, and not giving any false positives. There don’t appear to be any scalability problems, with Elk being its usual zippy self even when uberon is loaded with ncbitaxon/taxslim and taxon constraints translated into Nothing-axioms (~3000 disjointness axioms).

 

Taxon constraints in OWL

A number of years ago, the Gene Ontology database included such curiosities as:

  • A slime mold gene that had a function in fin morphogenesis
  • Chicken genes that were involved in lactation

These genes would be pretty fascinating, if they actually existed. Unfortunately, these were all annotation errors, arising from a liberal use of inference by sequence similarity.

We decided to adopt a formalism specified by Wacek Kusnierczyk[1], in which we placed taxon constraints on classes in the ontology, and used these to detect annotation errors[2].

The taxon constraints make use of two relations, only_in_taxon and never_in_taxon.

 

You can see examples of usage in GO either in QuickGO (e.g. lactation), or by opening the x-taxon-importer.owl ontology in Protege. This ontology is used in the GO Jenkins environment to detect internal inconsistencies in the ontology.

The same relations are also in use in another multi-species ontology, Uberon[3].

 

In uberon, the constraints are used for ontology consistency checking, and to provide taxon subsets – for example, aves-basic.owl, which excludes classes such as mammary gland, pectoral fin, etc.

Semantics of the shortcut relations

In the Deegan et al paper we described a rule-based procedure for using the taxon constraint relations. This has the advantage of being scalable over large taxon ontologies and large gene association sets. But a better approach is to encode the constraints directly as OWL axioms and use a reasoner, which requires choosing a particular way of representing a taxonomy.

Both relations make use of a class-based representation of a taxonomy such as ncbitaxon.owl or a subset such as taxslim.owl.
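For intuition, the rule-based style of check can be sketched in a few lines of Python. This is a simplified illustration, not the published procedure: the taxonomy and constraints are toy data, and propagation of constraints down the ontology hierarchy is omitted.

```python
# Toy taxonomy: child -> parent (illustrative names, not real NCBITaxon IDs).
PARENT = {
    "Gallus gallus": "Aves", "Aves": "Vertebrata",
    "Mus musculus": "Mammalia", "Mammalia": "Vertebrata",
    "Vertebrata": "Eukaryota",
    "Dictyostelium discoideum": "Eukaryota",
}

# Toy constraints on ontology terms: (relation, taxon).
CONSTRAINTS = {
    "lactation": [("only_in_taxon", "Mammalia")],
    "fin morphogenesis": [("only_in_taxon", "Vertebrata")],
}

def ancestors_or_self(taxon):
    """Return the taxon together with all of its ancestors."""
    lineage = {taxon}
    while taxon in PARENT:
        taxon = PARENT[taxon]
        lineage.add(taxon)
    return lineage

def violations(term, gene_taxon):
    """List the constraints on `term` that an annotation for `gene_taxon` breaks."""
    lineage = ancestors_or_self(gene_taxon)
    broken = []
    for relation, taxon in CONSTRAINTS.get(term, []):
        if relation == "only_in_taxon" and taxon not in lineage:
            broken.append((relation, taxon))
        if relation == "never_in_taxon" and taxon in lineage:
            broken.append((relation, taxon))
    return broken

print(violations("lactation", "Gallus gallus"))
print(violations("lactation", "Mus musculus"))
```

The chicken annotation is flagged because Mammalia is not in the chicken lineage, while the mouse annotation passes.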

We can treat the taxon constraint relations as convenient shortcut relations which ‘expand’ to OWL axioms that capture the intended semantics in terms of a standard ObjectProperty “in_organism”. For now we leave in_organism undefined, but the basic idea is that for anatomical structures and cell components “in_organism” is the part_of parent that is an organism, whereas for processes it is the organism that encodes the gene products that execute the process.

In fact there are two ways to expand to the “in_organism” class axioms:

The more straightforward way:

?X only_in_taxon ?Y ===> ?X SubClassOf in_organism only ?Y
?X never_in_taxon ?Y ===> ?X SubClassOf in_organism only not ?Y

To achieve the desired entailments, it is necessary for sibling taxa to be declared disjoint (e.g. Eubacteria DisjointWith Eukaryota). Note that these disjointness axioms are not declared in the default NCBITaxon translation.

A different way which has the advantage of staying within the OWL2-EL subset:

?X only_in_taxon ?Y ===> ?X SubClassOf in_organism some ?Y
?X never_in_taxon ?Y ===> ?X DisjointWith in_organism some ?Y

This requires all sibling nodes (A,B) in the NCBI taxonomy to have a
General Axiom:

in_organism some ?A DisjointWith in_organism some ?B

These general axioms are automatically generated and available in taxslim-disjoint-over-in-taxon.owl
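As a rough illustration of the OWL2-EL expansion, the Python sketch below turns shortcut assertions into axiom strings and generates the sibling general axioms. The function names are assumptions, and the output is plain text in the notation used above, not a serialized ontology.

```python
from itertools import combinations

def expand_el(subject, relation, taxon):
    """Expand a shortcut relation into the OWL2-EL friendly axiom form."""
    if relation == "only_in_taxon":
        return f"{subject} SubClassOf in_organism some {taxon}"
    if relation == "never_in_taxon":
        return f"{subject} DisjointWith in_organism some {taxon}"
    raise ValueError(f"unknown shortcut relation: {relation}")

def sibling_disjointness(children):
    """Generate the general axioms for every pair of sibling taxa."""
    return [
        f"in_organism some {a} DisjointWith in_organism some {b}"
        for a, b in combinations(sorted(children), 2)
    ]

print(expand_el("lactation", "only_in_taxon", "Mammalia"))
print(sibling_disjointness(["Eukaryota", "Eubacteria"]))
```

The number of general axioms grows quadratically with the number of siblings under each parent, which is one reason to generate them automatically.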

Taxon groupings

GO also makes use of taxon groupings. These include new classes such as “prokaryotes”, which are defined using UnionOf axioms. They are available in go-taxon-groupings.owl.

Taxon modules

One of the uses of taxon constraints is to build taxon-specific subsets of ontologies. This will be covered in a future post.

References

  1. Waclaw Kusnierczyk (2008) Taxonomy-based partitioning of the Gene Ontology, Journal of Biomedical Informatics
  2. Deegan Née Clark, J. I., Dimmer, E. C., and Mungall, C. J. (2010). Formalization of taxon-based constraints to detect inconsistencies in annotation and ontology development. BMC Bioinformatics 11, 530
  3. Mungall, C. J., Torniai, C., Gkoutos, G. V., Lewis, S. E., and Haendel, M. A. (2012) Uberon, an integrative multi-species anatomy ontology Genome Biology 13, R5. http://genomebiology.com/2012/13/1/R5

Ontologies and Continuous Integration

Wikipedia describes continuous integration (http://en.wikipedia.org/wiki/Continuous_integration) as follows:

In software engineering, continuous integration (CI) implements continuous processes of applying quality control — small pieces of effort, applied frequently. Continuous integration aims to improve the quality of software, and to reduce the time taken to deliver it, by replacing the traditional practice of applying quality control after completing all development.

This description could – or should – apply equally well to ontology engineering, especially in contexts such as the OBO Foundry, where ontologies are becoming increasingly interdependent.

Jenkins is a web-based environment for running integration checks. Sebastian Bauer, in Peter Robinson’s group, had the idea of adapting Jenkins for performing ontology builds rather than software builds (in fact he used Hudson, but the differences between Hudson and Jenkins are minimal). He used Oort as the tool to build the ontology. Oort takes in one or more ontologies in obo or owl, runs some non-logical and logical checks (via your choice of reasoner) and then “compiles” downstream ontologies in obo and owl formats. Converting to obo takes care of a number of stylistic checks that are non-logical and wouldn’t be caught by a reasoner (e.g. no class can have more than one text definition).
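To give a flavour of what a non-logical check looks like, here is a minimal Python sketch that flags stanzas with more than one `def:` tag in obo-format text. It is not how Oort performs the check; the function name and the sample stanzas are made up for illustration.

```python
from collections import Counter

def terms_with_multiple_definitions(obo_text):
    """Return IDs of stanzas that carry more than one `def:` tag."""
    counts = Counter()
    current = None
    for line in obo_text.splitlines():
        line = line.strip()
        if line.startswith("["):          # a new stanza such as [Term] begins
            current = None
        elif line.startswith("id:"):
            current = line[3:].strip()
        elif line.startswith("def:") and current is not None:
            counts[current] += 1
    return sorted(term for term, n in counts.items() if n > 1)

# Made-up stanzas: X:1 has two text definitions, X:2 has one.
SAMPLE = """\
[Term]
id: X:1
name: heart
def: "A pump." []
def: "A second definition." []

[Term]
id: X:2
name: nose
def: "A protuberance." []
"""
print(terms_with_multiple_definitions(SAMPLE))  # ['X:1']
```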

We took this idea and built our own Jenkins ontology build environment, adding ontologies that were of relevance to the projects we were working on. This turned out to be extraordinarily easy: Jenkins is very easy to install and configure, and help is always just a mouse click away.

Here’s a screenshot of the main Jenkins dashboard. Ontologies have a blue ball if the last build was successful, a red ball otherwise. The weather icon is based on the “outlook” – lots of successful builds in a row gets a sunny icon. Every time an ontology is committed to a repository it triggers a build (we try and track the edit version of the ontology rather than the release version, so that we can provide direct feedback to the ontology developer). Each job can be customized – for example, if ontology A depends on ontology B, you might want to trigger a build of A whenever a new version of B is committed, allowing you to be forewarned if something in B breaks A.

[Screenshot: main Jenkins view]

Below is a screenshot of the configuration settings for the go-taxon build, which is used to check for violations of the GO taxon constraints (dx.doi.org/10.1016/j.jbi.2010.02.002). We also include an external ontology of disjointness axioms (for various reasons it’s hard to include this in the main GO ontology). You can include any shell commands you like. In principle it would be possible to write a Jenkins plugin for building ontologies using Oort, but for now you have to be semi-familiar with the command line and the Oort command line options:

[Screenshot: go-taxon build configuration]

Often when a job fails the Oort output can be a little cryptic – generally the protocol is to do detailed investigation using Protege and a reasoner like HermiT to track down the exact problem.

The basic idea is very simple, but works extremely well in practice. Whilst it’s generally better to have all checks performed directly in the editing environment, this isn’t always possible where multiple interdependent ontologies are concerned. The Jenkins environment we’ve built has proven popular with ontology developers, and we’d be happy to add more ontologies to it. It’s also fairly easy to set up yourself, and I’d recommend doing this for groups developing or using ontologies in a mission-critical way.

UPDATE: 2012-08-07

I uploaded some slides on ontologies and continuous integration to slideshare.

UPDATE: 2012-11-09

The article Continuous Integration of Open Biological Ontology Libraries is available on the Bio-Ontologies SIG KBlog site.

The size of Richard Nixon’s nose, part I

Consider a simple model of Richard Nixon:

Individual: :nixon
Types: :Organism
Facts: :has_part :nixons_nose

Individual: :nixons_nose
Types: :nose
Facts: :has_characteristic :nixons_nose_size

Individual: :nixons_nose_size
Types: :big

[Figure: nixon haspart nose hasquality size]

Here are the relations in our background ontology:

ObjectProperty: :has_part
Characteristics: Transitive

ObjectProperty: :has_characteristic
InverseOf: :characteristic_of

ObjectProperty: :characteristic_of
InverseOf: :has_characteristic

We have 3 entities: Nixon, his nose, and the characteristic or quality that is Richard Nixon’s nose size. We follow BFO here in individuating qualities: thus even if I had a nose of the “same” size as Richard Nixon, we would not share the same nose-size quality instance, we would each have our own unique nose-size quality instance (for a nice treatment, see Neuhaus et al [PDF]).

Now let’s look at a phenotypic label such as “big nose”. Intuitively we can see that this applies to Richard Nixon. But where exactly in this instance graph is the big nose phenotype? Is it the nose, the size, or Richard himself?

Specifically, if we have a phenotype ontology with a term “increased size of nose” or “big nose”, what OWL class expression do we assign as an equivalent? We have to make a decision as to where to root the path through our instance graph. It might be:

  • The nose: i.e. nose and has_characteristic some big
  • The size:  i.e. big and characteristic_of some nose
  • The organism: i.e. has_part some (nose and has_characteristic some big)
  • some unspecified thing that has a relationship to one of the above

The structure of each OWL class expression can be visualized as a path through the nixon graph.

Our decision affects the classification we get from reasoning. A big nose is part of a funny face, but in contrast a person with a big nose is a subclass of a person with a funny face. If you then put your reasoner results into a phenotype analysis you might get different results.

To an ordinary common sense person whose brain hasn’t been infected by ontologies, the difference between a “nose that is increased in size”, an “increased size of nose” and a “person with a nose that’s increased in size” is just linguistic fluff, but the distinctions are important from an ontology modeling perspective.

Nevertheless, we may want to formalize the fact that we don’t care about these distinctions – we might want our “big nose” phenotype class to be any of the above.

One way would be to make fugly union classes, but this is tedious.

There is another way. We can introduce a generic “exhibits” relation. We elide a textual definition for now; the idea is that this relation captures the general notion of having a phenotype:

ObjectProperty: :exhibits
SubPropertyChain: :exhibits o :has_part
SubPropertyChain: :exhibits o :has_characteristic
SubPropertyChain: :exhibits o :characteristic_of
Characteristics: Transitive

We make this a super-relation of has_part:

ObjectProperty: :has_part
SubPropertyOf: :exhibits
Characteristics: Transitive

We can see exhibits is very promiscuous – when it connects to other relations, it makes a new exhibits relation.

Now let’s make some probe classes illustrating the different ways we could define our “don’t care where we root the graph” phenotype:

Class: :test1
EquivalentTo: :exhibits some (:big and :characteristic_of some :nose)

Class: :test2
EquivalentTo: :exhibits some (:has_part some (:nose and :has_characteristic some :big))

Class: :test3
EquivalentTo: :exhibits some (:nose and :has_characteristic some :big)

Class: :test4
EquivalentTo: :has_part some (:nose and :has_characteristic some :big)

After running the reasoner we get the following inferred hierarchy:

-- test1=test3
---- test2
---- test4

So we can see we are collapsing the distinction between  “increased size of nose” and “nose that is increased in size” by instead defining a class “exhibiting an increased size of nose”.

If we then try the DL-query tab in Protege, we can see that the individual “nixon” satisfies all of these expressions.

Why is this important? It means we can join and analyze datasets without performing awkward translations. Group 1 can take a quality-centric approach and Group 2 can take an entity-centric approach; the descriptions or data from either group will classify under the common “exhibits phenotype” class.

This works because of the declared inverse between has_characteristic and characteristic_of. Graphically we can think of this as “doubling back” along the edge from the nose to its size.
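The role of the inverse can be traced by hand with a small forward-chaining sketch in Python. It is not a DL reasoner: it only saturates the asserted nixon facts under the property axioms given earlier and then checks the test1 and test3 expressions against the named individuals. The helper names are assumptions.

```python
# Properties P for which "exhibits o P -> exhibits" holds (chains plus transitivity).
CHAINED = {"exhibits", "has_part", "has_characteristic", "characteristic_of"}

FACTS = {
    ("nixon", "has_part", "nixons_nose"),
    ("nixons_nose", "has_characteristic", "nixons_nose_size"),
}
TYPES = {"nixons_nose": "nose", "nixons_nose_size": "big"}

def closure(facts, use_inverse=True):
    """Saturate (subject, property, object) facts under the property axioms."""
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "has_part":
                new.add((s, "exhibits", o))              # has_part SubPropertyOf exhibits
            if use_inverse and p == "has_characteristic":
                new.add((o, "characteristic_of", s))     # declared inverse
            if use_inverse and p == "characteristic_of":
                new.add((o, "has_characteristic", s))
            for s2, p2, o2 in facts:
                if o != s2:
                    continue
                if p == p2 == "has_part":
                    new.add((s, "has_part", o2))         # has_part is transitive
                if p == "exhibits" and p2 in CHAINED:
                    new.add((s, "exhibits", o2))         # property chains
        if new <= facts:
            return facts
        facts |= new

def some(facts, subject, prop, test):
    """True if `subject` has a `prop` edge to an individual passing `test`."""
    return any(test(o) for s, p, o in facts if s == subject and p == prop)

def is_test1(facts, x):
    # exhibits some (big and characteristic_of some nose)
    return some(facts, x, "exhibits", lambda q: TYPES.get(q) == "big"
                and some(facts, q, "characteristic_of", lambda n: TYPES.get(n) == "nose"))

def is_test3(facts, x):
    # exhibits some (nose and has_characteristic some big)
    return some(facts, x, "exhibits", lambda n: TYPES.get(n) == "nose"
                and some(facts, n, "has_characteristic", lambda q: TYPES.get(q) == "big"))

with_inverse = closure(FACTS)
without_inverse = closure(FACTS, use_inverse=False)
print(is_test1(with_inverse, "nixon"), is_test3(with_inverse, "nixon"))
print(is_test1(without_inverse, "nixon"), is_test3(without_inverse, "nixon"))
```

With the inverse switched off, nixon still exhibits his nose size, but there is no characteristic_of edge leading back to the nose, so the quality-centric expression no longer matches.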

Unfortunately, inverses put us outside EL++, so we can’t use the awesome Elk for classification.

Not-caring in ontologies is hard work!

What if we want to care even less, and formally have a “big nose phenotype” class classify either nixon, his nose, or the bigness that inheres in his nose? That’s the subject of the next post, together with some answers to the bigger question of “what is a phenotype”.