A common pain point for anyone working in bioinformatics is mapping identifiers. Many databases have overlapping content, or provide different data about the same entities, such as genes. Typically every database mints public identifiers in its own namespace. This means that the same gene may have multiple different identifiers in different databases (see for example some of the issues with SARS-CoV-2 protein identifiers). Anyone doing an analysis that combines data from different databases must do some kind of cross-walk or mapping.
Unfortunately, mapping is fraught with problems. Simply finding the required mappings can be a challenge, and for any given pair of databases there may be different mappings from different providers. The provider may be one of the source databases, or a third-party provider such as BridgeDb. The meaning of a mapping may not be clear: does a mapping denote equivalence, or at least a 1:1 relationship? This is particularly challenging when trying to build a knowledge graph from multiple sources, where we want to merge information about the same entity.
Mappings are a big deal for ontologies too. There is an entire field of ontology alignment/matching. In theory ontologies should be able to make the meaning of mappings explicit, yet somehow we have messed this up, providing multiple alternate ways to say the same thing (OWL logical expressions, SKOS, and classic loose bioinformatics ‘dbxrefs’).
Within the Open Bio Ontologies project we attempted to avoid the mapping issue by promoting reuse of ontologies and concepts from ontologies — including reusing identifiers/URIs. Reuse is a standard concept in software engineering, and I’ve written before about (re)using software concepts in ontology engineering. However, not all ontologies are in OBO, and not all ontologies in OBO are perfectly modular and non-overlapping, so mapping remains a necessary evil.
Overall there are a multitude of headaches and challenges associated with mappings. One tractable chunk we have tried to break off recently is to come up with a standard exchange format for mappings. Currently mappings are distributed in ad-hoc formats, and there is no standard way of providing metadata about the mappings (who provided them, when, how, what is their quality, etc).
SSSOM: A Simple Shared Standard for Ontology Mapping
We have recently come up with a proposed standard for mappings, which we have named SSSOM. A few initial comments about the name:
- It stands for Simple Shared Standard for Ontology Mapping. However, I believe it’s completely applicable to any named entity, whether modeled ontologically or not. You can use SSSOM for genes, proteins, people, and so on.
- Yes, that is 3 leading ‘S’s. I have a colleague who calls it ‘slytherin’.
Details can be found in the SSSOM repo. I’ll provide a brief summary here.
The primary manifestation of SSSOM is as a TSV (tab-separated values) file. However, the data model is independent of the serialization format: it can also be modeled in OWL, and any OWL serialization is possible (including JSON-LD).
SSSOM provides a standard way to describe a lot of rich information about mappings, but it allows you to be lazy and provide a minimum amount of information. The following example shows some mappings between the human phenotype ontology and the mouse phenotype ontology:
Note that in this case we are using SKOS as mapping predicates, but other predicates can be used.
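To make the row layout concrete, here is a minimal sketch of reading SSSOM-style rows with Python's standard csv module. The column names follow my reading of the SSSOM spec, and the identifiers and labels are hypothetical placeholders, not real HP/MP terms or actual mappings from the example file.

```python
import csv

# Hypothetical SSSOM-style rows. The IDs and labels are placeholders,
# not real HP/MP terms and not taken from any published mapping set.
SSSOM_TSV = (
    "subject_id\tsubject_label\tpredicate_id\tobject_id\tobject_label\n"
    "HP:9990001\texample phenotype A\tskos:exactMatch\tMP:9990001\texample phenotype A\n"
    "HP:9990002\texample phenotype B\tskos:closeMatch\tMP:9990002\texample phenotype B-like\n"
)


def read_mappings(text):
    """Parse tab-separated mapping rows, skipping '#'-prefixed header lines."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


mappings = read_mappings(SSSOM_TSV)
for m in mappings:
    # Each row reads as a triple: subject, mapping predicate, object.
    print(m["subject_id"], m["predicate_id"], m["object_id"])
```

Because the predicate is just another column, swapping SKOS for a different mapping predicate does not change how the file is read.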
Identifiers and mapping set metadata
SSSOM does require that all entities are written as CURIEs, with defined prefixes. The prefix expansions to URIs are written in the header in the same way you would for a format like RDF/Turtle.
For the above example, the source example TSV can be found here. Note the header:
#creator_id: "https://orcid.org/0000-0002-7356-1779"
#curie_map:
#  HP: "http://purl.obolibrary.org/obo/HP_"
#  MP: "http://purl.obolibrary.org/obo/MP_"
#  skos: "http://www.w3.org/2004/02/skos/core"
#license: "https://creativecommons.org/publicdomain/zero/1.0/"
#mapping_provider: "http://purl.obolibrary.org/obo/upheno.owl"
Each header line is prefixed with a hash mark (#), and the header content is in YAML format. The curie_map tag provides expansions of prefixes to base URIs. It is recommended that you use standard prefixes, such as the OBO prefixes (for ontologies) or the Biolink Model (which incorporates OBO, as well as other sources like identifiers.org via prefixcommons).
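As a rough illustration of how the header and the curie_map fit together, the sketch below strips the hash prefixes, reads the metadata, and expands a CURIE to a full URI. It is a hand-rolled reader that only handles the simple two-level layout shown above (a real tool would hand the un-commented text to a YAML library); the function names and the local ID HP:9990001 are my own placeholders.

```python
# Header in the commented-YAML layout described above (HP and MP prefixes only).
HEADER = """\
#creator_id: "https://orcid.org/0000-0002-7356-1779"
#curie_map:
#  HP: "http://purl.obolibrary.org/obo/HP_"
#  MP: "http://purl.obolibrary.org/obo/MP_"
#license: "https://creativecommons.org/publicdomain/zero/1.0/"
"""


def parse_header(text):
    """Parse '#'-prefixed header lines into a dict.

    Assumption: only flat `key: "value"` lines plus one nested level,
    with nested entries indented after the '#'. This is not a YAML parser.
    """
    meta = {}
    current = None  # name of the nested block being filled, if any
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        body = line[1:]
        if not body.strip():
            continue
        key, _, value = body.strip().partition(":")
        value = value.strip().strip('"')
        nested = body[0] in " \t"  # indentation after '#' marks a nested entry
        if nested and current is not None:
            meta[current][key] = value
        elif value == "":
            current = key  # a bare `key:` opens a nested block
            meta[key] = {}
        else:
            current = None
            meta[key] = value
    return meta


def expand_curie(curie, curie_map):
    """Expand PREFIX:LOCAL_ID to a full URI using the declared prefixes."""
    prefix, _, local_id = curie.partition(":")
    if prefix not in curie_map:
        raise KeyError(f"undeclared prefix: {prefix}")
    return curie_map[prefix] + local_id


meta = parse_header(HEADER)
uri = expand_curie("HP:9990001", meta["curie_map"])  # hypothetical local ID
print(uri)
```

Raising on an undeclared prefix, rather than passing the CURIE through unchanged, reflects the requirement that every prefix used in the file is defined in the header.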
The header also allows for many other pieces of metadata about the mapping set. Inclusion of an explicit license is encouraged – and I would recommend CC-0 for all mappings.
The complete set of elements that can be used to describe a mapping can be found in the relevant section of the spec. Some of these can be used for individual mappings, some can be applied in the header if they apply to all.
Some elements to call out:
- Each mapping can have an associated confidence score, between zero and one. This can be useful for probabilistic OWL-based ontology merging approaches, e.g. LogMap and kBOOM/Boomer.
- Mappings and mapping sets can have provenance, e.g. the ORCID of the mapping_provider, as well as the date of the mapping.
- Information about how the match was made, such as the mapping tool and the type of match (e.g. automated lexical vs curated). We have developed a controlled vocabulary for this. For lexical matches, you can also indicate what property was matched (e.g. a match may be based on a shared oboInOwl:hasDbXref, a shared rdfs:label, or a shared label/synonym pair).
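One simple use of the confidence score is thresholding before a merge. The sketch below assumes a per-mapping column named confidence and uses hypothetical placeholder identifiers; how to treat rows with no score is a design choice, and here they are dropped rather than assumed to be certain.

```python
import csv

# Hypothetical rows; the last one deliberately has no confidence value.
ROWS = (
    "subject_id\tpredicate_id\tobject_id\tconfidence\n"
    "HP:9990001\tskos:exactMatch\tMP:9990001\t0.95\n"
    "HP:9990002\tskos:closeMatch\tMP:9990002\t0.60\n"
    "HP:9990003\tskos:exactMatch\tMP:9990003\t\n"
)


def filter_by_confidence(rows, threshold):
    """Keep rows whose confidence is present and at least `threshold`."""
    kept = []
    for row in rows:
        raw = (row.get("confidence") or "").strip()
        if not raw:
            continue  # no score given: treated as unknown and dropped
        score = float(raw)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"confidence out of range [0, 1]: {score}")
        if score >= threshold:
            kept.append(row)
    return kept


rows = list(csv.DictReader(ROWS.splitlines(), delimiter="\t"))
confident = filter_by_confidence(rows, 0.8)
for row in confident:
    print(row["subject_id"], row["object_id"], row["confidence"])
```

A probabilistic merger would use the scores directly instead of a hard cut-off; the threshold here is only the simplest possible consumer of the field.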
See the docs on OWL.
Example of use outside ontologies
The metadata_converter project is intended to provide schema mappings between metadata schemes. The current focus is on standards used in environmental ‘omics’, of interest to the NMDC, such as MIxS, NEON, DarwinCore, and SESAR/IGSN.
The mappings between schema elements in SSSOM format can be found here.
SSSOM is a new standard, and may change based on community feedback, so there is not much tooling yet.
We have an early version of a Python toolkit for working with SSSOM files:
Additionally, rdf_matcher generates SSSOM files (more on rdf_matcher in a future post). Boomer will be adapted to take SSSOM as an input for probabilistic axioms.
We welcome comments, criticism, questions, and requests for new metadata elements on our tracker.
For the current version of SSSOM we are indebted to the following people who crafted the spec:
- Ernesto Jimenez-Ruiz (City, University of London)
- John Graybeal (Stanford)
- William Duncan (LBL)
- David Osumi-Sutherland (EMBL-EBI)
- Simon Jupp (SciBite)
- James McLaughlin (EMBL-EBI)
- Henriette Harmse (EMBL-EBI)
The person responsible for the vast majority of the work on SSSOM is Nicolas Matentzoglu, who crafted the spec, wrote the metadata ontology, served as community liaison, and coordinated feedback.