Using Wikidata for crowdsourced language translations of ontologies

In the OBO world, most of our ontologies are developed by and for English-speaking audiences. We would love to have translations of labels, synonyms, definitions, and so on in other languages. However, we lack the resources to do this (an exception is the HPO, which includes high-quality curated translations for many languages).

Wikipedia/Wikidata is an excellent source of crowdsourced language translations. Wikidata provides language-neutral concept IDs that link multiple language-specific Wikipedia pages. Wikidata also includes mappings to ontology class IDs, and provides a SPARQL endpoint. All this can be leveraged for a first pass at language translations.

For example, the Wikidata entity for badlands is mapped to the equivalent ENVO class PURL. This entity in Wikidata also has multiple rdfs:label annotations (maximum one per language).

We can query Wikidata for all rdfs:label translations for all classes in ENVO. I will use the sparqlprog_wikidata framework to demonstrate this:

pq-wikidata 'envo_id(C,P),label(C,N),Lang is lang(N)'

This compiles down to the following SPARQL which is then executed against the Wikidata endpoint:

SELECT ?c ?p ?n ?lang WHERE {?c <http://www.wikidata.org/prop/direct/P3859> ?p . ?c <http://www.w3.org/2000/01/rdf-schema#label> ?n . BIND( LANG(?n) AS ?lang )}
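The same query can be issued without the sparqlprog framework. The following is a minimal Python sketch that builds this SPARQL query for any Wikidata external-ID property (P3859 is the "ENVO ID" property used above); the helper function names are my own, not part of any library:

```python
# Sketch: build the SPARQL query shown above, parameterized by the
# Wikidata external-ID property (P3859 = ENVO ID). The endpoint URL is
# the standard Wikidata Query Service; the function names are hypothetical.

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_label_query(xref_property: str) -> str:
    """Return a SPARQL query for all labels (with language tags) of
    Wikidata entities that carry the given external-ID property."""
    return (
        "SELECT ?c ?p ?n ?lang WHERE { "
        f"?c <http://www.wikidata.org/prop/direct/{xref_property}> ?p . "
        "?c <http://www.w3.org/2000/01/rdf-schema#label> ?n . "
        "BIND( LANG(?n) AS ?lang ) }"
    )

query = build_label_query("P3859")  # P3859 = "ENVO ID" on Wikidata
print(query)
```

The query string can then be POSTed to the endpoint with `format=json` to get machine-readable bindings.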

The results look like this:

wd:Q272921,00000127,badlands,en
wd:Q272921,00000127,Badlandoj,eo
wd:Q272921,00000127,Tierras baldías,es
wd:Q272921,00000127,Badland,et
wd:Q272921,00000127,بدبوم,fa
wd:Q272921,00000127,Badlands,fr
wd:Q272921,00000127,בתרונות,he
wd:Q272921,00000127,Badland,hu
wd:Q272921,00000127,Բեդլենդ,hy
wd:Q272921,00000127,Calanco,it
wd:Q272921,00000127,悪地,ja

Somewhat disappointingly, there are relatively few translations for ENVO. But this is because the Wikidata property for mapping to ENVO is relatively new. We actually have a large number of outstanding new Wikidata to ENVO mappings we need to upload. Once this is done the coverage will increase.

Of course, different ontologies will differ in how their coverage maps to Wikidata. In some cases, an ontology will include many more concepts than Wikidata does; in others, the corresponding Wikidata entities will have few or no non-English labels. But these gaps will likely shrink over time.

There may be other ways to increase coverage. Many ontology classes are compositional in nature, so a combination of language translations of base classes plus language specific encodings of grammatical patterns could yield many more. The natural place to add these would be in the manually curated .yaml files used to specify ontology design patterns, through frameworks like DOSDP. And of course, there is a lot of research in Deep Learning methods for language translation. A combination of these methods could yield high coverage with hopefully good accuracy.
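To make the compositional idea concrete, here is a hypothetical sketch: given crowdsourced translations of a base class label plus a per-language grammatical template, compose a translated label for the derived class. The templates and translations below are purely illustrative, not curated DOSDP data:

```python
# Hypothetical sketch of composing translated labels for a
# DOSDP-style pattern. All strings below are illustrative assumptions,
# not curated translations.

# Per-language grammatical templates for an "X soil" pattern.
templates = {
    "en": "{base} soil",
    "es": "suelo de {base}",
    "fr": "sol de {base}",
}

# Crowdsourced translations of the base class label (illustrative values).
base_labels = {"en": "clay", "es": "arcilla", "fr": "argile"}

def compose(lang: str) -> str:
    """Fill the language-specific template with the translated base label."""
    return templates[lang].format(base=base_labels[lang])

print(compose("es"))  # → suelo de arcilla
```

Real patterns would need per-language handling of agreement, word order, and inflection, which is exactly the kind of information that could live alongside the pattern in its curated .yaml file.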

As far as I am aware, these methods have not been formally evaluated. Doing an evaluation will be challenging as it will require high-quality gold standards. Ontology developers spend a lot of time coming up with the best primary label for classes, balancing ontological correctness, elimination of ambiguity, understanding of usage of terms by domain specialists, and (for many ontologies, but not all) avoiding overly abstruse labels. Manually curated efforts such as the HPO translations would be an excellent start.

About Chris Mungall
Computer Research Scientist at Berkeley Lab. Interests: AI / Ontologies / Bioinformatics. Projects: GO, Monarch, Alliance, OBOFoundry, NMDC
