SPARQLProg at BioHackathon 2019

I’m at the 2019 BioHackathon in Fukuoka. This is my first BioHackathon, and I am loving it so far!

We have organized ourselves into different hacking groups, with a lot of interaction between them. There is a lot of cool stuff going on in cutting-edge areas such as genome graphs and Markov logic networks. I’m getting FOMO wishing I could be part of all of the different groups. The BioHackathons have traditionally had a strong focus on semantic web technologies, and there are a number of fantastic SPARQL endpoints here in Japan. My own group coalesced around the general idea of applying logic programming approaches to bioinformatics problems. The group includes Will Byrd (of miniKanren and The Reasoned Schemer fame), Pjotr Prins, Deepak Unni, Hirokazu Chiba, and Shuichi Kawashima.

During the symposium, I presented on the sparqlprog framework. The slides are here:

The basic idea is to use logic programming as an over-arching framework, encompassing RDF and SPARQL, but allowing for additional expressivity and power.

One of the basic ideas here is to allow you to write complex queries using meaningful n-ary predicates. For example, if we want to query for all human genes in a particular range on a particular chromosome, and get the mouse orthologs, then we should be able to write this in as high-level a way as possible, for example like this:

feature_in_range(grch38:'X',10000000,20000000, HumanGene),
has_mouse_ortholog(HumanGene, MouseGene)

“feature_in_range” and “has_mouse_ortholog” are logic predicates. Unlike RDF predicates, which are always binary, logic programming predicates can have any number of arguments (which is why the prefix notation above is used, rather than infix, which only works for binary predicates). Names beginning with a capital letter, such as HumanGene and MouseGene, are variables. This query is then translated to SPARQL, which is significantly more verbose:

SELECT ?g ?h WHERE {
  ?g sio:000558 ?h .
  ?h obo:RO_0002162 taxon:10090 .
  ?g a obo:SO_0001217 ;
     faldo:location [
       faldo:begin [
         faldo:position ?b ;
         faldo:reference homo_sapiens/GRCh38/X ] ;
       faldo:end [
         faldo:position ?e ;
         faldo:reference homo_sapiens/GRCh38/X ] ] .
  FILTER (?b >= 10000000) .
  FILTER (?e <= 20000000)
}

The two predicates in the query are defined using simple rules in a logic program. A rule consists of a ‘head’ predicate followed by the implication operator ‘:-’ and a ‘body’ which specifies a list of conditions.

feature_in_range(Ref,MinBegin,MaxEnd,Feat) :-
    location(Feat,Begin,End,Ref),
    Begin >= MinBegin,
    End =< MaxEnd.

location(Feat,Begin,End,Ref) :-
    location(Feat,Loc),
    begin(Loc,BeginPos),
    position(BeginPos,Begin),reference(BeginPos,Ref),
    end(Loc,EndPos),
    position(EndPos,End),reference(EndPos,Ref).

Queries using defined predicates are recursively rewritten until we bottom out in binary RDF predicates or builtin functions.

This is nice as it adds composability to SPARQL, and frees the query author from repeating common patterns across multiple queries.
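To make the rewriting step concrete, here is a rough Python sketch of unfolding a goal until only binary predicates and comparisons remain. It is only an illustration and not the actual sparqlprog code: the tuple encoding of goals, the RULES table, and the helper names is_var and unfold are all made up for this example, and it only handles rule heads whose arguments are distinct variables.

```python
import itertools

# Toy rule base mirroring the two rules above (hypothetical encoding).
# Each entry maps (predicate name, arity) to (head arguments, body goals).
RULES = {
    ("feature_in_range", 4): (
        ("Ref", "MinBegin", "MaxEnd", "Feat"),
        [("location", "Feat", "Begin", "End", "Ref"),
         (">=", "Begin", "MinBegin"),
         ("=<", "End", "MaxEnd")],
    ),
    ("location", 4): (
        ("Feat", "Begin", "End", "Ref"),
        [("location", "Feat", "Loc"),
         ("begin", "Loc", "BeginPos"),
         ("position", "BeginPos", "Begin"),
         ("reference", "BeginPos", "Ref"),
         ("end", "Loc", "EndPos"),
         ("position", "EndPos", "End"),
         ("reference", "EndPos", "Ref")],
    ),
}

# Source of suffixes used to give each rule application fresh variables.
_fresh = itertools.count(1)


def is_var(term):
    """Follow the Prolog convention: strings starting with a capital are variables."""
    return isinstance(term, str) and term[:1].isupper()


def unfold(goal):
    """Recursively replace a goal by its rule body; return a flat list of primitive goals."""
    name, args = goal[0], goal[1:]
    rule = RULES.get((name, len(args)))
    if rule is None:
        # No rule for this name/arity: treat it as an RDF predicate or builtin.
        return [goal]
    head_args, body = rule
    binding = dict(zip(head_args, args))  # head variable -> caller's argument
    suffix = next(_fresh)

    def subst(term):
        if not is_var(term):
            return term
        if term in binding:
            return binding[term]
        return f"{term}_{suffix}"  # body-only variable, renamed apart

    flat = []
    for sub in body:
        renamed = (sub[0],) + tuple(subst(a) for a in sub[1:])
        flat.extend(unfold(renamed))
    return flat
```

Calling unfold(("feature_in_range", "grch38:X", 10000000, 20000000, "HumanGene")) expands both rules and leaves nine goals, each with exactly two arguments, which is the shape a SPARQL basic graph pattern plus FILTERs needs. Note that location with four arguments is unfolded while location with two arguments is left alone, since only the former has a rule.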

The overall framework is also more powerful than this, as programs can be more expressive than SPARQL, for example by using recursion or backtracking. Parts of a program are executed in the local logic programming environment, and parts are executed remotely on a SPARQL endpoint.
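As a loose illustration of mixing local and remote work, the Python sketch below enumerates every cycle-free path between two nodes. The recursion and backtracking happen locally, while each one-hop lookup stands in for a call to a remote endpoint. Everything here is hypothetical: the EDGES data, the remote_neighbors placeholder (which just reads a dictionary rather than issuing SPARQL), and the paths function are invented for the example and say nothing about how sparqlprog itself splits the work.

```python
# Hypothetical graph standing in for data held on a remote endpoint.
EDGES = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def remote_neighbors(node):
    """Placeholder for one remote round trip returning direct successors of node."""
    return EDGES.get(node, [])


def paths(start, goal, neighbors=remote_neighbors, visited=()):
    """Yield every cycle-free path from start to goal, depth first.

    The generator backtracks locally: after one branch is exhausted it
    resumes with the next neighbor, much like a logic program retrying a goal.
    """
    trail = visited + (start,)
    if start == goal:
        yield list(trail)
        return
    for nxt in neighbors(start):
        if nxt not in trail:  # skip nodes already on this path
            yield from paths(nxt, goal, neighbors, trail)
```

With the toy data, list(paths("a", "d")) gives the two routes through b and through c. Returning the paths themselves, rather than just whether one exists, is the kind of result that is awkward to get from a single SPARQL query.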

[Figure: SparqlProg-BH-2019.png]

SPARQLProg execution environment. Clients can send queries and optionally a program (rules) to the SPARQLProg environment (using the pengines web logic protocol). Queries are by default compiled to SPARQL and executed remotely. Optionally, the program may seamlessly mix local and remote execution, with local execution allowing more expressivity.

SPARQLProg can be executed on the command line (examples here). It also runs as a service, and there is a docker container available, so all you need to do is:

docker run -p 9083:9083 cmungall/sparqlprog

An example of how to connect via Python can be found in some example Jupyter notebooks such as this one.
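For readers who want a feel for what a client does, here is a minimal Python sketch that builds a pengines "create" request with only the standard library. Treat the details as assumptions rather than documented sparqlprog behaviour: I am assuming the service started by the docker command above accepts the usual pengines JSON request at /pengine/create, with the fields format, ask, src_text and chunk. The function names build_pengine_request and send are made up for this example, and the notebooks linked above may use a dedicated client library instead.

```python
import json
import urllib.request


def build_pengine_request(base_url, ask, src_text=""):
    """Build (but do not send) a POST that creates a pengine and asks one query.

    Assumed endpoint path and field names follow the general pengines
    protocol; they are not taken from the sparqlprog documentation.
    """
    payload = {
        "format": "json",      # ask for JSON-encoded answers
        "ask": ask,            # the query, as Prolog text
        "src_text": src_text,  # optional program (rules) sent along with it
        "chunk": 100,          # maximum number of solutions per response
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/pengine/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON reply (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, something like send(build_pengine_request("http://localhost:9083", "has_mouse_ortholog(HumanGene, MouseGene)")) would be the starting point, though the exact query wrapper the service expects is worth checking against the example notebooks.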

Reactions at the BioHackathon seem to range from confusion to excitement. It’s fun to see people’s reactions when they ‘get it’. There seems to be a lot of enthusiasm from locals, with people contributing wrappers for KEGG and TogoVar, an integrated database of Japanese genomic variation.

Next up is a framework that will allow querying over specialized genome variant graph engines…

I am also working with Pier Luigi Buttigieg on ENVO. I recently developed a toolkit based on SPARQLProg for aligning an ontology to Wikidata. One of our goals is to upload GAZ (the OBO Gazetteer) into Wikidata, and align ENVO. This will allow us to extract ENVO classifications for all 600,000 entries in GAZ. The repo for this work can be found here.

More updates later, back to the hacking for now…
