With COVID-19 cases continuing to grow in number across the globe, scientists are forming new collaborations in order to better understand all aspects of SARS-CoV-2 together with its impact on human health. One aspect of this is organizing existing and emerging information about viral and host cell molecular biology, disease epidemiology, phenotypic progression, and effect of drugs and other treatments in individuals.
Knowledge Graphs (KGs) provide a way to organize complex heterogeneous information connecting different biological and clinical entities such as genes, drugs, diseases, exposures, phenotypes, and pathways.
For example, the following image shows a graphical (network) representation of SARS-CoV-2 proteins and host human proteins they are hypothesized to interact with, together with existing known human-human protein interactions, annotated with GO terms and drug target information:
Graphs such as this can be further extended with other information about the human and viral genes as it becomes available. Mechanisms such as endocytosis can also be included as nodes in the graph, as well as expression states of relevant human cells, etc. Existing ontologies like GO, HPO, Mondo, and CHEBI, together with their annotations can be conceived of as KGs.
These KGs can be used as data warehouses for querying data integrated in a single place. They can also be used as input to Machine Learning, for tasks such as link prediction: for example, predicting which compounds are likely to treat a particular disease, based on properties of both the compound and the disease.
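To make the idea of link prediction concrete, here is a minimal sketch in Python. It is not the embedding-based approach used in the project; it swaps in a much simpler baseline (Jaccard similarity of shared neighbors) purely for illustration. The graph, the node identifiers (drug:A, disease:X, and so on), and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy graph: drugs and a disease are each linked to proteins.
# None of these identifiers or edges come from the real KG.
EDGES = [
    ("drug:A", "protein:P1"),
    ("drug:A", "protein:P2"),
    ("drug:B", "protein:P3"),
    ("disease:X", "protein:P1"),
    ("disease:X", "protein:P2"),
    ("disease:X", "protein:P4"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = defaultdict(set)
    for subject, obj in edges:
        adj[subject].add(obj)
        adj[obj].add(subject)
    return adj


def jaccard(adj, a, b):
    """Jaccard similarity between the neighbor sets of nodes a and b."""
    union = adj[a] | adj[b]
    if not union:
        return 0.0
    return len(adj[a] & adj[b]) / len(union)


def rank_candidates(edges, sources, targets):
    """Score every source/target pair that is not already linked, best first."""
    adj = build_adjacency(edges)
    scored = [
        (jaccard(adj, s, t), s, t)
        for s, t in product(sources, targets)
        if t not in adj[s]
    ]
    return sorted(scored, reverse=True)


ranked = rank_candidates(EDGES, ["drug:A", "drug:B"], ["disease:X"])
for score, drug, disease in ranked:
    print(f"{drug} -> {disease}: {score:.2f}")
```

In this toy graph drug:A ranks above drug:B because it shares two protein neighbors with disease:X. Real link prediction over a KG would typically use learned node embeddings rather than a neighbor-overlap score, but the shape of the task (score unobserved edges, then rank them) is the same.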
The KG-COVID-19 Knowledge Graph Hub
As part of a collaboration between the Monarch Initiative, the Illuminating the Druggable Genome KG project, and PheKnowLator, we have been building a KG for COVID-19. All of the source code is on GitHub, in the Knowledge-Graph-Hub/kg-covid-19 repository.
The project is built around the concept of a KG “Hub”, a lightweight way to build a KG from multiple upstream sources. Any developer can follow the instructions to ingest a new source, and make a Pull Request on the repo. So far we have a number of different sources ingested (detailed in the yaml file), and more on the way. The output is a biolink-model compliant KG in a simple TSV format that is compatible with Property Graphs (e.g. Neo4j) as well as RDF graphs. In all cases we use CURIEs that are equivalent to standard URIs, such as OBO Class PURLs.
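As a rough illustration of what a TSV-based KG with CURIE identifiers looks like in practice, here is a small Python sketch that reads a nodes table and an edges table and expands CURIEs to full URIs. The column names (id, name, category, subject, predicate, object), the EX prefix, and the sample rows are assumptions made for this example and may not match the exact files the pipeline produces. The OBO PURL pattern for GO and CHEBI classes is the standard one.

```python
import csv
import io

OBO_BASE = "http://purl.obolibrary.org/obo/"

# Prefix map for CURIE expansion. "EX" is a made-up prefix for placeholder nodes.
PREFIX_MAP = {
    "GO": OBO_BASE + "GO_",
    "CHEBI": OBO_BASE + "CHEBI_",
    "EX": "http://example.org/",
}

# Illustrative rows only; the column names are an assumption for this sketch.
NODES_TSV = (
    "id\tname\tcategory\n"
    "GO:0006897\tendocytosis\tbiolink:BiologicalProcess\n"
    "EX:protein1\texample viral protein\tbiolink:Protein\n"
    "EX:protein2\texample host protein\tbiolink:Protein\n"
)
EDGES_TSV = (
    "subject\tpredicate\tobject\n"
    "EX:protein1\tbiolink:interacts_with\tEX:protein2\n"
    "EX:protein2\tbiolink:participates_in\tGO:0006897\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand a CURIE such as GO:0006897 into a full URI."""
    prefix, _, local_id = curie.partition(":")
    if not local_id or prefix not in prefix_map:
        raise ValueError(f"Cannot expand CURIE: {curie!r}")
    return prefix_map[prefix] + local_id


def to_triples(edges):
    """Turn edge rows into (subject URI, predicate, object URI) triples.

    Predicates are left as CURIEs here to keep the sketch short.
    """
    return [
        (expand_curie(e["subject"]), e["predicate"], expand_curie(e["object"]))
        for e in edges
    ]


nodes = read_tsv(NODES_TSV)
edges = read_tsv(EDGES_TSV)
for subject, predicate, obj in to_triples(edges):
    print(subject, predicate, obj)
```

Because each row is just identifiers plus properties, the same tables can be bulk-loaded into a property graph database or converted to RDF triples, which is the appeal of keeping the exchange format this plain.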
One of the goals is to use this alongside our N2V framework to discover new links (for example, identifying existing drugs that could be repurposed to treat COVID-19) and generate actionable knowledge.
Knowledge Graphs at the Virtual Biohackathon
The COVID-19 Biohackathon is a virtual event starting today (April 5, 2020), lasting for a week, with the goal to “create a cohesive effort and work on tooling for COVID-19 analysis. The biohackathon will lead to more readily accessible data, protocols, detection kits, protein predictions etc.” The Biohackathon was spearheaded by many of the same people behind the yearly Biohackathon which I have previously reported on.
One of the subgroups at the hackathon is the KnowledgeGraph group. This includes the kg-covid-19 contributors and other luminaries from the life sciences linked data / KG world, including neXtProt, UniProt, KnetMiner, Monarch, HPO, IDG-KG, GO.
I’m excited to see all these people working together as part of a dynamic group to produce tools that aim to help elucidate some of the biology underlying this critical threat. Of course, this is just one very small part of a massive global effort (really what we need to tackle COVID-19 is better public health infrastructure, widespread testing, ventilators, PPE for medical staff and workers on the front line, etc.; see How the Pandemic Will End by Ed Yong). But I also think that this is an opportunity for collaborating on some of the crucial knowledge-based tools that have wide applications in biomedicine.
If you want to know more, the details of the biohackathon can be found on its GitHub page, and the kg-covid-19 repository can be found here, with contributor guidelines here.
5 thoughts on “Building a COVID-19 Knowledge Graph”
Thanks Chris – nice article. We are fairly new to the human and viral biology field. We are hoping to use the biohackathon as an opportunity to connect to like-minded people and understand how the KnetMiner tools can be adapted to work with the KGs you guys are building. It would be fantastic if you could point us to the RDF/Neo4j endpoints.
Cool, will be great to try the KnetMiner tools.
Right now we don’t have an endpoint set up; you need to run the ingest code yourself. This is not ideal, so we are in the process of setting up a Jenkins instance that will regularly run the pipeline and dump various products (combined RDF dump, Neo4j graph dump; maybe even deploy to a SPARQL endpoint). See https://github.com/Knowledge-Graph-Hub/kg-covid-19/issues/30