There is a lot we still have to learn about SARS-CoV-2 and the disease it causes in humans. One aspect of the virus that we do know a lot about is its underlying molecular blueprint. We have the core viral genome, and broadly speaking we know the ‘parts list’ of proteins that are translated and spliced from the genome. There is a lot that we still don’t know about the proteins themselves – how variations affect the ability of the virus to infect a host, which molecules bind to these proteins and how that binding impacts their function. But at least we know the basic parts list. Or we think we do. There is the Spike protein (S), which adorns the surface of this virus like a crown, hence the name ‘coronavirus’. There are the 16 ‘non-structural proteins’ formed by cleavage of a viral polyprotein; one such protein is nsp5 which functions as a protease that performs this same cleavage. And there are the accessory proteins, such as the mysterious ORF8. The genomic blueprint and the translated and cleaved products can be illustrated visually:
Each of these proteins has a variety of different names, for example, nsp3 is also known as PLpro. The genome is small enough that most scientists working on it have memorized the core aliases such that human-to-human communication is relatively unproblematic.
Of course, as we all know, relying on gene and protein symbols for unique identification in a database for machine-machine communication is a recipe for disaster. Symbols are inherently ambiguous, so we assign identifiers to entities in order to disambiguate them. These identifiers can then be adorned with metadata such as symbols, names, aliases, descriptions, functional descriptions and so on.
As everyone working in bioinformatics knows, different databases assign different identifiers for the same entity (by varying definitions of ‘same’), creating the ubiquitous identifier mapping problem and a cottage industry in mapping solutions.
This is a perennial problem for all omics entities such as genes and proteins, regardless of the organism or system being studied. But when it comes to SARS-CoV-2, things are considerably worse.
It turns out that many problems arise from the relatively simple biological phenomenon of cleavage of viral polyproteins. While the molecular biology is not so difficult (one parent protein as a source for many derivative proteins), many bioinformatics databases are not designed with this phenomenon in mind. This is fine for scenarios where we can afford to gloss over differences between the immediate products of translation and downstream cleavage products. While cleavage certainly happens in the human genome (e.g. POMC), it’s rare enough to effectively ignore in some contexts (although arguably this causes a lot of problems too). However, the phenomenon of a single translation product producing multiple functionally discrete units is much more common in viruses, which creates issues for many databases when creating a useful ‘canonical parts list’.
The roll-up problem
The first problem is that many databases either ignore the cleavage products or don’t assign them identifiers in the same category as other proteins. This has the effect of ‘rolling up’ all data to the polyprotein. This undercounts the number of true proteins, and does not provision identifiers for distinct functional entities.
For example, NCBI Gene does a fantastic job of assembling the genetic parts lists for genes across all cellular organisms and viruses. Most of the time, the gene is an appropriate unit of analysis, and we can use gene identifiers as proxies for the product transcribed and translated from that gene. In the case of SARS-CoV-2, NCBI mints a gene ID for the polyprotein (e.g. 1ab), but lacks distinct gene IDs for individual cleavage products, even though each arguably fulfills the definition of a discrete gene, and each is a discrete non-overlapping unit with a distinct function. Referring to the figure above, nsp1-10 are all ‘rolled up’ into the 1ab or 1a polyprotein entity.
Now this is perhaps understandable given that the NCBI Gene database is centered on genes (they do provide distinct protein IDs for the products, see later), and the case can be made that we should only have gene IDs for the immediate protein products (e.g. polyproteins, structural proteins and accessory ORFs).
But the roll-up problem also exists for dedicated protein databases such as UniProt. UniProt mint IDs for polyproteins such as 1ab, but there is no UniProt accession for nsp1-16. These are ‘rolled up’ into the 1ab entry, as shown in the screenshot:
However, UniProt do provide identifiers for the various cleavage products. These are called ‘chain IDs’, and are of the form PRO_nnnnn. For example, an identifier for the nsp3 product is PRO_0000449621. Due to the structure of these IDs they are sometimes called ‘PRO IDs’ (however, they should not be confused with IDs from the Protein Ontology, which are also called ‘PRO IDs’. Confusing, huh?).
Unfortunately these chain IDs are not quite first-class citizens in the protein database world. For example, the fantastic InterProScan pipeline is executed on the polyproteins, not on the individual chains. This means that domain and GO function calls are made at the level of the polyprotein, so it looks to a machine like there is one super-multifunctional protein that acts as a protease, binds ADP-ribose, induces autophagosomes, etc. In one sense this is sort of true, but I don’t think it’s a very useful way of looking at protein function. It is more meaningful to assign the functions at the level of the individual cleavage products. It is possible to propagate the InterProScan-assigned annotations down to the NSPs using the supplied coordinates, but it should not fall on consumers to do this extra processing step.
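To give a feel for what that extra consumer-side step involves, here is a minimal Python sketch that assigns polyprotein-level domain hits to the cleavage product that fully contains them, and rebases the coordinates onto that product. The `Chain` and `Hit` types, the function names, and every ID and coordinate below are made up for illustration; real chain boundaries would come from the UniProt entry and real hits from InterProScan output.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Chain(NamedTuple):
    chain_id: str  # placeholder ID, not a real PRO_ chain ID
    start: int     # 1-based, inclusive position on the polyprotein
    end: int


class Hit(NamedTuple):
    signature: str  # placeholder domain signature
    start: int      # 1-based, inclusive position on the polyprotein
    end: int


def find_chain(hit: Hit, chains: List[Chain]) -> Optional[Chain]:
    """Return the chain that fully contains the hit, or None if no chain does."""
    for chain in chains:
        if chain.start <= hit.start and hit.end <= chain.end:
            return chain
    return None


def propagate(hits: List[Hit], chains: List[Chain]) -> Dict[str, List[Tuple[str, int, int]]]:
    """Group hits by containing chain, with coordinates relative to that chain."""
    result: Dict[str, List[Tuple[str, int, int]]] = {}
    for hit in hits:
        chain = find_chain(hit, chains)
        if chain is None:
            # Hit spans a cleavage site: keep polyprotein coordinates for manual review.
            result.setdefault("UNASSIGNED", []).append((hit.signature, hit.start, hit.end))
            continue
        offset = chain.start - 1
        result.setdefault(chain.chain_id, []).append(
            (hit.signature, hit.start - offset, hit.end - offset)
        )
    return result


# Illustrative data only.
CHAINS = [Chain("chain_A", 1, 100), Chain("chain_B", 101, 300)]
HITS = [Hit("domain_X", 20, 80), Hit("domain_Y", 150, 250), Hit("domain_Z", 90, 120)]

for chain_id, assigned in propagate(HITS, CHAINS).items():
    print(chain_id, assigned)
```

Hits that straddle a cleavage site are deliberately left unassigned rather than guessed at; a different policy (for example, assigning by majority overlap) would be an equally reasonable choice.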
The not-quite-first-class status of these identifiers also causes additional issues. For example, there are different ways to write the same ID (P0DTD1-PRO_0000449621 vs P0DTD1:PRO_0000449621 vs P0DTD1#PRO_0000449621 vs simply PRO_0000449621), and there is no standard URL (although UniProt is working on these issues).
The identical twin identifier problem
An additional major problem is the existence of two distinct identifiers for each of the initial non-structural proteins. Of course, we live with multiple identifiers in bioinformatics all the time, but we generally expect a single database to assign a single identifier for a single entity. Not so!
The problem here is that there is a ribosomal frameshift in the translation of the polyprotein in SARS-CoV-2 (again, the biology here is fairly basic), which necessitates two distinct database entries, one for each polyprotein (1ab, aka P0DTD1, and 1a, aka P0DTC1). So far so good. However, while these are truly distinct polyproteins, the non-structural proteins cleaved from them are identical up until the frameshift. Because of an assumption in databases that each cleavage product must have exactly one ‘parent’, the IDs for these shared products are duplicated. This is shown in the following diagram:
While on the surface this may seem like a trivial problem with some easy workarounds, in fact this representation breaks a number of things. First, it artificially inflates the proteome, making it seem there are more proteins than there actually are. A parts list is less useful if it has to be post-processed in ad-hoc ways to get the ‘true’ parts list.
Second, it makes it harder to promote the use of standard database identifiers over protein symbols, because an arbitrary decision must be made about which twin to use, and if I make a different arbitrary decision from you, then our data does not automatically integrate. Ironically, using standard protein symbols like ‘nsp3’ may actually be better for database integration than identifiers designed for that purpose!
And when curating something like a protein interaction database, a pathway database, or an orthology database, or assembling a COVID Knowledge Graph that deals with pairwise interactions, we must either choose arbitrarily or fully populate the cross-product of all pair combos. E.g. if nsp3 in SARS is orthologous to nsp3 in SARS-CoV-2, then we have to make four statements instead of one (two twin IDs on each side).
While I focused on UniProt IDs here, other major resources such as NCBI also have these duplicates in their protein database for the same reason.
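The arithmetic behind the cross-product can be made concrete with a few lines of Python. This is only a sketch: the two SARS-CoV-2 chain IDs are the nsp3 twins discussed in this post, while the SARS IDs and the `orthologous_to` predicate name are placeholders I made up for the example.

```python
from itertools import product

# The two sequence-identical nsp3 chain IDs in SARS-CoV-2.
SARS2_NSP3 = ["P0DTC1-PRO_0000449637", "P0DTD1-PRO_0000449621"]
# Placeholder IDs standing in for the corresponding twins in SARS (not real IDs).
SARS1_NSP3 = ["EXAMPLE:sars1_nsp3_from_1a", "EXAMPLE:sars1_nsp3_from_1ab"]


def pairwise_statements(subjects, predicate, objects):
    """Return one (subject, predicate, object) triple per pair of IDs."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]


# With twin IDs on both sides, one biological fact becomes 2 x 2 statements.
all_twins = pairwise_statements(SARS2_NSP3, "orthologous_to", SARS1_NSP3)
# With a single ID per protein, the same fact is one statement.
one_id_each = pairwise_statements(SARS2_NSP3[:1], "orthologous_to", SARS1_NSP3[:1])

print(len(all_twins), len(one_id_each))
```

The same multiplication applies to every pairwise relation in the graph, so the number of redundant statements grows with every interaction or orthology assertion that touches a twinned protein.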
Kudos to Wikidata and the Protein Ontology
Two resources I have seen that get this right are the Protein Ontology and Wikidata.
The Protein Ontology (aka PR, sometimes known as PRO; NOT to be confused with ‘PRO’ entries in UniProt) includes a single first-class identifier/PURL for each nsp, for example nsp3 has CURIE PR:000050272 (http://purl.obolibrary.org/obo/PR_000050272). It has mappings to each of the two sequence-identical PRO chain IDs in UniProt. It also has distinct entries for the parent polyprotein, and it has meaningful ontologically encoded edges linking the two (the SARS-CoV-2 protein ontology is available from https://proconsortium.org/download/development/pro_sars2.obo).
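The CURIE and the PURL above are two spellings of the same identifier, related by the usual OBO pattern of replacing the colon with an underscore and prepending the OBO base URL. A small sketch of the round trip follows; the function names are my own and not part of any library.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_curie_to_purl(curie: str) -> str:
    """Expand an OBO-style CURIE such as 'PR:000050272' to its PURL."""
    prefix, sep, local_id = curie.partition(":")
    if not (prefix and sep and local_id):
        raise ValueError(f"Not a CURIE: {curie!r}")
    return f"{OBO_BASE}{prefix}_{local_id}"


def obo_purl_to_curie(purl: str) -> str:
    """Contract an OBO PURL back to a CURIE."""
    if not purl.startswith(OBO_BASE):
        raise ValueError(f"Not an OBO PURL: {purl!r}")
    prefix, sep, local_id = purl[len(OBO_BASE):].partition("_")
    if not (prefix and sep and local_id):
        raise ValueError(f"Unexpected PURL shape: {purl!r}")
    return f"{prefix}:{local_id}"


print(obo_curie_to_purl("PR:000050272"))
```

Because the mapping is purely syntactic, no lookup table is needed, which is part of what makes these IDs feel first-class.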
Wikidata also does a good job of providing a single canonical identifier that is 1:1 with distinct proteins encoded by the SARS-CoV-2 genome (for example, the entry for nsp3 https://www.wikidata.org/wiki/Q87917581). However, it is not as complete. Sadly it does not have mappings to either the protein ontology or the UniProt PRO chain IDs (remember: these are different!).
The big disadvantage of Wikidata and the Protein Ontology compared with the major sequence databases is that they are not the major sequence databases. They suffer a curation lag (one employing crowdsourcing, the other manual curation), whereas the main protein databases automate more, albeit at the expense of quirks such as non-first-class IDs and duplicate IDs. Depending on the use case, this may not be a problem. Due to the importance of the SARS-CoV-2 proteome, sufficient resources could be marshalled for this effort. But will this scale if we want unique non-duplicate IDs for all proteins in all coronaviruses, including all the different ones infecting bats and other hosts?
A compromise solution
When building KG-COVID-19 we needed to decide which IDs to use as canonical for SARS-CoV-2 genes and proteins. While our system is capable of working with alternate IDs (either normalizing during the KG build stage, or post build as part of a clique-merge step), it is best to avoid these. Mapping IDs can lead to either unintentional roll-ups (information about the cleavage product propagating up to the polyprotein) or worse, fanning-out (rolled up information then spreading to ‘sibling’ proteins); or if 1:1 is enforced the overall system is fragile.
We liked the curation work done by the Protein Ontology, but we knew that (1) we needed a system from which we could instantly get IDs for proteins in any other viral genome, and (2) we wanted to be aligned with the sources we were ingesting, such as the IntAct curation of the Gordon et al. paper, and the Reactome plus GO-CAM curation of viral-host pathways. This necessitated the choice of a major database.
Working with the very helpful UniProt team in concert with IntAct curators we were able to ascertain that of the duplicate entries, by convention we should take the ID that comes from the longer polyprotein as the ‘reference’. For example, nsp3 has the following chain IDs:
- P0DTC1-PRO_0000449637 (from the shorter pp: 1a) [NO]
- P0DTD1-PRO_0000449621 (from the longer pp: 1ab) [YES]
(Remember, these are sequence-identical and as far as we know functionally identical).
In this case, we take PRO_0000449621 as the canonical/reference entry. This is also the entry IntAct use to curate interactions. We pretend that PRO_0000449637 does not exist.
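The ‘longer polyprotein wins’ convention is simple enough to sketch in Python. Two caveats: grouping twins by protein name is a simplification I am assuming here (grouping by sequence identity would be more principled), and the polyprotein lengths are values I have supplied for the example that should be checked against the UniProt entries before being relied on. The function name and record layout are likewise hypothetical.

```python
from collections import defaultdict

# Polyprotein lengths in amino acids: assumed values for illustration,
# to be verified against the UniProt entries.
PARENT_LENGTH = {"P0DTD1": 7096, "P0DTC1": 4405}

# The two nsp3 twins discussed above.
CHAINS = [
    {"name": "nsp3", "parent": "P0DTC1", "chain_id": "PRO_0000449637"},
    {"name": "nsp3", "parent": "P0DTD1", "chain_id": "PRO_0000449621"},
]


def pick_reference(chains, parent_length):
    """For each protein name, keep the chain whose parent polyprotein is longest."""
    groups = defaultdict(list)
    for chain in chains:
        groups[chain["name"]].append(chain)
    return {
        name: max(members, key=lambda c: parent_length[c["parent"]])
        for name, members in groups.items()
    }


reference = pick_reference(CHAINS, PARENT_LENGTH)
print(reference["nsp3"]["parent"], reference["nsp3"]["chain_id"])
```

A product that exists only in the longer polyprotein has a single member in its group, so it is selected trivially and needs no special casing.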
This is very far from perfect. Biologically speaking, it’s actually the shorter pp that is more commonly expressed, so the choice of the longer one is potentially confusing. There is also the question of how UniProt should propagate annotations. It is valid to propagate from one chain ID to its ‘identical twin’. But what about when these annotations reference other cleavage products (e.g. a pairwise functional annotation from a GO-CAM, or an ortholog)? Do we populate the cross-product? This could get confusing (my interest in this was both from the point of view of our COVID KG and from wearing my GO hat).
Nevertheless this was the best compromise we could find, and we decided to follow this convention.
Some of the decisions are recorded in this presentation
Working with the UniProt and IntAct teams we also came up with a standard way to write IDs and PURLs for the chain IDs (CURIEs are of the form UniProtKB:ACCESSION-PRO_NNNNNNN). While this is not the most thrilling or groundbreaking development in the battle against coronaviruses, it excites me as it means we have to do far less time-consuming and error-prone identifier post-processing just to make data link up.
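As an illustration of the kind of post-processing this convention replaces, here is a minimal sketch that normalizes the variant spellings mentioned earlier into the agreed CURIE form. The function name, the regular expression, and the `default_accession` parameter are my own assumptions for the example, not part of any UniProt tooling.

```python
import re
from typing import Optional

# Accepts ACCESSION-PRO_n, ACCESSION:PRO_n, ACCESSION#PRO_n (optionally already
# prefixed with 'UniProtKB:'), or a bare PRO_n chain ID.
_CHAIN_RE = re.compile(
    r"^(?:(?:UniProtKB:)?(?P<acc>[A-Z0-9]+)[-:#])?(?P<chain>PRO_\d+)$"
)


def normalize_chain_id(raw: str, default_accession: Optional[str] = None) -> str:
    """Rewrite a chain ID in any of the common spellings as UniProtKB:ACC-PRO_n."""
    match = _CHAIN_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"Unrecognized chain ID: {raw!r}")
    accession = match.group("acc") or default_accession
    if accession is None:
        # A bare PRO_ ID does not say which polyprotein entry it belongs to.
        raise ValueError(f"No parent accession known for {raw!r}")
    return f"UniProtKB:{accession}-{match.group('chain')}"


for variant in (
    "P0DTD1-PRO_0000449621",
    "P0DTD1:PRO_0000449621",
    "P0DTD1#PRO_0000449621",
):
    print(normalize_chain_id(variant))
```

A bare chain ID can only be normalized if the caller supplies the parent accession, which is one more argument for always writing the full form in the first place.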
As part of the KG-COVID-19 project, Marcin Joachimiak coordinated the curation of a canonical UniProt-centric protein file (available in our GitHub repository), leveraging work that had been done by UniProt, the Protein Ontology curators, and the SciBite ontology team. We use UniProt IDs (either standard accessions, for polyproteins, structural proteins, and accessory ORFs; or chain IDs for NSPs). This file differs from the files obtained directly from UniProt, as we include only reference members of nsp ‘twins’, and we exclude less meaningful cleavage products.
This file lives in GitHub (we accept Pull Requests) and serves as one source for building our KG. The information is also available in KGX property graph format, or as RDF, or can be queried from our SPARQL endpoint.
We are also coordinating with different groups such as COVIDScholar to use this as a canonical vocabulary for text mining. Previously groups performing concept recognition on the CORD-19 corpus using protein databases as dictionaries missed the non-structural proteins, which is a major drawback.
Imagine a world
In an ideal world posts like this would never need to be written. There is no panacea; however, systems such as the Protein Ontology and Wikidata which employ an ontologically grounded flexible graph make it easier to work around legacy assumptions about relationships between types of molecular parts (see also the feature graph concept from Chado). The ontology-oriented basis makes it easier to render implicit assumptions explicit, and to encode things such as the relationship between molecular parts in a computable way. Also embracing OBO principles and reusing identifiers/PURLs rather than minting new ones for each database could go some way towards avoiding potential confusion and duplication of effort.
I know this is difficult to conceive of in the current landscape of bioinformatics databases, but paraphrasing John Lennon, I invite you to:
Imagine no proliferating identifiers
I wonder if you can
No need for mappings or normalization
A federation of ann(otations)
Imagine all the people (and machines) sharing all the data, you
You may say I’m a dreamer
But I’m not the only one
I hope some day you’ll join us
And the knowledge will be as one