Posted on June 29, 2011 by Phillip Lord

An Ontological Representation of Biomedical Data Sources and Records

Abstract

Large RDF-triple stores have been the basis of prominent recent attempts to integrate the vast quantities of data in semantically divergent databases. However, these repositories often conflate data-source records, which are information content entities, and the biomedical concepts and assertions denoted by them. We propose an ontological model for the representation of data sources and their records as an extension of the Information Artifact Ontology. Using this model, we have consistently represented the contents of 17 prominent biomedical databases as a 5.6-billion RDF-triple knowledge base, enabling querying and inference over this large store of integrated data.

Authors

Michael Bada, Kevin Livingston, and Lawrence Hunter

University of Colorado Anschutz Medical Campus, Aurora, CO, USA

Introduction

The increasingly important high-throughput analysis methods of biomedical research rely upon making effective use of the ever-growing amount of data and knowledge stored in a profusion of distributed, heterogeneous biomedical databases. Large RDF-triple stores have been the basis of prominent recent attempts to integrate the vast stores of data in these many semantically divergent databases (e.g., Belleau et al., 2008; Ruttenberg et al., 2009). The motivation behind such integration efforts is to take advantage of these existing data and knowledge and apply them to current biomedical investigations: By querying the knowledge base and synthesizing relevant information with novel data and hypotheses of interest, we hope to accelerate scientific discovery. However, effective synthesis of the data relies upon a synthesis or mutual mapping of the disparate knowledge models of the data sources. Ideally, we would like to base this integration on a representation of biomedical reality (or at least our conceptualization of biomedical reality) grounded in high-quality community ontologies, but the representation in most large biomedical stores, which are predominantly founded on relational-database technology, is not consistent with this paradigm.

Because a rigorous representation of these stored data as ontologically grounded biomedical concepts will likely be difficult in many cases, we are proposing an OWL-based model for the representation of these database records as an intermediate solution for the integration of these data in RDF stores. We are using this representation in our construction of KaBOB (Knowledge Base of Biology), a large RDF store in which we are currently representing the contents of 17 prominent biomedical databases, including DIP, Entrez Gene, GAD, GOA, HGNC, HomoloGene, InterPro, KEGG, MGD, OMIM, PharmGKB, Reactome, TRANSFAC, and UniProt, and 14 ontologies whose terms are referenced in these databases. Our representation builds off the Information Artifact Ontology (IAO) (http://code.google.com/p/information-artifact-ontology/), which focuses on the representation of information content entities, where such an entity is defined as one that “is generically dependent on some artifact and stands in relation of aboutness to some entity”.

We have used this model to represent all of the records of the files of these biomedical data sources, resulting in a knowledge base of 5.6 billion RDF triples. This is an intermediate solution to the persistent problem of integration of disparate databases: While these data have not yet been rigorously represented in terms of biomedical concepts, the data records have been consistently represented as information content entities in one resource, thus permitting their effective querying and inference. Explicitly representing data sources’ information content entities such as records, fields, field values, and schemas will also enable the explicit representation of the axiomatizations for the conversion from data records to statements of biomedical concepts in the same knowledge base—a preferable strategy to burying such conversion in code. Additionally, this representation will allow us to make fine-grained statements of provenance of the assertions of biomedical concepts. Furthermore, as our representation is not specific to KaBOB, it could serve as a model for RDF-based distribution of databases, allowing interoperability among distributed data sources.

Results and discussion

We can categorize the components of our model into three layers of representation of data sources: (1) basic classes general to data sources, (2) instances of these basic classes to represent a specific data source, and (3) instances that represent the records of this data source.

 

Fig. 1. An OWLPropViz (http://www.wachsmann.tk/owlpropviz/) rendering of the basic classes and relations of our ontological modeling of data sources and their contents. The links among these classes are shown, as are the links to existing IAO classes.

 

Basic Representation of Data Sources

Starting with the basic classes general to data sources, we use the existing IAO:data set, which is defined in the IAO as a “data item that is an aggregate of other data items of the same type that have something in common”, i.e., a collection of like data. Though the large majority of the data currently stored in KaBOB were accumulated from biomedical databases, we have also allowed for the representation of data sets from sources other than databases, e.g., experimental data sets. We have created KIAO:data source as a subclass of IAO:information content entity and KIAO:database as a subclass of KIAO:data source. (KaBOB has branches that hold extensions of the ontologies that we have imported. Our notational convention for an extension of an ontology is to prefix the ontology’s official prefix with a “K”, with the semantics that it is the KaBOB extension of the given ontology; thus, KIAO concepts are extensions of the IAO.) A data set is declared to be an integral part of a data source, and a data source is composed of one or more data sets. As databases are often distributed in the form of one or more text files of like records, we have been regarding a data set as one of these files that is part of a database. However, this conceptualization is not specified within our ontology, and other users are free to regard, e.g., the Web pages of a database as its records. (We do not regard this as inconsistent, as data sources can be outputted as different data sets with different schemas; the user should make clear as to which data sets are being represented with our model.) We have additionally created KIAO:schema and KIAO:field as subclasses of IAO:information content entity and KIAO:record and KIAO:field value as subclasses of the more specific IAO:data item. A field is an integral part of a schema, and a field value is an integral part of a record, which is itself an integral part of a data set. 
A record has a schema as its template, and a field value has a field as its template; additionally, a data set is linked to the (same) template of its member data. A simplified view of the representation of the general classes of our model, i.e., those not specific to any particular data source, is shown in Figure 1.
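The class hierarchy and relations described above can be sketched as plain Python data. This is only an illustration under stated assumptions, not the authors' OWL: the relation names `integral_part_of` and `has_template` are placeholders chosen here, and the link from IAO:data item to IAO:information content entity is taken from the IAO itself rather than from the text.

```python
# Subclass links named in the text (child -> parent).
# The data item -> information content entity link comes from the IAO.
SUBCLASS_OF = {
    "IAO:data item": "IAO:information content entity",
    "IAO:data set": "IAO:data item",
    "KIAO:data source": "IAO:information content entity",
    "KIAO:database": "KIAO:data source",
    "KIAO:schema": "IAO:information content entity",
    "KIAO:field": "IAO:information content entity",
    "KIAO:record": "IAO:data item",
    "KIAO:field value": "IAO:data item",
}

# Class-level relations (domain, relation, range); relation names are placeholders.
CLASS_RELATIONS = [
    ("IAO:data set", "integral_part_of", "KIAO:data source"),
    ("KIAO:record", "integral_part_of", "IAO:data set"),
    ("KIAO:field value", "integral_part_of", "KIAO:record"),
    ("KIAO:field", "integral_part_of", "KIAO:schema"),
    ("KIAO:record", "has_template", "KIAO:schema"),
    ("KIAO:field value", "has_template", "KIAO:field"),
    ("IAO:data set", "has_template", "KIAO:schema"),
]


def superclasses(cls):
    """Return the superclasses of cls, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    return cls == ancestor or ancestor in superclasses(cls)
```

For instance, `is_a("KIAO:record", "IAO:data item")` holds, whereas `is_a("KIAO:schema", "IAO:data item")` does not, reflecting that schemas and fields are information content entities but not data items.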

Representation of a Specific Data Source

An instance of KIAO:data source (or the more specific KIAO:database) is created to represent the data source itself, and an instance of IAO:data set is created for each data set of the data source and made an integral part of the data source instance. Additionally, for each data set, an instance of KIAO:schema is created and asserted to be the template of the data set. Lastly, for each created schema instance, an instance of KIAO:field is created for each field of the schema and made an integral part of the schema. Figure 2 displays an example set of these types of instances for our storage of the DIP database. All of these data-source-specific instances are dynamically created (and named) during RDF generation.
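As a rough sketch of how these data-source-specific instances might be generated, the following Python builds subject-predicate-object tuples for one data source. It is not the KaBOB RDF generator: the predicate names, the `Schema` naming suffix, and the two field names are hypothetical, and only `DipInteractionFile` is a name that appears in the figures.

```python
RDF_TYPE = "rdf:type"
PART_OF = "integral_part_of"   # placeholder relation name
TEMPLATE = "has_template"      # placeholder relation name


def describe_data_source(source_name, data_sets, source_class="KIAO:database"):
    """Build triples for one data source.

    data_sets maps each data-set name to the list of its field names.
    """
    triples = [(source_name, RDF_TYPE, source_class)]
    for set_name, field_names in data_sets.items():
        schema = f"{set_name}Schema"  # hypothetical naming convention
        triples += [
            (set_name, RDF_TYPE, "IAO:data set"),
            (set_name, PART_OF, source_name),
            (schema, RDF_TYPE, "KIAO:schema"),
            (set_name, TEMPLATE, schema),
        ]
        for field in field_names:
            triples += [
                (field, RDF_TYPE, "KIAO:field"),
                (field, PART_OF, schema),
            ]
    return triples


# Hypothetical field names for the two interactor fields of the DIP file.
dip_triples = describe_data_source(
    "DIP",
    {"DipInteractionFile": ["DipInteractionFileField1", "DipInteractionFileField2"]},
)
```

Representing only a subset of a data source then amounts to passing only the data sets and fields of interest.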

Our ontology is fully capable of handling the evolution of data sources: If the schema of a given data set is changed, a new instance of the schema is simply created, along with the instances of the fields of the new schema. If the data sets of a data source change (or a new set is made available), an instance for each new data set can be created, along with instances for its schema and fields. (Modeling of incremental change rather than creation of new instances may be desirable but poses significant representational challenges.) Additionally, using our model, if a researcher wishes to work with multiple versions of a given data source (e.g., to analyze some aspect of multiple versions of a given database), an instance for each version of the data source can be created. If different versions of a data source consist of different data sets (e.g., different file organizations) and/or different schemas and fields, the explicit representation of all of these elements and their linkages will make the respective structures of the disparate data-source versions unambiguous. Furthermore, it may be the case that only a subset of a data source needs to be represented; in such a case, only instances of the data sets, schemas, and fields of interest are created.

Basic Representation of the Data of a Specific Data Source

Fig. 2. An OWLPropViz rendering of instances representing the DIP database, a DIP data set (a file of protein-protein interactions), the schema for this data set, and two basic fields of this schema (representing the two interactors of an interaction).

This lowest-level part of our model focuses on the representation of the data of a given data source through the instantiation of KIAO:record and KIAO:field value: an instance of the former is created for each entry (i.e., record) in a given data set, and an instance of the latter is created for each field value of each of these records. Each record is made an integral part of the previously created data-set instance and has the previously created schema instance declared its template. Analogously, each field value of a given record is made an integral part of the record and has the previously created field instance declared its template. Each field value is its own instance, even if its actual value is the same as another; representing the actual values is not discussed here due to space limitations. Figure 3 displays an example set of instances representing one record of the data set represented in Figure 2 and two of the field values of this record. This layer that represents the actual data of the data source accounts for the large majority of the triples generated based on this model. As seen in Figure 3, three triples are generated for each record instance and four triples for each field-value instance (including one denoting its value, not shown in the figure). This is a more verbose representation compared to one in which assertions mirroring the structure of the data source are created; however, whereas other models typically conflate representation of data-source content with assertions of biomedical concepts, our model accurately represents this content, which can then be used to track the provenance of biomedical-concept assertions.
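The triple counts above can be made concrete with a small sketch. The text does not enumerate the triples, so the breakdown here is an assumption: a type assertion, a part-of link, and a template link for each record, plus the same three and a value triple for each field value. The predicate names, the field names, and the interactor values are placeholders.

```python
def record_triples(record_id, data_set, schema, values_by_field):
    """Build triples for one record and its field values.

    values_by_field maps a field instance name to the raw value in this record.
    """
    # Three triples per record (assumed breakdown).
    triples = [
        (record_id, "rdf:type", "KIAO:record"),
        (record_id, "integral_part_of", data_set),
        (record_id, "has_template", schema),
    ]
    for i, (field, value) in enumerate(values_by_field.items(), start=1):
        fv = f"{record_id}_FV{i}"  # each field value is its own instance
        # Four triples per field value, including the one denoting its value.
        triples += [
            (fv, "rdf:type", "KIAO:field value"),
            (fv, "integral_part_of", record_id),
            (fv, "has_template", field),
            (fv, "has_value", value),
        ]
    return triples


triples = record_triples(
    "DipInteractionFileRecord1",
    "DipInteractionFile",
    "DipInteractionFileSchema",          # hypothetical schema instance name
    {
        "DipInteractionFileField1": "interactor-A-id",  # placeholder value
        "DipInteractionFileField2": "interactor-B-id",  # placeholder value
    },
)
```

Under this breakdown a record with n field values yields 3 + 4n triples, so the two-field example produces 11.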

Representation of Record Substructure

Some data sets of data sources have complex fields, i.e., have substructure that cannot be represented in terms of straightforward field values. This field substructure can manifest itself in several ways but usually involves certain values of fields semantically linked to other values. One type of substructure can be seen in a pair of fields within a data set each having multiple values where the field values of one field correspond to those of the other field, e.g.:

 

MI:0045|MI:0114    PMID:8494892|PMID:8043575

 

These are the sets of field values for two fields of one record of the DIP interaction file represented in Figure 2: the field identifying the method used to detect a given interaction and the field giving the PubMed ID of the article reporting this interaction. In this example, the first value of the interaction-detection-method field is tied to the first value of the publication field, and analogously for the second values of these two fields. We have conceptualized such record substructure as subrecords; e.g., using this pair of complex fields, one subrecord contains the first values of these two fields (MI:0045 and PMID:8494892) and another subrecord contains the second values (MI:0114 and PMID:8043575). Without such an explicit representation of this substructure, the linkage implied in the complex structure of this file would be lost.
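The positional pairing in this example can be sketched in a few lines of Python. The function name and the choice to reject fields of unequal length are assumptions made for illustration; the text does not say how such cases are handled.

```python
def split_subrecords(*field_values, sep="|"):
    """Pair up the i-th values of several parallel multi-valued fields.

    Returns one tuple per subrecord, in the order the values appear.
    """
    columns = [value.split(sep) for value in field_values]
    # Parallel fields must line up position by position (assumed policy).
    if len({len(column) for column in columns}) != 1:
        raise ValueError("parallel fields must have the same number of values")
    return list(zip(*columns))


subrecords = split_subrecords("MI:0045|MI:0114", "PMID:8494892|PMID:8043575")
# subrecords == [("MI:0045", "PMID:8494892"), ("MI:0114", "PMID:8043575")]
```

Each resulting tuple corresponds to one subrecord, which keeps a detection method tied to the publication that reported it.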

 

Fig. 3. An OWLPropViz rendering of instances representing one record (DipInteractionFileRecord1) of the data set represented in Fig. 2 (DipInteractionFile) and two field values (DipInteractionFileRecord1_FV1 and DipInteractionFileRecord1_FV2) of this record.

 

Fig. 4. An OWLPropViz rendering of instances representing the record substructure of the example presented in this section. The first values of the two fields (DipInteractionFileRecord1Subrecord1_FV1 and DipInteractionFileRecord1Subrecord_FV2) are integral parts of a subrecord (DipInteractionFileRecord1Subrecord1), which is an integral part of the full record (DipInteractionFileRecord1). Additionally, these two field values have two respective subfields as their templates (DipInteractionFileSubfield1 and DipInteractionFileSubfield2), and these subfields are integral parts of a subschema (DipInteractionFileSubschema1), which is an integral part of the schema of the full record.

 

 

To represent this substructure, the model shown in Figure 1 was made slightly more complex. A subrecord is made an instance of KIAO:record (as for ordinary records). While a record can only be part of a data set in the basic representation, in this richer representation we have asserted that a record is an integral part of a record or a data set (i.e., the union of these classes) since a subrecord is part of a record. Thus, we can declare that the subrecord is a part of its record, which is therefore transitively part of the data set. Just as a record has a schema as its template, a subrecord has a subschema as its template, so a subschema for the subrecord is made an instance of KIAO:schema (as for ordinary schemas). As the subrecord is a part of its record, the subschema is analogously a part of the schema of the record. Additionally, a record can contain multiple subrecord types without conflict. Figure 4 displays instances of a subschema, subrecord, and its two values based on this example.
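The transitive part-of reasoning can be illustrated with a small lookup over part-whole links. The instance names follow the figure captions where available; `DIP` and `DipInteractionFileSchema` are hypothetical names for the data-source and schema instances, and the dictionary is a stand-in for the integral-part relation rather than the actual OWL property.

```python
# Direct "is an integral part of" links (part -> whole).
PART_OF = {
    "DipInteractionFileRecord1Subrecord1_FV1": "DipInteractionFileRecord1Subrecord1",
    "DipInteractionFileRecord1Subrecord1": "DipInteractionFileRecord1",
    "DipInteractionFileRecord1": "DipInteractionFile",
    "DipInteractionFile": "DIP",                                 # hypothetical name
    "DipInteractionFileSubfield1": "DipInteractionFileSubschema1",
    "DipInteractionFileSubschema1": "DipInteractionFileSchema",  # hypothetical name
}


def wholes(part):
    """Return everything that part is transitively an integral part of."""
    result = []
    while part in PART_OF:
        part = PART_OF[part]
        result.append(part)
    return result


chain = wholes("DipInteractionFileRecord1Subrecord1_FV1")
# The field value reaches the data set (and the data source) via its subrecord and record.
```

The subfield follows the parallel path on the schema side: `wholes("DipInteractionFileSubfield1")` leads through the subschema to the schema of the full record.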

Related work

Biomedical RDF-triple stores often conflate database records, i.e., information content entities, and the concepts and assertions denoted by them. Such a distinction is clearly made in the Neurocommons knowledge base (Ruttenberg et al., 2009), but an explicit representation of the database records is not attempted; standardized URIs are instead used to refer to records. Our model has not been designed to be used to make assertions of biomedical concepts, as in the referent tracking of Ceusters and Smith (2005). There have been efforts to programmatically transform the contents of databases to parallel ontological constructs (e.g., Astrova and Stantic, 2004); however, parallel representation is likely often incorrect given the substantial differences between modeling of data schemas and ontological engineering (Spyns et al., 2002). D2R MAP is an XML-based language designed to map database data to RDF (Bizer, 2003), whereas our proposal is a formal ontological model and is not limited to relational data. Relational.OWL is an OWL-based effort to model database schemas to serve as a rigorous exchange format (de Laborda and Conrad, 2005), whereas our model is not designed to capture all details of data-source schemas but to represent the data of these sources; furthermore, Relational.OWL is strictly modeled on the relational paradigm, whereas we have based our model on a more general concept of a data set, which may or may not originate from a database, extended from the IAO.

Conclusions

We have presented an ontological model extended from the Information Artifact Ontology for the representation of data sources and their content. Using this model, we have consistently represented the records of 17 prominent biomedical databases as 5.6 billion RDF triples (which we have loaded in 3.1 days into a bigdata store); we regard the representation of these information content entities as an intermediate representation toward one in terms of biomedical concepts. In addition to affording querying and inference over these wide-ranging data, this representation will allow us to declaratively model the axiomatizations for conversion to biomedical concepts and tracking of provenance at a fine-grained level.

Acknowledgements

We gratefully acknowledge the support of this work by NIH grants R01LM009254, R01GM083649, and R01LM008111.

References

Astrova, I. and Stantic, B. (2004) Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms. 1st Eur. Sem. Web Symp. 327-341.

Belleau, F. et al. (2008) Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inform. 41(5), 706-16.

Bizer, C. (2003) D2R MAP – A Database to RDF Mapping Language. Proc. WWW 2003.

Ceusters, W. and Smith, B. (2005) Tracking Referents in Electronic Health Records. In: Engelbrecht, R. et al. (eds.) Proc. 2005 Medical Informatics Europe, IOS Press, Amsterdam, 71-76.

de Laborda, C.P. and Conrad, S. (2005) Relational.OWL: a data and schema representation format based on OWL. Proc. 2nd Asia-Pac. Conf. Conceptual Modelling (APCCM), 89-96.

Ruttenberg, A., Rees, J.A., Samwald, M., and Marshall, M.S. (2009) Life sciences on the Semantic Web: the Neurocommons and beyond. Briefings in Bioinform. 10(2), 193-204.

Spyns, P., Meersman, R., and Jarrar, M. (2002) Data modelling versus Ontology engineering. SIGMOD Record 31(4), 12-17.
