Exploring Gene Ontology Annotations with OWL
Motivation: Ontologies such as the Gene Ontology (GO), and their use in annotations, make cross-species comparisons of genes possible, along with a wide range of other activities. Tools such as AmiGO allow exploration of genes based on their GO annotations. This human-driven exploration and querying of GO is useful, but by taking advantage of the ontological representation we can use these annotations to create a rich polyhierarchy of gene products for enhanced querying. This also opens up possibilities for exploring GO annotations (GOA) for redundancies and defects in annotations.
To do this we have created a set of OWL classes for mouse genes and their GOA. Each gene is represented as a class, with appropriate relationships to the GO aspects with which it has been annotated. We then use defined classes to query these gene product classes and to build a complex hierarchy. This standard use of OWL affords a rich interaction with GO annotations to give a fine partitioning of the gene products in the ontology.
Simon Jupp1*, Robert Stevens1 and Robert Hoehndorf2
1School of Computer Science, University of Manchester, UK. 2Department of Genetics, University of Cambridge, Cambridge, UK
The creation of the Gene Ontology (GO) (Harris 2004) has had a major impact on the description and communication of the major functionalities of gene products for many species. GO has some 24,000 terms for annotating gene products and is used in around 40 species databases and in cross-species databases such as UniProt and InterPro (Camon 2004). It is widely used for querying such databases, for making cross-species comparisons and in data analyses, such as over-expression analysis of microarray data (Baehrecke 2004).
The GO is mainly used as a controlled vocabulary to ensure genes are consistently annotated using standard terminology across many data resources; this alone offers many benefits for data integration and analysis. GO is, however, much more than a vocabulary; it also provides additional information about how these GO terms are related to each other. These relationships have a strict semantics in their representation that brings added value to the GO. For example, the hierarchical relationships allow gene products annotated with any more specific kind of a term to be retrieved, as well as those annotated with the term itself. These and other relationships provide support for navigation, as well as making explicit the relationship between the entities being described.
The AmiGO browser (Carbon 2009) (see also DynGO (Liu 2005), QuickGO (Binns 2009)) provides such an interface and exploits the hierarchical structure of the Gene Ontology to support query expansion. For example, when searching AmiGO for receptor activity genes, the results returned also include genes involved in GPCR activity, as GPCR activity is a subclass of receptor activity. This hierarchical structure is also useful for data mining tasks (Pavlidis 2004). For example, enrichment analysis is a common technique used in the analysis of high-throughput gene expression data; sets of interesting genes can be grouped or clustered based on common GO annotations (see http://www.geneontology.org/GO.tools.shtml for more GO tools).
Whilst highly useful, many of these tools fail to exploit the full potential of the GO’s representation for reasoning and querying over gene annotations. Most of the tools that were investigated do not facilitate rich querying that takes into account the semantics of the GO relationships. For example, it was difficult to ask for all gene products that are located in a membrane or part of a membrane, that are receptor genes involved in a metabolic process. To answer such a query correctly some form of reasoning over the ontology is required. The ability to perform such rich queries would enable more precise and flexible exploration of the GO annotations.
The Web Ontology Language (OWL) and the Open Biomedical Ontology (OBO) format have a strict semantics that makes it possible to use automated reasoners to help build and use knowledge captured in an ontology. In order to explore the potential of reasoning over the GO annotations we need to describe the relationships between the genes and their annotations within a framework that can also exploit the semantics encoded into the GO. Our approach uses OWL, for which a mapping from OBO exists, to represent the GO annotations alongside the GO itself, so that both can be exploited for querying and exploration.
As an ontology of attributes of gene products, GO itself does not explicitly contain gene products; GO annotations are attached to gene products in databases or flat-files (See http://www.geneontology.org/GO.annotation.shtml). Using the compositional approach to ontology building we can create an ontology from these annotations that explicitly relates gene products to GO and then add defined classes to impose a hierarchy on the gene products. For example, we can create a defined class (in Manchester OWL syntax) such as:
This defined class will recognize any class of gene product that has both of these attributes, or children of these attributes, and subsume it within the hierarchy of gene products. In this standard use of OWL and automated reasoning, we can add more such defined classes to build an arbitrarily complex polyhierarchy for querying and navigation of entities annotated with the GO. Figure 1 shows such an inferred polyhierarchy centered on annotations for the GRM1[MGI:1351338] gene product.
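The recognition step described above can be illustrated with a toy check in Python. This is a minimal sketch, not a description logic reasoner and not the classification procedure used in this work: the term names, the small is_a table, the `located_in` property name and the `gene_x` class are all assumptions made for illustration. It only shows why a gene product annotated with more specific terms is still recognised by a defined class stated over more general terms.

```python
# Toy subsumption check; NOT a description logic reasoner.
# Term names, IS_A links and the `located_in` property are illustrative assumptions.

# Asserted is_a links between GO-like terms (child -> parents).
IS_A = {
    "GPCRActivity": {"ReceptorActivity"},
    "NuclearMembrane": {"Membrane"},
}


def ancestors_or_self(term):
    """Return the term together with everything reachable via is_a."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed(gene_annotations, definition):
    """True if every (property, filler) pair in the definition is met by an
    annotation with the same property and an equal or more specific term."""
    return all(
        any(prop == req_prop and req_filler in ancestors_or_self(term)
            for prop, term in gene_annotations)
        for req_prop, req_filler in definition
    )


# Hypothetical defined class: receptor gene products located in a membrane.
MEMBRANE_RECEPTOR = {
    ("has_molecular_function", "ReceptorActivity"),
    ("located_in", "Membrane"),
}

# Hypothetical gene product class annotated with more specific terms.
gene_x = {
    ("has_molecular_function", "GPCRActivity"),
    ("located_in", "NuclearMembrane"),
}

print(is_subsumed(gene_x, MEMBRANE_RECEPTOR))
```

A simple conjunction of exact term matches would miss `gene_x`, because neither of its annotations names the general terms directly; following the hierarchy is what places it under the defined class.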
Step one: An initial set of GO annotations for mouse genes was downloaded from the Mouse Genome Informatics (MGI) site. In order to reduce the size of the dataset to ease development we only selected annotations that had evidence codes of EXP, IDA, TAS, RCA, IC (see http://www.geneontology.org/GO.evidence.shtml for definitions). We also filtered these genes to exclude the RIKEN cDNA genes.
Step two: In order to express these annotations ontologically we created a primitive OWL class for each of the genes. We then described each gene according to its annotation using existential OWL restrictions. From this a simple pattern emerges where each gene class is restricted by the corresponding GO term from the annotation.
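The filtering in step one might be sketched as follows in Python. This sketch assumes the annotation file uses the tab-separated GO Annotation File (GAF) column layout (symbol in column 3, GO ID in column 5, evidence code in column 7, aspect in column 9, gene name in column 10), and it assumes RIKEN cDNA genes can be recognised by a name beginning with "RIKEN cDNA"; the actual filter used may differ. The helper `make_row` and all sample rows are invented for illustration and are not real annotation records.

```python
# Evidence codes kept in step one of the text.
KEPT_EVIDENCE = {"EXP", "IDA", "TAS", "RCA", "IC"}


def filter_annotations(lines):
    """Yield (symbol, go_id, aspect) for annotation rows that pass the filter."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue  # skip malformed rows
        symbol, go_id, evidence = cols[2], cols[4], cols[6]
        aspect, name = cols[8], cols[9]
        if evidence not in KEPT_EVIDENCE:
            continue
        # Assumption: RIKEN cDNA genes are identified by their full name.
        if name.startswith("RIKEN cDNA"):
            continue
        yield symbol, go_id, aspect


def make_row(db_id, symbol, go_id, evidence, aspect, name):
    """Build a 15-column GAF-like row (hypothetical helper for the demo)."""
    cols = ["MGI", db_id, symbol, "", go_id, "", evidence, "", aspect,
            name, "", "gene", "taxon:10090", "", "MGI"]
    return "\t".join(cols)


# Invented sample rows, for illustration only.
rows = [
    "!gaf-version: 2.0",
    make_row("MGI:1351338", "Grm1", "GO:0004872", "IDA", "F",
             "glutamate receptor, metabotropic 1"),
    make_row("MGI:0000001", "GeneA", "GO:0005739", "IEA", "C",
             "hypothetical gene A"),
    make_row("MGI:0000002", "1110000A00Rik", "GO:0005739", "IDA", "C",
             "RIKEN cDNA 1110000A00 gene"),
]

kept = list(filter_annotations(rows))
print(kept)
```

Only the first data row survives: the second is dropped for its evidence code and the third for its RIKEN cDNA name.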
Rather than generating the axioms by hand, we used the Ontology Pre-Processor Language (OPPL) to specify and instantiate the pattern (Iannone 2009). OPPL allows us to express patterns for each of the three branches of GO. A Java program is then used to parse the GO annotations file downloaded from MGI and instantiate the OPPL and generate the OWL ontology.
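The effect of instantiating the pattern can be approximated with a short Python sketch. It does not use OPPL or reproduce the Java program; it simply prints one Manchester OWL syntax class frame per gene, with one existential restriction per annotation. The property names `has_molecular_function` and `participates_in` appear later in the text, while `located_in`, the use of the gene symbol as the class name, and the `GO_nnnnnnn` naming of GO classes are assumptions. The sample annotations are invented.

```python
from collections import defaultdict

# GO aspect code -> object property. `located_in` is an assumed name.
ASPECT_PROPERTY = {
    "F": "has_molecular_function",
    "P": "participates_in",
    "C": "located_in",
}


def go_class_name(go_id):
    """Turn 'GO:0004872' into an OWL-friendly name such as 'GO_0004872'."""
    return go_id.replace(":", "_")


def gene_frames(annotations):
    """Build one Manchester syntax class frame per gene symbol."""
    restrictions = defaultdict(list)
    for symbol, go_id, aspect in annotations:
        axiom = f"{ASPECT_PROPERTY[aspect]} some {go_class_name(go_id)}"
        if axiom not in restrictions[symbol]:  # drop duplicate annotations
            restrictions[symbol].append(axiom)
    frames = []
    for symbol in sorted(restrictions):
        body = ",\n        ".join(restrictions[symbol])
        frames.append(f"Class: {symbol}\n    SubClassOf:\n        {body}")
    return "\n\n".join(frames)


# Invented sample annotations as (symbol, GO ID, aspect) triples.
sample = [
    ("Grm1", "GO:0004872", "F"),
    ("Grm1", "GO:0005886", "C"),
    ("Grm1", "GO:0004872", "F"),
]

print(gene_frames(sample))
```

Each gene class stays primitive (it uses SubClassOf rather than EquivalentTo), which matches the role of the gene classes as leaves to be classified under the defined classes.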
Step three: We created a GO association ontology by importing this file of generated primitive classes together with the three aspects of GO (in their OWL form) into a master ontology file.
Step four: The generated GO association ontology was then manually edited using Protégé 4.1 (beta, build 220) to add defined classes. We initially created defined classes to represent subsets of the top level GO terms by defining OWL classes for genes found in a particular cellular compartment. For example, we created the class of mitochondrial gene products as follows:
We repeated this basic pattern for the top level cellular compartments, and then continued for the biological processes and molecular function classes. From these base level class descriptions we then began to create more complex class descriptions composed of classes previously created. We then created a class for the mitochondrial receptor gene products with the following class definition:
This pattern was repeated until we began to create classes that were composed of terms from all three branches of the gene ontology. For example, to find the mitochondrial gene products that are receptor gene products and participate in cell killing we generated the following OWL defined class:
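The layering of defined classes described in step four can be sketched in Python as a small generator of Manchester OWL syntax frames. The class definitions themselves are not reproduced here, so everything in the sketch is hypothetical: the `GeneProduct` superclass, the class names (modelled on the naming style used elsewhere in the text), the GO class names, and the `located_in` property are assumptions chosen only to show how simple definitions compose into more complex ones.

```python
def defined_class(name, *operands):
    """Return a Manchester syntax frame making `name` equivalent to the
    intersection of the given class expressions."""
    # Parenthesise restrictions; leave plain class names as they are.
    parts = [f"({op})" if " " in op else op for op in operands]
    expression = "\n        and ".join(parts)
    return f"Class: {name}\n    EquivalentTo:\n        {expression}"


# Base level: one defined class per GO term of interest (hypothetical names).
mitochondrial = defined_class(
    "MitochondrialGeneProduct",
    "GeneProduct", "located_in some Mitochondrion")
receptor = defined_class(
    "ReceptorActivityGeneProduct",
    "GeneProduct", "has_molecular_function some ReceptorActivity")
cell_killing = defined_class(
    "CellKillingGeneProduct",
    "GeneProduct", "participates_in some CellKilling")

# Composite level: intersections of previously defined classes.
mito_receptor = defined_class(
    "MitochondrialReceptorGeneProduct",
    "MitochondrialGeneProduct", "ReceptorActivityGeneProduct")
mito_receptor_killing = defined_class(
    "MitochondrialReceptorCellKillingGeneProduct",
    "MitochondrialReceptorGeneProduct", "CellKillingGeneProduct")

print("\n\n".join([mitochondrial, receptor, cell_killing,
                   mito_receptor, mito_receptor_killing]))
```

Building composite classes from named defined classes, rather than repeating the restrictions, keeps each definition short and lets a reasoner arrange the resulting classes into the polyhierarchy.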
An arbitrary number of defined classes can be created in this pattern, each of which will subsume and be subsumed by other classes fitting the definition in the growing ontology. At the leaves of this polyhierarchy we have the primitive classes representing the gene products themselves.
We extracted all mouse genes from the MGI database and applied our filtering, producing a total of 29,559 gene-annotation pairs (see step one). The conversion to OWL classes gave 10,104 primitive gene product classes (see step two). After importing GO, the final ontology of primitive gene product classes and the GO contained 39,332 primitive classes (see step three).
We created a further 120 defined classes describing various gene categories (see Step four). As an exemplar, we concentrated on genes with receptor activity, located in some membrane and with processes involved in cell growth, metabolism and signal transduction.
In order to classify the ontology we used several DL reasoners. Classification was performed on a 2.2 GHz i7 MacBook Pro and required around 3 GB of memory. Table 1 shows the performance times for each reasoner.
Table 1. Average timing (seconds) for each reasoner
To illustrate the capabilities of the generated ontology we show a query to get the genes that are located in the nuclear membrane of the cell, that participate in some metabolic process and that have the function of some receptor activity. Figure 1 shows a screenshot from Protégé of a defined class named MetabolicNuclearMembraneReceptorGeneProduct. This class is composed of the intersection of three other defined classes named NuclearMembraneGeneProduct, ReceptorActivityGeneProduct, and MetabolicProcessGeneProduct. These classes are defined in OWL as follows:
After reasoning over the ontology we inferred that only the Grm1 gene is a subclass of our MetabolicNuclearMembraneReceptorGeneProduct class. Although this is a relatively simple query, answering it requires some reasoning, which this approach of using OWL makes possible. Our attempts to replicate such a query in the popular online tools for querying GOA using a simple conjunction of these terms yielded no results, showing a clear advantage to the OWL approach over existing tools.
Figure 1. The classification of the GRM1 gene according to generated defined classes for gene products
Although the queries demonstrated here are relatively simple, they serve to illustrate the potential of a pure OWL approach to querying Gene Ontology annotations. Using similar patterns we can begin to imagine more complex class descriptions that utilise additional expressivity in OWL, such as the use of complement classes to query for genes that ‘has_molecular_function some not (ReceptorActivity) and participates_in some SignalTransduction’, which would find those genes that have a function other than receptor activity and are involved in signal transduction. (Note that the semantics mean that such genes can have a receptor activity, but must have an activity other than receptor activity. GO annotations are not closed, so we cannot say ‘not (has_molecular_function some ReceptorActivity)’ and expect to recognize any genes.)
If all the GO molecular function classes were to be replicated as defined classes, we would replicate the GO molecular function ontology; the same would happen for each aspect of GO. As we combine the different aspects of GO in more complex defined classes, we will generate a more complex hierarchy of gene products.
The announced GO cross-products extension to the GO will provide logical definitions for the GO classes. These definitions will enable richer OWL queries over the GO annotations and offer the potential to infer more annotations on existing GOA genes (Fernández-Breis 2010).
The next stage of development in our work will be to incorporate more defined classes and different ontologies, such as the phenotype annotations for mouse genes and descriptions of cells in which they are known to function. This will enable queries such as finding those genes that are known to participate in processes that are involved in a particular phenotype.
Our current exploratory implementation performs well in practice, but the number of defined classes is currently small. Adding more expressive constructs to the ontology will afford further opportunities; adding disjointness axioms to GO may help us uncover mis-annotations and we have yet to fully exploit property characteristics such as transitivity and functionality. We can also explore ways of flexibly incorporating annotations with differing degrees of confidence through use of the GO evidence codes and programmatically generating the defined classes that form the polyhierarchy of genes. Finally, we need to present the ontology via tools such as the OWLBrowser.
In this work we have made a straightforward use of OWL and automated reasoning to deliver a flexible way to query all aspects of GO annotations. The polyhierarchy formed also provides similarly rich navigation in a gene-product-oriented setting. Finally, we provide a flexible framework for exploring and manipulating GO and other valuable annotations developed by the community.
The ontologies and associated files are available to download from http://owl.cs.manchester.ac.uk/mouse_goa/index.html. We recommend Protégé 4.1 beta for viewing the generated ontology.
This work was funded by the e-LICO project — EU/FP7/ICT-2007.4.4.
Harris MA et al (2004). The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. Jan 1;32(DATABASE):D258–D261
Evelyn Camon and Rolf Apweiler et al. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology Nucl. Acids Res. (2004) 32(suppl 1): D262-D266 doi:10.1093/nar/gkh021
Eric H Baehrecke, Niem Dang, Ketan Babaria and Ben Shneiderman. Visualization and analysis of microarray and gene ontology data with treemaps. BMC Bioinformatics 2004, 5:84. doi:10.1186/1471-2105-5-84
Seth Carbon, Amelia Ireland, Christopher J. Mungall, ShengQiang Shu, Brad Marshall, Suzanna Lewis, the AmiGO Hub, and the Web Presence Working Group. AmiGO: online access to ontology and annotation data. Bioinformatics. 2009 January 15; 25(2): 288–289.
Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E. Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem Res. 2004 Jun;29(6):1213-22.
Luigi Iannone, Alan L. Rector, Robert Stevens: Embedding Knowledge Patterns into OWL. ESWC 2009: 218-232
Jesualdo Tomás Fernández-Breis, Luigi Iannone, Ignazio Palmisano, Alan L. Rector, Robert Stevens. Enriching the Gene Ontology via the Dissection of Labels Using the Ontology Pre-processor Language. In Proceedings of EKAW 2010. pp. 59-73