October 15, 2012

A Task-Based Approach For Gene Ontology Evaluation


The Gene Ontology annotations are important tools for interpreting gene sets. Here, we introduce a method for evaluating Gene Ontology annotations based on the impact they have on gene set enrichment analysis. This task-based approach yields quantitative assessments grounded in real experimental data and anchored tightly to the primary use of the annotations. Using this method, our results indicate that human Gene Ontology annotations have improved significantly in their performance in enrichment analysis since 2002. We also demonstrate the sensitivity of common enrichment methods to annotation composition and completeness, implying that re-analysis of past experiments may yield new and important results. Supplementary materials are available at http://bitbucket.org/sulab/go-evaluation.


Erik L. Clarke, Benjamin M. Good, and Andrew I. Su*

The Scripps Research Institute, La Jolla, CA

1 Introduction

The Gene Ontology (GO) (The Gene Ontology Consortium, 2010) provides a framework to systematically classify and annotate gene function. The annotations associated with GO play a critical role in modern biology and cover many organisms. For the human genome, over 10,000 GO terms are used to annotate gene function in an expansive database of over 200,000 annotations.

Currently, over half of human GO annotations are the result of manual curation, as opposed to electronically inferred annotations (IEAs) (Homo sapiens GOA, revision 1.232). In general, manual annotations are considered to be of higher quality (Schnoes et al., 2009) than IEAs. Both sets of annotations are continually revised and expanded, based on published advances in the biological literature and sustained biocuration efforts. The ever-increasing size of the human GO annotation dataset (Figure 1) suggests that the structured representations of gene function are still very much in flux. Enrichment approaches are among the most commonly used applications of GO annotations, and they are an important tool in the analysis of genome-scale profiling experiments (Khatri and Drăghici, 2005).

Due to the importance of the GO annotations in modern biology, significant effort has been put into assessing the quality of the annotations. Providing measures of annotation completeness, accuracy, and precision is critical if researchers are to use the annotations in real-world applications with confidence. Previous work on this subject (Buza et al., 2008) tackled this problem through the use of quality metrics built upon the annotation term’s ontology depth and evidence code. Other efforts (Dolan et al., 2005) examined the consistency of annotations across organisms to determine whether comparative genomics using GO annotations was accurate. We have instead chosen a task-based approach that examines the completeness and utility of GO annotations through the lens of gene enrichment analysis.

Figure 1. Cumulative annotations from 2002 through 2011. This shows the increase in individual gene:term pairs from the start of 2002 up to the start of 2012. These numbers were created by progressively filtering earlier annotations from the 2012 human annotation file.

Here, we describe a framework to assess the effectiveness of the GO annotations in providing biologically accurate characterizations of a given dataset. We then use this framework to demonstrate how the statistical power to identify manually identified, biologically representative GO terms (the positive controls) in a microarray dataset varies widely as a function of time.

We also describe how our approach can be used to model the progression of the GO annotations over time, either for a particular area of interest or for the body of annotations as a whole. Our framework has the ability to measure annotation effectiveness (a real-world assessment of quality) by eschewing ad-hoc metrics in favor of task-based performance (Porzel and Malaka, 2004). It also provides the data for a comparative analysis of GO annotations, allowing researchers to identify areas in which annotations are limited or non-specific. Because of these features, our framework allows unique insights into the real-world efficacy and growth of GO annotations over time.

2 Analytical framework

Our framework consists of the following steps:

  1. Identify term(s) of interest T. These may be representative terms of an area of interest or a sample across the GO structure.
  2. Collect datasets that are unambiguously related to T. For example, if T was angiogenesis, one might select datasets related to highly angiogenic tissue.
  3. Conduct an enrichment analysis on each dataset from (2).
  4. From the list of enriched terms resulting from (3), identify the rank and p-value of T. The expectation is that T will be significantly enriched against the background terms; whether or not that expectation is met, along with T’s relative rank and score, indicates the efficacy of T’s annotation set.
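Step 4 can be illustrated with a short Python sketch. It assumes the enrichment results from step 3 are available as a mapping from GO term ID to p-value; the function name `rank_of_term` and the example p-values are hypothetical and are not taken from our scripts or results. The 0.01 threshold matches the significance threshold used in this paper.

```python
from typing import Dict, Optional, Tuple


def rank_of_term(pvalues: Dict[str, float], term: str,
                 alpha: float = 0.01) -> Optional[Tuple[int, float, bool]]:
    """Return (rank, p-value, is_significant) for `term`, or None if absent.

    Rank 1 is the smallest p-value. Ties are broken by term ID so that the
    ordering is deterministic (an assumption made for this sketch).
    """
    if term not in pvalues:
        return None
    ordered = sorted(pvalues, key=lambda t: (pvalues[t], t))
    p = pvalues[term]
    return ordered.index(term) + 1, p, p < alpha


# Hypothetical enrichment output for one dataset (p-values are made up).
example = {"GO:0001525": 3.0e-4, "GO:0048731": 1.0e-9, "GO:0007268": 2.0e-6}
result = rank_of_term(example, "GO:0001525")
```

A term that is missing from the results returns `None`, which keeps "not tested" distinct from "tested but not significant".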

In this paper, we use the framework to observe changes in human GO annotations from 2004 through 2012. To do this, we created “snapshots” of the annotation sets as they existed each year by filtering out progressively earlier annotations based on their timestamps. The most recent snapshot thus contained annotations made up to January 1st, 2012, and the earliest contained those made before January 1st, 2004. We preserved the ontology structure of GO in 2012 across these snapshots for simplicity. However, it would not be difficult to alter the snapshots to reflect the GO structure of the period each one is intended to represent. Preserving the contemporary structure allows us to focus exclusively on the effect of annotation composition over time. This method has the side-effect of eliminating all electronically inferred annotations, since their timestamps are always the most current release date.
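The snapshot construction can be sketched as follows. This is a minimal illustration, not the code from our scripts: the `Annotation` record, the `snapshot` function, and the gene names are hypothetical, and we assume each annotation carries a single assignment date. Whether an annotation dated exactly January 1st belongs to a snapshot is also an assumption here (it is excluded).

```python
from datetime import date
from typing import Iterable, List, NamedTuple


class Annotation(NamedTuple):
    gene: str       # gene identifier (hypothetical names below)
    term: str       # GO term ID
    assigned: date  # timestamp of the annotation


def snapshot(annotations: Iterable[Annotation], year: int) -> List[Annotation]:
    """Keep only annotations made before January 1st of `year`."""
    cutoff = date(year, 1, 1)
    return [a for a in annotations if a.assigned < cutoff]


# Made-up records, for illustration only.
records = [
    Annotation("GENE_A", "GO:0001525", date(2003, 6, 1)),
    Annotation("GENE_B", "GO:0001525", date(2008, 2, 14)),
    Annotation("GENE_C", "GO:0048731", date(2011, 11, 30)),
]
counts = {year: len(snapshot(records, year)) for year in (2004, 2009, 2012)}
```

Because every snapshot is filtered from the same most recent annotation file, each earlier snapshot is a subset of the later ones.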

For step 3, above, we chose an enrichment analysis method based on Fisher’s exact test (Man et al., 2000). To isolate differentially expressed genes, we first filtered out microarray probes whose maximum value was less than the median value for the dataset. We then took the natural log of each element in the data table. To define groups of differentially expressed genes, we compared each sample subset against the rest of the samples with an independent t-test. We kept only probes with a nominal p-value < 0.05 and a difference between group averages > 1. From the GO terms, we kept only those annotated to between 15 and 500 genes. To find GO terms that characterized the enriched gene list, we performed a Fisher’s exact test for each term’s annotation set. The background gene set was the entirety of genes tested on the chip (less those removed beforehand); the background annotation set was the remainder of the GO term annotations.
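The per-term test can be sketched in Python using only the standard library. This is an illustration under stated assumptions rather than our actual implementation: the test is computed as a one-sided (over-representation) hypergeometric tail, the term size is counted after restricting to the background gene set, and the names `fisher_enrichment_p` and `enrich` are hypothetical.

```python
from math import comb
from typing import Dict, Set


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the selected list, k: selected genes annotated to the term.
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def enrich(selected: Set[str], background: Set[str],
           term_to_genes: Dict[str, Set[str]],
           min_size: int = 15, max_size: int = 500) -> Dict[str, float]:
    """Return a p-value for every term within the size filter."""
    selected = selected & background
    results = {}
    for term, genes in term_to_genes.items():
        annotated = genes & background
        # Assumption: the 15-500 gene filter applies to background genes.
        if not min_size <= len(annotated) <= max_size:
            continue
        results[term] = fisher_enrichment_p(
            len(annotated & selected), len(selected),
            len(annotated), len(background))
    return results


# Toy example with a relaxed size filter (gene and term names are made up).
background = {f"g{i}" for i in range(10)}
toy = enrich({"g0", "g1", "g2"}, background,
             {"T1": {"g0", "g1", "g2"}, "T2": {"g5"}},
             min_size=2, max_size=5)
```

In the toy example all three selected genes carry term `T1`, so its p-value is 1/C(10, 3) = 1/120, while `T2` is dropped by the size filter.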

The software we wrote to execute this framework is available as a series of Python scripts. These scripts and all supplementary data are available in Supplementary Materials.

3 Results

To preliminarily test our framework, we selected a dataset (Sun et al., 2006) (accession: GDS1962) from the NCBI Gene Expression Omnibus (GEO). We chose this dataset for its clear relationship to the term angiogenesis (GO:0001525). GDS1962 is a record of gene expression in gliomas compared to non-tumor control tissue. Because high-grade gliomas such as glioblastomas are highly vascularized and angiogenic (Ricci-Vitiani et al., 2010) and GDS1962 contains glioblastoma as a sample subset, this dataset can be linked unambiguously to angiogenesis. Moreover, we asserted that the glioblastoma subset in particular should be enriched in genes associated with angiogenesis in our analysis. We hypothesized that as the analysis was conducted with each successive snapshot of the annotations, the p-value of the term would improve, reaching its most significant (lowest) value in 2012.

Using the latest human GO annotation file in the enrichment analysis, we did indeed observe that genes involved in angiogenesis were significantly (p-value < 0.01) enriched among genes differentially expressed in the glioblastoma subset (p-value = 3.54E-7). Repeating the analysis with successively earlier GO annotation snapshots yielded the trend shown in Figure 2 (shown in red against the trends for the Biological Process (BP) top terms in 2012). From its first significant appearance in 2007 to its most significant value in 2012, the p-value for angiogenesis decreased by nearly five orders of magnitude. In comparison, Figure 3 shows the same trend against the trends of the top ten terms from the BP sub-ontology in 2006, which was when the dataset was released. We note that the most significant terms in 2006 diverge in rank over time; of the ten, only one is present in the top ten terms in 2012 (GO:0048731: system development) (Table 1).
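Two summary quantities used in this paragraph, the year a term first becomes significant and the change in p-value expressed in orders of magnitude, can be computed from a per-year trend as in the sketch below. The function names are hypothetical and the trend values are made up for illustration; they are not the values plotted in Figure 2.

```python
from math import log10
from typing import Dict, Optional


def first_significant_year(trend: Dict[int, float],
                           alpha: float = 0.01) -> Optional[int]:
    """Earliest snapshot year whose p-value is below `alpha`, else None."""
    for year in sorted(trend):
        if trend[year] < alpha:
            return year
    return None


def orders_of_magnitude(p_start: float, p_end: float) -> float:
    """Decrease from p_start to p_end, in powers of ten."""
    return log10(p_start / p_end)


# Made-up trend: snapshot year -> p-value of the positive control term.
trend = {2004: 1.0, 2006: 0.5, 2008: 1e-3, 2010: 1e-5, 2012: 1e-7}
first = first_significant_year(trend)
drop = orders_of_magnitude(trend[first], trend[2012])
```

With these made-up values the term first becomes significant in 2008 and its p-value then falls by about four orders of magnitude.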

Notable in the trends for the top ten BP terms for 2012 is how recently many of the terms became at all significant. Eight of the top ten only became significant after 2006; one was only significant after the annotations added in 2010. We also saw that by 2012, the terms related to angiogenesis, such as “[positive] regulation of angiogenesis”, had become more significantly enriched (data in Supplementary Materials). This provided evidence that not only was the recall power of the annotation set improving, but also its precision. While the term “regulation of angiogenesis” had increased in significance from 2009-2011 (p-value: 0.36 → 0.04), it was outranked by its more specific child term “positive regulation of angiogenesis”, which had a p-value change of 0.26 → 0.02 during the same time.

Figure 2. P-Values of Angiogenesis (red) and Ten Top Terms (grey) in 2012 for GDS1962. The red line shows the change in p-value from 1.0 (2004) to 3.5E-7 (2012). The grey lines are the p-values for the top ten terms from the GO Biological Process ontology in 2012 (see Table 1). Note that eight terms only become significant after 2006. The blue line is the significance threshold (p-value < 0.01). The vertical axis is on a log scale.

Figure 3. P-Values of Angiogenesis (red) and Ten Top Terms (grey) in 2006 for GDS1962. The red line shows the change in p-value from 1.0 (2004) to 3.5E-7 (2012). The grey lines are the p-values for the top ten terms from the GO Biological Process ontology in 2006 traced through time (see Table 1). The blue line is the significance threshold (p-value < 0.01). Note the rise in p-value for some of the terms from 2006-2012 above the significance threshold. Some lines end before 2012 because the gene sets exceeded the set filter size of 500 genes. The vertical axis is on a log scale.

4 Discussion

We have described here an analytical framework to quantify GO annotation quality in the context of its most common application: gene set enrichment analysis. Using this framework to obtain the results shown in Figures 2 and 3 reveals notable trends in the significance of particular terms over time. In particular, our positive control term rose in significance (i.e., its p-value fell) for this dataset over the 9 tested years. While the other top terms in 2012 likewise rise in significance during this time, it is important to note how recently some of the terms became significant at all. Five terms that were not even present in the 2006 enrichment are part of the top ten in 2012, while only one term in the top ten from 2006 appears in the 2012 results. The disparity between these two time periods demonstrates the sensitivity of enrichment analysis to the annotation composition. It also highlights the danger of asserting the biological lack of a trait based on these methods; for enrichment analysis, it is clear that “absence of evidence is not evidence of absence.” Enrichment analyses that used old versions of the GO annotations may find significantly different results if re-examined with current annotations. The demonstrated importance of maintaining up-to-date annotations has spurred the development of tools such as GOChase (Park et al., 2005).

Table 1. Top Ten Biological Process Terms for 2006 and 2012

2006                                       2012
System development                         Synaptic transmission
Cell-cell signaling                        System development
Cell communication                         Response to interferon-γ
Microtubule-based process                  Secretion by cell
Nervous system development
Inositol lipid-mediated signaling
Phosphatidylinositol-mediated signaling
Regulation of catalytic activity           Blood coagulation
Regulation of cell cycle
Intracellular protein transport            Cellular response to interferon-γ

The p-values for 2006 range from 1.56E-11 to 2.4E-4. The p-values for 2012 range from 2.91E-26 to 1.28E-09. The terms are listed in order of rank. Terms in italics are common between both sets.

To draw broader conclusions about the efficacy of the GO annotations over time, the framework described here could be applied to any number of datasets for which positive control terms have been identified. Obtaining the trends of these terms as a group would provide a more comprehensive picture of how well the GO annotations perform in real-world applications. A comparison of these results to the authors’ conclusions at the time of publication may also reveal novel annotations that were previously unavailable. This type of large-scale analysis could be done through an automatic pipeline with a minimum of human involvement.

For an even broader approach, the framework could be modified to omit the positive control term altogether and look instead at the behavior of significant terms for each dataset. By tracking these terms over time, we can see if the most significant terms remain constant or vary greatly (as they did for this experiment). Divergent terms may reflect high annotation activity, while constant but less specific terms may indicate a dearth of annotation activity for that area.
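Tracking the most significant terms across snapshots can be sketched as follows. The sketch assumes enrichment results are stored as a mapping from snapshot year to a term-to-p-value mapping; the function names and all values are hypothetical.

```python
from typing import Dict, List, Set


def top_terms(pvalues: Dict[str, float], n: int = 10) -> List[str]:
    """The n terms with the smallest p-values, most significant first."""
    return sorted(pvalues, key=lambda t: (pvalues[t], t))[:n]


def shared_top_terms(by_year: Dict[int, Dict[str, float]],
                     year_a: int, year_b: int, n: int = 10) -> Set[str]:
    """Terms that appear in the top n of both snapshot years."""
    return set(top_terms(by_year[year_a], n)) & set(top_terms(by_year[year_b], n))


# Made-up results for two snapshots; term names are placeholders.
by_year = {
    2006: {"A": 1e-5, "B": 1e-4, "C": 0.2},
    2012: {"A": 1e-9, "C": 1e-8, "B": 0.5},
}
shared = shared_top_terms(by_year, 2006, 2012, n=2)
```

A small shared set between distant snapshots would indicate divergent terms, while a large shared set would indicate that the top terms have remained constant.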

Similar methods to the ones described here were used in an analysis of a long-term annotation initiative (Alam-Faruque et al., 2011) in which the authors examined the impact of the new annotations on standard enrichment analyses. As with our results, they found that the new annotations significantly increased the number of enriched terms, many of which were not present at all before the annotation efforts. Their results are an example of the divergent behavior we would expect from high annotation activity.

The framework described here provides a quantitative way to examine the GO annotations in the context of real-world applications. We have demonstrated how to extract trends, identify high and low activity annotation areas, and qualify the efficacy of annotations using biologically meaningful measures. This framework is flexible enough to examine all or part of the GO annotations, across multiple species, and with various enrichment methods. Additionally, this framework could be used to evaluate different annotation methods. By comparing the performance of annotations generated with a particular method to the performance of canonical annotations, we would be able to determine their relative quality.

We chose to examine the change in term composition with a preservation of the 2012 ontology structure in this paper. However, if the ontology structure were changed to mimic the GO as it appeared during that time, we could assess how changes in the structure affect performance. Furthermore, while we limited our analysis to a single dataset as a proof-of-concept, a similar analysis of many datasets would provide significant information about the progress and character of GO annotations over time.


The authors acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924 to A.I.S).


Alam-Faruque, Y. et al. (2011) The Impact of Focused Gene Ontology Curation of Specific Mammalian Systems. PLoS ONE, 6, e27541.

Buza, T.J. et al. (2008) Gene Ontology annotation quality analysis in model eukaryotes. Nucleic acids research, 36, e12.

Dolan, M.E. et al. (2005) A procedure for assessing GO annotation consistency. Bioinformatics, 21 Suppl 1, i136-43.

Khatri, P. and Drăghici, S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21, 3587-95.

Man, M.Z. et al. (2000) POWER_SAGE: comparing statistical tests for SAGE experiments. Bioinformatics, 16, 953-959.

Park, Y.R. et al. (2005) GOChase: correcting errors from Gene Ontology-based annotations for gene products. Bioinformatics, 21, 829-31.

Porzel, R. and Malaka, R. (2004) A Task-based Approach for Ontology Evaluation. Proceedings of the 16th European Conference on Artificial Intelligence, Valencia, Spain, 9-16.

Ricci-Vitiani, L. et al. (2010) Tumour vascularization via endothelial differentiation of glioblastoma stem-like cells. Nature, 468, 824-8.

Schnoes, A.M. et al. (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS computational biology, 5, e1000605.

Sun, L. et al. (2006) Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer cell, 9, 287-300.

The Gene Ontology Consortium (2010) The Gene Ontology in 2010: extensions and refinements. Nucleic acids research, 38, D331-5.



* To whom correspondence should be addressed.
