Posted on October 23, 2012 by Phillip Lord

Dizeez: an online game for human gene-disease annotation

Abstract

Structured gene annotations are a foundation upon which many bioinformatics and statistical analyses are built. However, the structured annotations available in public databases are a sparse representation of biological knowledge as a whole. The rate of biomedical data generation is such that centralized biocuration efforts struggle to keep up. New models for gene annotation need to be explored that increase the pace at which we are able to structure biomedical knowledge. Recently, online games have emerged as an effective way to recruit, engage and organize large numbers of volunteers to help address difficult biological challenges. For example, games have been successfully developed for protein folding (Foldit), multiple sequence alignment (Phylo) and RNA structure design (EteRNA). Here we present Dizeez, a simple online game built with the purpose of identifying novel gene-disease annotations, i.e. gene-disease links well-established in the literature, but not yet reflected as structured annotations in public databases. Preliminary results from game play online and at scientific conferences suggest that Dizeez is producing valid gene-disease annotations not yet present in any public database. These early results provide a basic proof of principle that online games can be successfully applied to the challenge of gene annotation.

Dizeez is available at http://genegames.org.

Authors

Salvatore Loguercio, Benjamin M. Good and Andrew I. Su*

The Scripps Research Institute, La Jolla, CA 92037, USA

1 Introduction

Using the tools of high-throughput biology, scientists can quickly identify long lists of candidate genes that differ between two experimental conditions. Structured gene annotations are essential to interpret these gene lists, and discover fundamental properties like gene function and disease relevance. Gene set enrichment, pathway modeling, and cross-genome comparisons are just a few of the analyses that depend on structured gene annotations (Huang et al., 2009). The importance of methods like these will only grow as the rate of genomic data generation increases.

However, the representation of gene annotations is quite sparse. For example, at the time of writing only 57% of human protein-coding genes have two or more human-curated GO annotations. Structured data for diseases are even less complete. These gaps are, at least in part, due to inefficiencies in the translation of scientific knowledge into structured annotations. Currently, we rely on a few large biocuration groups to translate all of the peer-reviewed literature into structured annotations. However, these centralized efforts simply cannot keep up with the rate of biomedical data generation. It has been estimated that the current manual curation processes will take far too long to complete the annotations of even just the most important model organisms (Baumgartner et al., 2007). The biocuration community itself has noted that “the exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility” (Howe et al., 2008).

Recently, “crowdsourcing” has emerged as a complementary approach that directly harnesses the collaborative efforts of large communities of people. This principle, which has been the foundation of many successful web-based applications, has also been applied to scientific challenges of massive scale. For example, the Galaxy Zoo initiative enables citizen astronomers to classify galaxies in large sets of celestial images (Lintott et al., 2008), and the Gene Wiki project engages the research community to create a gene-specific review article for every human gene (Huss et al., 2010). Similar initiatives have emerged for RNA families (Gardner et al., 2011) and biological pathways (Kelder et al., 2012).

One emerging trend among crowdsourcing initiatives is the use of games as a mechanism to attract contributors, in particular ‘games with a purpose’ (GWAPs) that collaboratively harness gamers’ time and energy for productive ends. One of the first GWAPs, called the “ESP Game”, had the ambitious goal of tagging all online images with informative keywords. It resulted in 50 million labels produced by more than 200,000 players (von Ahn and Dabbish, 2008). Similarly successful games were later developed to annotate music, text, and videos. Online games have also been shown to be an effective collaborative platform for addressing difficult biological problems. For example, the Foldit game (http://fold.it, Khatib et al., 2011) addresses a fundamental biomedical challenge: computational protein folding. It has harnessed the efforts of over 300,000 gamers to predict protein structure from primary sequence, to provide accurate structural models that led to the crystal structure of a previously intractable retroviral protease, and to design new protein folding strategies and algorithms (Good and Su, 2011). Other examples include Phylo for improving multiple sequence alignments (Kawrykow et al., 2012) and EteRNA for designing RNA structures (http://eterna.cmu.edu).

Here we introduce Dizeez, an online game aimed at identifying novel gene-disease annotations, i.e. gene-disease links well-established in the literature, but not yet reflected as structured annotations. We provide preliminary results from game play online and at scientific conferences. These data suggest that even after limited game play, novel gene-disease annotations can be mined from game playing logs.

2 Data sources and game design

“Dizeez” is a multiple-choice quiz in which the player is presented with a disease drawn from the Human Disease Ontology (Schriml et al., 2012) (the “Clue”) and a multiple-choice selector with five genes, only one of which has prior evidence linking it to the Clue disease (Figure 1). We used a set of 3,439 candidate gene-disease links mined from the Gene Wiki (Good et al., 2011) as the input data set for the Dizeez game. The game randomly selects one of these links (the “right” answer) and hides its gene among four randomly chosen genes. Whether or not the player guesses the right answer determines whether they get points in the game, but all answers are logged by the system as gene-disease “assertions”.
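The question-building and logging steps described above can be sketched roughly as follows. This is a minimal illustration and not the Dizeez implementation: the function names, the log layout, and the rule that distractors exclude any gene already linked to the Clue disease are all assumptions made for the example.

```python
import random

def build_question(links, all_genes, rng, n_choices=5):
    """Pick one known gene-disease link and hide its gene among random distractors.

    `links` is a list of (gene, disease) pairs; `all_genes` is the gene pool.
    Both structures are hypothetical stand-ins for the Gene Wiki-derived input set.
    """
    gene, disease = rng.choice(links)
    # Assumption: distractors must have no known link to the clue disease,
    # so that only one choice has prior evidence.
    linked = {g for g, d in links if d == disease}
    pool = sorted(set(all_genes) - linked)
    choices = rng.sample(pool, n_choices - 1) + [gene]
    rng.shuffle(choices)
    return {"clue": disease, "choices": choices, "answer": gene}

def record_guess(log, player_id, question, guess):
    """Log every guess as a gene-disease assertion; only the known link scores."""
    log.append((player_id, guess, question["clue"]))
    return 1 if guess == question["answer"] else 0

# Usage with a toy input set (gene symbols taken from this paper, links illustrative)
rng = random.Random(0)
toy_links = [("WRN", "Werner syndrome"), ("CRYGC", "cataract")]
toy_genes = ["WRN", "CRYGC", "DHH", "SOX8", "ITGAL", "TG", "SNAI2"]
question = build_question(toy_links, toy_genes, rng)
game_log = []
points = record_guess(game_log, "player-1", question, question["choices"][0])
```

Note that the guess is appended to the log whether or not it scores, which is what makes wrong answers available for later mining.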

Figure 1. Dizeez – main game interface.

Dizeez allows players to select a specific area of biology (for example, by disease or protein family) that best matches their expertise. A “recap” functionality at the end of the game shows supporting evidence (based on text extracted from the Gene Wiki and GeneRIFs) for each gene-disease association recorded in a game. Users can review the game log and even suggest new evidence for gene-disease associations (Figure 2).

Figure 2. Dizeez – game review.

3 Results

As mentioned above, every player Guess in the game can be interpreted as an assertion of a putative annotation linking the Clue (disease) to the Guess (gene). Candidate annotations that are independently reported by multiple players obtain the highest confidence scores, reflecting the value of independent replication. This concept of replication or ‘voting’ is used extensively to improve results in related crowdsourcing initiatives (Lintott et al., 2008; von Ahn and Dabbish, 2008). In other contexts this confidence is referred to as ‘inter-annotator agreement’ and is used to assess the quality of professional annotations (Camon et al., 2005).
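As a rough sketch of this voting idea, the snippet below counts how many players made each assertion. The log layout (player, gene, disease) is hypothetical, and counting distinct players rather than raw guesses is an assumption made here so that a single player repeating a guess does not look like independent replication.

```python
from collections import defaultdict

def count_votes(log):
    """Return {(gene, disease): number of distinct players asserting the pair}."""
    voters = defaultdict(set)
    for player_id, gene, disease in log:
        voters[(gene, disease)].add(player_id)
    return {pair: len(players) for pair, players in voters.items()}

# Toy log: two players agree on one pair, one player repeats another pair
toy_log = [
    ("p1", "WRN", "Werner syndrome"),
    ("p2", "WRN", "Werner syndrome"),
    ("p1", "DHH", "polyneuropathy"),
    ("p1", "DHH", "polyneuropathy"),
]
votes = count_votes(toy_log)
```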

We released Dizeez to the community, publicizing its existence through our lab blog, Twitter account and game play at a scientific conference. Within a few months of the game release (December 2011), 713 games had been played to completion by over 180 unique individuals; 24 registered users played 203 of the 713 games. Overall, players provided 5,282 guesses resulting in 4,585 unique gene-disease assertions. Of these 4,585 assertions, 1,492 were known in OMIM (via Human Disease Ontology cross references to OMIM), 2,188 in PharmGKB and 1,102 in Gene Wiki. A total of 1,350 gene-disease assertions were novel (including correct and incorrect guesses).

Among the gene-disease assertions provided most often by game players, we found 10 associations occurring 6 or more times. All of these associations were previously known via Gene Wiki/OMIM, hence they provided a positive control. For example, the gene WRN (“Werner syndrome, RecQ helicase-like”) was linked to the disease Werner syndrome and the gene CRYGC (“crystallin, gamma C”) was linked to cataracts.

Next, we mined Dizeez game logs for novel gene annotations, i.e. gene-disease links well established in the literature, but not yet encoded as structured annotations. We focused our analysis on the 223 gene-disease assertions that were provided by game players more than once and that were not previously found in OMIM or PharmGKB.
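The filtering step can be illustrated with a short sketch. It assumes vote counts are already available as a dictionary keyed by (gene, disease) and that known annotations (for example from OMIM or PharmGKB) have been loaded into a set of the same pairs; both structures and the function name are hypothetical.

```python
def novel_candidates(votes, known_pairs, min_votes=2):
    """Keep assertions made at least `min_votes` times that are absent from
    the known-annotation set, most frequently asserted first."""
    kept = [
        pair for pair, n in votes.items()
        if n >= min_votes and pair not in known_pairs
    ]
    # Sort by descending vote count, then alphabetically for stable output
    return sorted(kept, key=lambda pair: (-votes[pair], pair))

# Toy data: counts and the "known" set are illustrative only
toy_votes = {
    ("WRN", "Werner syndrome"): 6,
    ("DHH", "polyneuropathy"): 4,
    ("TG", "Graves disease"): 5,
    ("SOX8", "leukemia"): 1,
}
toy_known = {("WRN", "Werner syndrome")}
candidates = novel_candidates(toy_votes, toy_known)
```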

There were 19 assertions that were provided by players four or more times. We ranked them using the Normalized Medline Distance (NMD), a proxy for gene and disease co-occurrence in PubMed articles (Handcock et al., 2010), and manually evaluated the top five associations. As shown in Table 1, we could find multiple pieces of evidence for those assertions in the literature.
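The formula for NMD is not given here. The sketch below assumes it has the same form as the Normalized Google Distance, computed from article counts rather than web hits, so that lower values indicate stronger co-occurrence. The function name, argument names and counts are hypothetical and are not real MEDLINE figures.

```python
import math

def normalized_medline_distance(n_gene, n_disease, n_both, n_total):
    """Normalized-Google-Distance-style score from article counts.

    n_gene, n_disease: articles mentioning each term;
    n_both: articles mentioning both; n_total: size of the corpus.
    Assumes n_total is larger than both single-term counts.
    """
    if n_both == 0:
        return math.inf  # the terms never co-occur
    log_g, log_d = math.log(n_gene), math.log(n_disease)
    numerator = max(log_g, log_d) - math.log(n_both)
    denominator = math.log(n_total) - min(log_g, log_d)
    return numerator / denominator

def rank_by_nmd(counts, n_total):
    """Sort (gene, disease) pairs from smallest to largest distance."""
    return sorted(
        counts,
        key=lambda pair: normalized_medline_distance(*counts[pair], n_total),
    )

# Made-up counts: (n_gene, n_disease, n_both) per pair
toy_counts = {
    ("GENE_A", "disease X"): (100, 1000, 10),
    ("GENE_B", "disease X"): (100, 1000, 50),
}
ranking = rank_by_nmd(toy_counts, n_total=1_000_000)
```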

Table 1. Top five gene-disease associations provided four or more times in Dizeez and not found in OMIM/PharmGKB.

| Gene Symbol | Disease name | DO ID | PubMed support (PMID) |
|---|---|---|---|
| DHH | Polyneuropathy | 1389 | 11017805, 15356051 |
| SOX8 | Mental retardation | 1059 | 18076105, 10662550 |
| ITGAL | Leukocyte adhesion deficiency | 6612 | 11753075, 7628754 |
| TG | Graves disease | 12361 | 18385936, 22517745 |
| SNAI2 | Breast carcinoma | 3459 | 22393463, 22151997 |

We further filtered the results, removing all gene-disease links already annotated in the Gene Wiki and known genes in OMIM. We could still find 25 assertions provided by players more than once (Supplementary file available at http://goo.gl/H73c8). As before, we ranked them via NMD and manually evaluated the top five associations. Again, as shown in Table 2, we could find evidence for these assertions in the literature. Interestingly, almost all the publications listed in Tables 1 and 2 are very recent (2010 or later).

Table 2. Top five gene-disease associations provided two or more times in Dizeez and not found in Gene Wiki/OMIM/PharmGKB.

| Gene Symbol | Disease name | DO ID | NMD | PubMed support (PMID) |
|---|---|---|---|---|
| ABCB5 | Acute myeloid leukemia | 9119 | 0.75 | 22044138, 19477512, 19394083 |
| SOX8 | Leukemia | 1059 | 0.83 | 21183939, 20672360 |
| ITGAL | Carcinoma | 305 | 0.83 | 21940900, 20660615 |
| TG | Retinoblastoma | 768 | 0.87 | 15172750 |
| BCL10 | Melanoma | 1909 | 0.89 | 22280162, 22094256 |

Finally, we analyzed the player data available for 24 registered users. Notably, almost half of the registered users provided ten or more guesses with overall accuracy higher than 30%. Perhaps not surprisingly, overall accuracy seems to correlate with the amount of time spent on each association (Figure 3). This observation reveals a useful filtering metric for downstream data mining.

Figure 3. Overall accuracy vs. average number of questions answered per round of the game, for 24 registered Dizeez players.

4 Discussion and future work

A common concern raised against any form of crowdsourcing in a scientific context is that the ‘crowd’ will not produce high quality data. While it may be true that the average participant in these systems – whether as a player of Dizeez or an editor of the Gene Wiki – may not contribute data of equal quality to that produced by a trained professional, the aggregated labor of many participants can produce useful, high quality resources (though see Good and Su, 2011, for examples of expert player identification through games). This step of aggregation, in which contributions from multiple diverse sources are filtered and combined, distinguishes crowdsourcing efforts from traditional, professional systems that assume each individual contribution is correct from the outset.

The early results from Dizeez show two key things: 1) a very simple online game can produce a large number of gene-disease associations in a short amount of time and 2) a simple voting system can easily and reliably identify the high quality gene-disease associations within the set contributed by the game players. In future work, we intend to refine the aggregation system by weighting the ‘votes’ from different players based on their ability to reproduce known gene-disease associations during game play.
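One way the proposed weighting might look is sketched below. This is an illustration of the idea only, not a described or implemented method: each player's weight is taken to be the fraction of their guesses that reproduce a known association, and an assertion's score is the sum of the weights of the distinct players who made it. The log layout and function names are hypothetical.

```python
from collections import defaultdict

def player_weights(log, known_pairs):
    """Weight each player by the share of their guesses found in known_pairs."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for player_id, gene, disease in log:
        total[player_id] += 1
        if (gene, disease) in known_pairs:
            hits[player_id] += 1
    return {player: hits[player] / total[player] for player in total}

def weighted_scores(log, weights):
    """Sum player weights per assertion, counting each player once per pair."""
    seen = set()
    scores = defaultdict(float)
    for player_id, gene, disease in log:
        key = (player_id, gene, disease)
        if key in seen:
            continue  # ignore repeated guesses by the same player
        seen.add(key)
        scores[(gene, disease)] += weights.get(player_id, 0.0)
    return dict(scores)

# Toy log: player "a" reproduces one known link, player "b" none
toy_log = [
    ("a", "WRN", "Werner syndrome"),
    ("a", "DHH", "polyneuropathy"),
    ("b", "DHH", "polyneuropathy"),
    ("b", "TG", "cataract"),
]
toy_known = {("WRN", "Werner syndrome")}
weights = player_weights(toy_log, toy_known)
scores = weighted_scores(toy_log, weights)
```

With this scheme an assertion backed only by players who rarely reproduce known links scores close to zero, however many times it is repeated.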

In addition, we introduced a ‘game review’ functionality with the purpose of connecting players with published information about genes and diseases. This highlights the potential educational aspects of Dizeez and related games.

One fundamental weakness of Dizeez is that players are “punished” when they add potentially novel associations: because points are awarded only for the previously known answer, there is no reward when a player adds a novel, true annotation. To better align the game mechanic with contributions of novel annotations, we are in the process of building a new game, called ‘GenESP’, with a two-player design modeled after von Ahn’s ESP game. In GenESP, two players are paired anonymously, are both shown the same disease (the “Clue”) and submit genes related to that disease as Guesses. When the players match on a Guess, points are awarded and the game moves on to the next disease. In this way, the game rewards players for submitting gene-disease associations that can be validated by their partner and the reward system is not tied to any pre-existing database.

Figure 4. GenESP game interface.

As GenESP players win more points, the Clues will test increasingly specialized knowledge. When specific gene-disease annotations are well established and validated, that gene can be added to a “Taboo” list for that disease, prohibiting players from using that gene to match (Figure 4). Even though each player always appears to be playing in real time with another user, a realistic playing partner “script” can be retrieved from a past game with the same clue, allowing two people to play together asynchronously.
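The matching and Taboo rules could be sketched along these lines. GenESP is described as still under construction, so everything here (function name, point value, turn order, data layout) is a hypothetical illustration. The partner's guesses may equally come from a live player or from a stored script of a past game with the same Clue.

```python
from itertools import zip_longest

def play_round(clue, guesses_a, guesses_b, taboo=None, points=10):
    """Return (matched_gene, points) for the first non-taboo gene that both
    players have submitted, or (None, 0) if they never match."""
    banned = set(taboo.get(clue, ())) if taboo else set()
    seen_a, seen_b = set(), set()
    # Walk through both guess lists in submission order, alternating players
    for guess_a, guess_b in zip_longest(guesses_a, guesses_b):
        for guess, mine, theirs in ((guess_a, seen_a, seen_b),
                                    (guess_b, seen_b, seen_a)):
            if guess is None or guess in banned:
                continue  # list exhausted, or gene is on the Taboo list
            mine.add(guess)
            if guess in theirs:
                return guess, points
    return None, 0

# Illustrative guesses only; they are not claims about real associations
live_player = ["SOX8", "TG"]
replayed_script = ["TG", "DHH"]  # partner guesses retrieved from a past game
open_result = play_round("Graves disease", live_player, replayed_script)
taboo_result = play_round("Graves disease", live_player, replayed_script,
                          taboo={"Graves disease": {"TG"}})
```

Because points depend only on agreement between the two guess lists, no reference database is consulted during scoring.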

5 Conclusion

The results from Dizeez provide evidence that online games can be used to help address the growing challenge of structured gene annotation. Through the game, we identified several novel gene-disease annotations that are well established in the literature, but not reflected in any public database. While the individual results presented here must be considered preliminary due to the small scale of this proof-of-concept experiment, they do hint at the tremendous potential of games for crowdsourcing annotation tasks in biology.

Acknowledgements

The authors acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924 to A.I.S.).

References

Baumgartner W.A. et al. (2007) Manual curation is not sufficient for annotation of genomic databases. Bioinformatics, 23, 13, i41-i48.

Huang D.W. et al. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res, 37, 1-13.

Howe D. et al. (2008) Big data: The future of biocuration. Nature, 455, 47-50.

Lintott C.J. et al. (2008) Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Month. Not. Royal Astron. Soc., 389, 1179-89.

Huss J.W. et al. (2010) The Gene Wiki: community intelligence applied to human gene annotation. Nucleic Acids Res, 38, D633-9.

Gardner P.P. et al. (2011) Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res, 39, D141-5.

Kelder T. et al. (2012) WikiPathways: building research communities on biological pathways. Nucleic Acids Res, 40, D1301-7.

von Ahn L. and Dabbish L. (2008) Designing games with a purpose. Commun ACM, 51, 58-67.

Good B.M. and Su A.I. (2011) Games with a scientific purpose. Genome Biol, 12, 135.

Kawrykow A. et al. (2012) Phylo: a citizen science approach for improving multiple sequence alignment. PLoS One, 7, e31362.

Khatib F. et al. (2011) Algorithm discovery by protein folding game players. PNAS, 108, 47, 18949-53.

Schriml L.M. et al. (2012) Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res, 40, D940-6.

Good B.M. et al. (2011) Mining the Gene Wiki for functional genomic knowledge. BMC Genomics, 12, 603.

Osborne J. et al. (2009) Annotating the human genome with Disease Ontology. BMC Genomics, 10, S6.

Handcock J. et al. (2010) mspecLINE: bridging knowledge of human disease with the proteome. BMC Med. Genomics, 3, 7.

Camon E. et al. (2005) An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics, 6, S17.

Footnotes

* To whom correspondence should be addressed.
