Linking genes to diseases with a SNPedia-Gene Wiki mashup
A variety of topic-focused wikis are used in the biomedical sciences to enable the mass-collaborative synthesis and distribution of diverse bodies of knowledge. To address complex problems such as defining the relationships between genes and disease, it is important to bring the knowledge from many different domains together. Here we show how advances in wiki technology can be used to automatically assemble ‘meta-wikis’ that present integrated views over the data collaboratively created in multiple source wikis. In particular, we introduce a meta-wiki formed from the Gene Wiki and SNPedia whose purpose is to identify connections between genes and diseases. (Supplementary data is available at http://goo.gl/3VYhj).
Benjamin M. Good1+, Salvatore Loguercio2+, Andrew I. Su1*
1Genomics Institute of the Novartis Research Foundation, 10675 John Jay Hopkins Drive, San Diego, CA, 92121. 2 Technische Universität Dresden, Biotechnology Center, Tatzberg 47/49, 01307 Dresden
One of the key goals of current biomedical research is the elucidation of the relationships that hold between genes, environment and disease. Tackling this complex challenge requires the coordination of information emerging from a variety of scientific and medical communities. Increasingly, wiki technology is being used to enable such communities to collaboratively synthesize and distribute their knowledge. These ‘bio-wikis’ are emerging in many different areas (Callaway, 2010; Waldrop, 2008). There are wikis about genes (Hoffmann, 2008; Huss, et al., 2010; Huss, et al., 2008), proteins (Mons, et al., 2008), protein structures (Stehr, et al., 2010; Weekes, et al., 2010), SNPs (http://www.snpedia.com), pathways (Florez, et al., 2009; Pico, et al., 2008), specific organisms (Florez, et al., 2009) and many other biological entities. These bio-wikis have become important concept-centric knowledge resources, but no single wiki contains all of the knowledge needed to answer most biological questions. The task of integrating the knowledge across different wikis remains the job of the end user. Recently, three key factors have emerged that make it possible to dynamically produce ‘meta-wikis’ that provide end users with consolidated views of information spanning multiple underlying wikis.
The first factor is the widespread adoption of the MediaWiki software within the bio-wiki community. MediaWiki installations now provide a powerful web API (Application Programming Interface) for direct, high-level access to the data contained in their databases. The API uses RESTful calls (Fielding, 2000) to permit automated processes to make queries and post changes. Since many of the bio-wikis now expose the same API by default, the same software can be used to query and edit the content of many different wikis without alteration.
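As a minimal sketch of what "the same software" can mean in practice, the function below builds a read request using the standard MediaWiki API parameters (action=query, prop=revisions, rvprop=content). Only the endpoint changes between wikis. The SNPedia endpoint shown is a placeholder, and the page titles are illustrative assumptions rather than the titles used in this work.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_page_content_url(api_endpoint: str, title: str) -> str:
    """Return a GET URL asking a MediaWiki API for the wikitext of one page."""
    params = {
        "action": "query",    # read-only query module
        "prop": "revisions",  # request revision data for the page
        "rvprop": "content",  # include the wikitext of the latest revision
        "titles": title,
        "format": "json",
    }
    return api_endpoint + "?" + urlencode(params)


WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
SNPEDIA_API = "https://example.org/snpedia/api.php"  # placeholder endpoint (assumption)

# The same function serves both source wikis; no network call is made here.
gene_url = build_page_content_url(WIKIPEDIA_API, "CYSLTR1")   # illustrative title
snp_url = build_page_content_url(SNPEDIA_API, "Rs320995")     # illustrative title
```

The resulting URLs differ only in host and page title, which is what allows one client to read from several wikis.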
The second factor is the increasing adoption of standardized systems for describing and recognizing biological concepts across multiple sites. Such systems provide identifiers for genes (e.g. NCBI Gene ids) and other biological concepts (e.g. Gene Ontology terms). These shared names can be used to identify when two different wikis contain knowledge pertinent to the same things and hence provide the key starting point for integration.
The third factor is the Semantic MediaWiki extension to the MediaWiki system (Krotzsch, et al., 2007). By installing this extension, wiki administrators make it possible to add semantic links between articles in the wiki – for example, GeneX hasSNP snpY. These semantic links can then be used in queries to the system, much like queries to a database.
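To illustrate what such a semantic link looks like in wikitext, the sketch below renders annotations in the Semantic MediaWiki form [[property::value]] and a simple inline #ask query. The property names "HasSNP" and "Is associated with" are assumed from the SPARQL query shown later in this article; the helper functions themselves are hypothetical.

```python
def semantic_link(prop: str, value: str) -> str:
    """Render one Semantic MediaWiki annotation: [[property::value]]."""
    return f"[[{prop}::{value}]]"


def gene_annotations(snps, diseases) -> str:
    """Wikitext lines linking a gene page to its SNPs and diseases."""
    links = [semantic_link("HasSNP", snp) for snp in snps]
    links += [semantic_link("Is associated with", d) for d in diseases]
    return "\n".join(links)


def ask_pages_with(prop: str, value: str) -> str:
    """Inline query listing every page that carries the given annotation."""
    return "{{#ask: " + semantic_link(prop, value) + " }}"


# Annotations that could be appended to the CYSLTR1 gene article.
wikitext = gene_annotations(["Rs320995"], ["Asthma"])
query = ask_pages_with("Is associated with", "Asthma")
```

Placed on a gene page, the first string asserts the links; placed on any page, the second lists all pages annotated as associated with asthma.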
The combination of a consistent API across many different wikis, a growing collection of unifying ontologies and the Semantic extension enables the rapid formation of wiki-mashups or ‘meta-wikis’. Such meta-wikis offer the potential to produce integrated views of the knowledge dispersed across many different sources. Here we show how an automatically generated meta-wiki composed of elements drawn from SNPedia and the Gene Wiki exposes substantially more evidence of links between genes and diseases than either resource contains independently.
SNPedia provides textual information about links between variations in human genes and human phenotypes (http://www.snpedia.com). It uses standard identifiers from trusted authorities – primarily dbSNP (Sayers, et al., 2011) – to enable extensive linking to other public bioinformatics databases and to personal genomics companies like 23andMe (http://www.23andme.com). It is not a comprehensive listing of SNPs; rather, it focuses on SNPs that have some evidence of association with a human phenotype.
The Gene Wiki is an attempt to generate a collaboratively written, continuously updated review article for every human gene (Huss, et al., 2008). It provides textual descriptions of gene function in normal conditions as well as descriptions of the role the gene may play in disease. Currently, it includes more than 10,300 Wikipedia articles about human genes.
Fig. 1. Meta-wiki assembly process. (1a, 1b) Article content is obtained from the source wikis using GET calls to their MediaWiki APIs, and written to the target wiki (2a, 2b) via POST calls to its MediaWiki API. In parallel, the NCBO Annotator is used to identify Disease Ontology terms in the text of the Gene Wiki articles and to map medical conditions in SNPedia to Disease Ontology terms. The content at the target wiki is then enhanced with the Disease Ontology associations generated using the Annotator (3a, 3b).
SNPedia + Gene Wiki
Bringing information from both the Gene Wiki and SNPedia together into one consistent framework allows us to better address the following important question:
“Based on what we know now, what genes are linked to which diseases?”
It is important to note that no official database has yet been established for structuring and curating such information. The closest example is Online Mendelian Inheritance in Man (OMIM), but the tools that OMIM provides offer no way to answer this question other than searching, one by one, through thousands of textual entries. Other groups have attempted to build such resources through text-mining, e.g. GeneRIFs (Osborne, et al., 2009), but none has yet emerged as a standard reference.
In the protocol illustrated in Figure 1, we describe how to automatically construct a semantic wiki instance suitable for exploring the relationship between genes and disease both by browsing and through structured queries. The resultant meta-wiki contains semantic relations linking genes to diseases, genes to SNPs, and SNPs to diseases. The steps to build this meta-wiki are as follows:
1. Install the MediaWiki software with the Semantic extension as the meta-wiki platform;
2. Utilize the MediaWiki API at the source wikis to pull the articles of the Gene Wiki from Wikipedia and SNP articles related to human genes from SNPedia. Insert them into the mashup using the WriteAPI at the meta-wiki;
3. Identify Disease Ontology terms in the text of Gene Wiki articles using the NCBO Annotator (Jonquet, et al., 2009);
4. Identify SNP-Disease relationships in SNPedia:
   a. Using the SNPedia query API, identify all SNPs with a wikilink directed to or from an article in the SNPedia category ‘medical condition’.
   b. Map SNPedia medical conditions to Disease Ontology terms using the NCBO Annotator.
5. Add semantic links to the mashup between:
   a. Gene Wiki genes and SNPs (using NCBI Gene Ids as the naming standard);
   b. genes and diseases discovered in step 3;
   c. SNPs and diseases discovered in step 4.
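The way the links added in the final step combine into gene-disease relationships can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the meta-wiki: the dictionaries stand in for the outputs of the earlier steps, the CYSLTR1 entries are taken from the example discussed in this article, and every name beginning with "HYPOTHETICAL" is a made-up placeholder.

```python
# Gene-SNP links (keyed by a shared gene identifier in the real pipeline).
gene_to_snps = {
    "CYSLTR1": {"Rs320995"},
    "HYPOTHETICAL_GENE_B": {"HYPOTHETICAL_SNP_B"},  # placeholder
}

# SNP-disease links found through SNPedia wikilinks.
snp_to_diseases = {
    "Rs320995": {"asthma"},
    "HYPOTHETICAL_SNP_B": {"HYPOTHETICAL_DISEASE_B"},  # placeholder
}

# Direct gene-disease links found in Gene Wiki article text.
gene_wiki_pairs = {
    ("CYSLTR1", "asthma"),
    ("HYPOTHETICAL_GENE_C", "asthma"),  # placeholder
}


def gene_disease_via_snps(gene_snps, snp_diseases):
    """Follow gene -> SNP -> disease links and collect (gene, disease) pairs."""
    pairs = set()
    for gene, snps in gene_snps.items():
        for snp in snps:
            for disease in snp_diseases.get(snp, ()):
                pairs.add((gene, disease))
    return pairs


snpedia_pairs = gene_disease_via_snps(gene_to_snps, snp_to_diseases)
all_pairs = snpedia_pairs | gene_wiki_pairs      # everything the meta-wiki exposes
shared_pairs = snpedia_pairs & gene_wiki_pairs   # supported by both sources
```

In this toy data only the CYSLTR1-asthma pair is supported by both sources, while the union contains three pairs.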
Overall, the SNPedia/Gene Wiki meta-wiki captures 4,426 distinct gene-disease relationships. As illustrated in Figure 2, SNPedia accounts for 1,037 (via gene-SNP-disease connections), the Gene Wiki accounts for 3,525 (via direct gene-disease associations) and only 136 (3%) of the gene-disease pairs appear independently in both sources. The 136 gene-disease pairs in the overlap contain 47 distinct diseases and 125 distinct genes linked to 271 SNPs. For example, the gene CYSLTR1 is linked to asthma in the text of the Gene Wiki: “The cysteinyl leukotrienes […] are important mediators of human bronchial asthma” and in the text of the SNPedia article on SNP Rs320995 (which occurs in CYSLTR1): “subjects without T-allele in SNP rs320995 had 3.1 times higher risk of asthma”.
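The totals reported above are consistent with simple inclusion-exclusion over the two sets of gene-disease pairs, as the short check below shows. The variable names are ours; the numbers are those given in the text.

```python
snpedia_pairs = 1037    # gene-disease pairs via gene-SNP-disease connections
gene_wiki_pairs = 3525  # gene-disease pairs mined from Gene Wiki text
shared_pairs = 136      # pairs found independently in both sources

# |A union B| = |A| + |B| - |A intersect B|
union_pairs = snpedia_pairs + gene_wiki_pairs - shared_pairs

# Share of all distinct pairs that both sources support, in whole percent.
overlap_percent = round(100 * shared_pairs / union_pairs)
```

Here union_pairs evaluates to 4426 and overlap_percent to 3, matching the figures reported.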
Fig. 2. Overlap of gene-disease associations derived from SNPedia and from the Gene Wiki.
As Figure 2 clearly illustrates, both the Gene Wiki and SNPedia contain substantial amounts of knowledge pertinent to the challenge of finding associations between genes and diseases. The low level of overlap between the gene-disease associations found in these resources indicates the potential value of their combination.
RDF and Semantic MediaWiki
One of the key advantages of the Semantic MediaWiki framework is its ability to generate structured exports of the knowledge it contains that adhere to the Resource Description Framework (RDF) standard. This makes it possible to analyze the data with the growing collection of tools built on this standard. For example, the gene-disease pairs in the overlap mentioned above can be identified with the following query in SPARQL, the standard query language for RDF:
PREFIX wiki: <http://126.96.36.199/mediawiki/index.php/Special:URIResolver/>
SELECT ?gene ?disease ?do_term ?snp
WHERE {
  ?gene wiki:Property-Is-associated-with ?disease .
  ?gene wiki:Property-HasSNP ?snp .
  ?snp wiki:Property-Is-associated-with ?disease .
  ?disease wiki:Property-Same-as ?do_term .
  FILTER regex(?do_term, "^DOID", "i") .
}
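To make explicit what the query above matches, the following sketch evaluates the same four triple patterns and the case-insensitive DOID filter over a small in-memory set of triples. It is a plain-Python illustration, not an RDF engine. The triples are toy data: the CYSLTR1 entries follow the example in this article, while the DOID value and the "HYPOTHETICAL" gene are placeholders.

```python
import re

# (subject, property, object) triples; property names mirror the query.
triples = {
    ("CYSLTR1", "Is associated with", "Asthma"),
    ("CYSLTR1", "HasSNP", "Rs320995"),
    ("Rs320995", "Is associated with", "Asthma"),
    ("Asthma", "Same as", "DOID:0000000"),  # placeholder identifier
    # Linked to the disease in text only, with no SNP evidence: excluded.
    ("HYPOTHETICAL_GENE_C", "Is associated with", "Asthma"),
}


def objects(subject, prop):
    """All objects o such that (subject, prop, o) is in the triple set."""
    return {o for s, p, o in triples if s == subject and p == prop}


def overlap_rows():
    """Rows (gene, disease, do_term, snp) satisfying every query pattern."""
    rows = []
    for gene, prop, disease in triples:
        if prop != "Is associated with":
            continue
        for snp in objects(gene, "HasSNP"):
            # The SNP must be associated with the same disease as the gene.
            if disease not in objects(snp, "Is associated with"):
                continue
            for do_term in objects(disease, "Same as"):
                if re.search(r"^DOID", do_term, re.IGNORECASE):
                    rows.append((gene, disease, do_term, snp))
    return sorted(rows)


result = overlap_rows()
```

Only the CYSLTR1 row survives, because it is the only gene whose disease link is also reached through one of its SNPs.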
Enhancements to the user experience
While the aggregation of data from multiple sources in a queryable, structured form is useful for computational scientists, few ‘end-user’ biologists can be expected to enter SPARQL queries or even queries in the Semantic MediaWiki syntax. For the majority of users, the value of a meta-wiki such as this lies in the direct improvements to the individual articles that they will discover while browsing. Hence we made two specific additions to the visible areas of the meta-wiki articles. First, we added a ‘known variants’ table to all the gene articles. This table presents SNPs related to the gene described in the article, together with phenotypes related to those SNPs, drawn from the data gathered from SNPedia. Figure 3 shows the known variants table for the ACCN1 gene. The table materializes a connection between the gene and Multiple Sclerosis (Bernardinelli, et al., 2007) that was missing from the original ACCN1 article.
Fig. 3. Example of the ‘known variants’ tables added to the Gene Wiki articles from data collected from SNPedia. Here showing a SNP on the ACCN1 gene linked to Multiple Sclerosis.
In addition to the enhancements to the gene articles, we added a ‘related genes and SNPs’ table to the disease articles (brought in from Wikipedia as part of the Gene Wiki import). This table presents genes and SNPs that are linked to the disease either in the text of a Gene Wiki article or through genetic associations found in SNPedia. Figure 4 shows how the article on Bipolar Disorder has been expanded with a section detailing related genes as well as related SNPs on these genes.
Fig. 4. Example of the ‘related genes and SNPs’ boxes added to the disease articles from data collected from both SNPedia and the Gene Wiki. Here showing some of the genes and SNPs linked to Bipolar Disorder.
The low amount of overlap between the gene-disease relationships found in the Gene Wiki and the gene-SNP-disease relationships from SNPedia is likely the result of differences in both the protocol used to mine them and the content itself. It is possible that we would obtain a higher amount of overlap if we used the same procedure to find SNP-disease associations as we did to find direct gene-disease associations, and this would be a useful experiment to conduct in future work. However, based on our inspections of the data, the more important driver of the low overlap appears to be the basic differences in the core content of SNPedia and the Gene Wiki. There are many reasons why a particular gene might be associated with a disease in a Gene Wiki article that do not implicate a particular SNP. For example, genes may be involved in pathways known to be important to disease pathogenesis or to the body’s immune response while there may not be any known SNPs associating that gene with that disease.
One of the weaknesses of the approach used to build this meta-wiki is that it represents a one-way sync. If editors make changes to the articles in the meta-wiki, there is currently no automated mechanism for migrating those changes back to the articles in the original wikis. While one option is to let these meta-wikis evolve independently of their parents, a better approach might be to establish mechanisms through which edits made to a meta-wiki article could flow back into the articles used to create it. Such a mechanism would effectively extend the reach of the source wikis – both in terms of exposing their contents and of acquiring more editors. There are tools emerging that will make this possible. For example, the Distributed Semantic MediaWiki system is an extension that enables the creation of a network of Semantic MediaWiki servers that share common semantic wiki pages (Skaf-Molli, et al., 2010). With such a system in place, we might imagine that meta-wikis like the one discussed here could serve not only as new integrated resources for consuming information but also as new points for users to contribute information back to the community collection.
We have demonstrated how a high-level linking of genes and diseases can be accomplished through the meta-wiki approach, but we have not touched on the deeper, more difficult question of how these genes are linked to these diseases. To address this complex challenge, the work of thousands of specialists needs to be assembled into integrated wholes that can be understood and used to drive action. The topic-focused wikis emerging in different areas of biology represent one step of this process of collaborative knowledge synthesis. Looking forward, meta-wikis such as the one presented here offer the potential to go one step further – to help unearth and present the latent relationships that exist between different concepts and different communities.
Thanks to Mike Cariaso for suggesting how to extract SNP-disease relationships from the hyperlinks in SNPedia.
This work was supported by NIGMS (GM083924).
Bernardinelli, L., et al. (2007) Association between the ACCN1 gene and multiple sclerosis in Central East Sardinia, PLoS One, 2, e480.
Callaway, E. (2010) No rest for the bio-wikis, Nature, 468, 359-360.
Fielding, R. (2000) Architectural Styles and the Design of Network-based Software Architectures. Doctoral dissertation, University of California, Irvine.
Florez, L.A., et al. (2009) A community-curated consensual annotation that is continuously updated: the Bacillus subtilis centred wiki SubtiWiki, Database (Oxford), 2009, bap012.
Hoffmann, R. (2008) A wiki for the life sciences where authorship matters, Nat Genet, 40, 1047-1051.
Huss, J.W., 3rd, et al. (2010) The Gene Wiki: community intelligence applied to human gene annotation, Nucleic Acids Res, 38, D633-639.
Huss, J.W., 3rd, et al. (2008) A gene wiki for community annotation of gene function, PLoS Biol, 6, e175.
Krotzsch, M., et al. (2007) Semantic Wikipedia, Journal of Web Semantics, 5, 251-261.
Mons, B., et al. (2008) Calling on a million minds for community annotation in WikiProteins, Genome Biol, 9, R89.
Osborne, J.D., et al. (2009) Annotating the human genome with Disease Ontology, BMC Genomics, 10 Suppl 1, S6.
Pico, A.R., et al. (2008) WikiPathways: pathway editing for the people, PLoS Biol, 6, e184.
Sayers, E.W., et al. (2011) Database resources of the National Center for Biotechnology Information, Nucleic Acids Res, 39, D38-51.
Skaf-Molli, H., Canals, G. and Molli, P. (2010) DSMW: Distributed Semantic MediaWiki, Proceedings of ESWC 2010, 2, 426-430.
Stehr, H., et al. (2010) PDBWiki: added value through community annotation of the Protein Data Bank, Database (Oxford), 2010, baq009.
Waldrop, M. (2008) Big data: Wikiomics, Nature, 455, 22-25.
Weekes, D., et al. (2010) TOPSAN: a collaborative annotation environment for structural genomics, BMC Bioinformatics, 11, 426.