Motivation: We present a novel visualisation approach to generate a karyotype-variation dataset derived from the NHGRI GWAS catalogue diagram using an OWL knowledge base. Use of OWL allows dynamic querying of the previously static diagram, and annotation of the data with ontology terms supports structured queries for classes of phenotype, disease and sample treatment.
Availability: A beta version of the diagram showing a subset of studies is available at
Danielle Welter1,*, Tony Burdett1, Lucia Hindorff2, Heather Junkins2, Simon Jupp1, Jackie MacArthur1 and Helen Parkinson1
1EMBL-EBI, Wellcome Trust Genome Campus, Hinxton
2Office of Population Genomics, National Human Genome Research Institute, NIH, Bethesda
Genome-wide association studies (GWAS) aim to identify the common genetic variants associated with complex diseases and traits by testing a minimum of hundreds of thousands of single-nucleotide polymorphisms (SNPs) in large population samples (Manolio, 2010). Genotyped SNPs mark regions of linkage disequilibrium (LD) within which a causative variant of a disease may be found although they are not necessarily the causative variant. Unlike single-gene disorders, complex diseases are the result of a combination of genetic factors, each of which increases susceptibility to the condition, rather than of one causative variant. As many trait-associated SNPs are not necessarily part of the known central mechanisms of a condition, they can be particularly difficult to elucidate. By adopting a non-candidate-driven whole-genome approach in large case/control sets, it is possible to overcome the constraints of candidate-gene driven studies (Manolio, 2010).
The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies, or GWAS catalogue, provides a quality-controlled, manually curated, literature-derived collection of all published GWA studies assaying at least 100,000 SNPs and all SNP-trait associations with p-values < 1.0 × 10⁻⁵ (Hindorff et al., 2009). Publicly available at http://www.genome.gov/gwastudies, the catalogue can be searched by a number of options including journal, first author, trait, chromosomal region, gene, SNP, odds ratio and p-value threshold. As of 06/04/12, it includes 1233 publications and 6186 SNPs. In addition to the SNP-trait association data, the catalogue also publishes the iconic GWAS diagram (Fig. 1) of all SNP-trait associations mapped to the SNPs’ chromosomal locations. The diagram is released quarterly, and the latest version is made available on the GWAS catalogue website in PDF format and as a PowerPoint slide.
At present, the curation of the GWAS catalogue is a manually intensive process. Individual SNP-trait associations must first be manually mapped to an internal MeSH-derived (Rogers, 1963) controlled vocabulary of trait descriptions. Then, in order to produce the diagram, the known associations are filtered by p-value; the level of specificity needed to define a trait as unique is determined by expert screening (for example, Type 1 diabetes, Type 2 diabetes and other related traits may be collapsed into “diabetes”); and redundancies and duplications are removed from a spreadsheet by a manual screening process. Finally, the remaining curated associations are placed on the diagram by hand by a graphic artist. The SNP-disease associations are not distributed evenly across the genome, posing the challenge of displaying data points efficiently at chromosomal regions which are particularly dense, such as the MHC. In total, this quarterly process takes several weeks and a substantial number of (expert) person-hours.
Due to the considerable increase in SNP-trait associations since the first version of the diagram and the number of different phenotypes in the colour-coded legend, it is currently almost impossible to identify traits by visually analysing the diagram, though comparison with previous versions provides an elegant visual representation of the scale and complexity of the data. As the diagram is only produced in PDF and PowerPoint format, it is static and cannot be clicked, searched or filtered. Although updated versions of the diagram illustrate the evolution of GWAS associations over successive quarters, it is currently not possible to automatically produce customised versions showing only certain traits, a request often received by the GWAS catalogue team.
Figure 1: An extract from the GWAS diagram. The full diagram, which is too large to reproduce here, can be found at http://www.genome.gov/gwastudies/. Traits, represented by coloured circles, are mapped to the cytogenetic band in which their associated SNP is located.
We present a novel approach for automating the creation of the GWAS diagram using an ontology and scalable vector graphics (SVG), an XML-based language for describing geometric objects. This makes it possible to create an up-to-date, dynamic diagram that can be filtered and searched at different levels of granularity and by different criteria, including trait, chromosomal region and time. It is possible to zoom in on chromosomes in order to allow users to see all SNP-trait associations for a given region. SNP-trait associations are also interactive, providing summary information on mouse-over as well as being clickable to allow the user to proceed from an association to the catalogue entry and to the publication.
As part of the wider “GWAS Ontology and Curation Infrastructure” (GOCI) project, the phenotypic traits in the GWAS catalogue, which until recently were available only as an unstructured flat list partially mapped to MeSH, were integrated into the Experimental Factor Ontology (EFO) (Malone et al., 2010). Representing GWAS traits in an expressive knowledge representation language like OWL will allow for much richer queries over the GWAS catalogue. Choosing an established ontology like EFO assures the long-term maintenance of these terms and also provides the potential for future integration of the GWAS catalogue with other resources already consuming EFO.
EFO is an application ontology that utilises multiple reference ontologies such as Phenotype, Attribute and Trait Ontology (PATO) (Gkoutos et al., 2005) and Chemical Entities of Biological Interest (ChEBI) (de Matos et al., 2009) to produce a single ontology which can be viewed under one hierarchy. Much of the coverage provided by EFO meets the needs of the GWAS catalogue. Current GWAS traits range from simple disease descriptions such as “breast cancer” to measurements such as “waist-hip ratio” or “liver enzyme levels” to very complex traits like “body mass in chronic obstructive pulmonary disease”, and trait definitions are often context-dependent. EFO not only contains disease categories (as MeSH does), but also phenotypic descriptions, compound treatments, and so on.
At the start of the integration process, around 20% of all GWAS traits were already described in EFO. Full coverage of existing GWAS traits is expected to be achieved by September 2012. Support for adding further traits to EFO in the future will be provided. This ties in with the wider strategy of ongoing development in EFO, which includes extension and maintenance tooling based on the MIREOT (Courtot et al., 2009) guidelines.
The GWAS traits form only a subset of the total number of terms covered by EFO. In order to identify this subset within EFO, a basic OWL annotation mechanism is used. We standardise the way application-specific views are encoded in EFO using the Simple Knowledge Organization System (SKOS) (www.w3.org/2004/02/skos) (Jupp et al., 2012). This GWAS-centric view will be used to construct user interface components that will support users in both querying and navigating the GWAS catalogue and diagram. The GWAS view is available for download from BioPortal (Noy et al., 2009) at bioportal.bioontology.org/ontologies/2099.
In addition to EFO, we created a GWAS schema ontology (viewable at wwwdev.ebi.ac.uk/fgpt/gwas/ontology/gwas-diagram.owl) that models the relationships between GWAS catalogue concepts to be used in the diagram such as “SNP”, “trait” and “chromosome” as well as importing all of EFO. Our GOCI DataPublisher package, which was specifically designed for this purpose, pulls all published information from the GWAS catalogue relational database and converts each data element into an OWL individual of the appropriate ontology class. The resulting OWL knowledge base is then processed by an ontology reasoner to form the basis of the automated GWAS diagram.
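The conversion of catalogue records into OWL individuals can be pictured with a small Python sketch. This is an illustration only and not the GOCI DataPublisher itself: the namespace, the row layout, and the class and property names (`SNP`, `Trait`, `TraitAssociation`, `has_subject`, `has_object`, `located_on`) are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical namespace and vocabulary, chosen for illustration only.
GWAS = "http://example.org/gwas#"
RDF_TYPE = "rdf:type"

Triple = Tuple[str, str, str]


def row_to_individuals(row: Dict[str, str]) -> List[Triple]:
    """Turn one (assumed) catalogue row into typed individuals and their links."""
    snp = GWAS + "snp/" + row["rsid"]
    trait = GWAS + "trait/" + row["trait"].replace(" ", "_")
    assoc = GWAS + "association/" + row["rsid"] + "-" + row["study_id"]
    return [
        (snp, RDF_TYPE, GWAS + "SNP"),
        (trait, RDF_TYPE, GWAS + "Trait"),
        (assoc, RDF_TYPE, GWAS + "TraitAssociation"),
        (assoc, GWAS + "has_subject", snp),
        (assoc, GWAS + "has_object", trait),
        (snp, GWAS + "located_on", GWAS + "chromosome/" + row["chromosome"]),
    ]


def publish(rows: Iterable[Dict[str, str]]) -> List[Triple]:
    """Convert all rows, dropping duplicate statements while keeping order."""
    seen, out = set(), []
    for row in rows:
        for triple in row_to_individuals(row):
            if triple not in seen:
                seen.add(triple)
                out.append(triple)
    return out
```

A SNP reported in two studies yields a single SNP individual but two association individuals, which is the kind of structure a reasoner can then classify.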
By representing the GWAS catalogue as an OWL knowledge base, and reasoning over the asserted axioms, we can make inferences about SNP-trait associations that are not possible in the current relational database catalogue implementation. This enables more expressive queries, the ability to detect errors or inconsistencies in the data, and powerful maintainable visualisations.
The current GWAS catalogue query mechanism relies entirely on direct string matching or on selecting individual terms from the list of over 400 available traits. As the list of catalogue traits is unstructured, it is not possible to query for even closely related diseases unless their names are very similar. Searching for “diabetes”, for example, would return results for type 1 and type 2 diabetes. However, a comparison between gastric and esophageal cancers would require either two distinct searches or a high-level “cancer” search returning results for all cancer-related traits.
Adding structure to the trait list by integrating it into an ontology offers more query options. The new GWAS catalogue knowledge base allows flexible querying at different levels of granularity. In the case of the above cancer example, possible queries could range from “Find all SNPs that are associated with gastric cancer” to “Find all SNPs that are associated with cancers located in the upper digestive tract” to “Find all SNPs that are associated with cancer”.
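The effect of querying at different levels of granularity can be sketched in a few lines of Python. The hierarchy below is a toy stand-in for the ontology, with illustrative class names rather than actual EFO terms and invented SNP identifiers; a real deployment would obtain the subclass closure from a reasoner rather than from a hand-written dictionary.

```python
from typing import Dict, List, Set, Tuple

# Toy subclass hierarchy (child -> parents); names are illustrative, not EFO terms.
PARENTS: Dict[str, List[str]] = {
    "gastric cancer": ["upper digestive tract cancer"],
    "esophageal cancer": ["upper digestive tract cancer"],
    "upper digestive tract cancer": ["cancer"],
    "breast cancer": ["cancer"],
    "cancer": [],
}

# Asserted SNP-trait associations with invented SNP identifiers.
ASSOCIATIONS: List[Tuple[str, str]] = [
    ("rs0001", "gastric cancer"),
    ("rs0002", "esophageal cancer"),
    ("rs0003", "breast cancer"),
]


def ancestors_or_self(trait: str) -> Set[str]:
    """All classes a trait belongs to, following subclass links upwards."""
    found: Set[str] = set()
    stack = [trait]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS.get(current, []))
    return found


def snps_for(trait_class: str) -> List[str]:
    """SNPs whose asserted trait is the given class or any of its subclasses."""
    return sorted(
        snp for snp, trait in ASSOCIATIONS if trait_class in ancestors_or_self(trait)
    )
```

Calling `snps_for` with "gastric cancer", "upper digestive tract cancer" and "cancer" returns progressively larger result sets from the same asserted data, mirroring the three example queries above.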
In addition to the wider range of queries available for the knowledge base, the visualisation component adds to the query power as well as to readability. For example, for trait-based queries, the visual result allows users to identify patterns such as clusters of SNPs related to a disease or disease family. Queries that are very difficult or even impossible in the current query interface include ranges of p-values, odds ratios and publication dates. These will be visualised as user-operated sliders for defining ranges of values or time frames to be displayed on the diagram. OWL queries can be used to combine as many different variables as necessary into a single query, for example “Find all SNPs located on chromosomes 5, 7, 15 and 21 that are associated with diseases located in the urinary tract, with a p-value smaller than 10⁻⁸, added to the catalogue between 01/2010 and 06/2011”. It may, however, be necessary to limit such queries to a predefined range or to precompute them, depending on scaling needs. Another option worth exploring in the case of major scalability issues would be the generation of an RDF triple store from the knowledge base and the use of SPARQL queries instead of OWL class expressions.
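To make the combination of criteria concrete, the following Python sketch applies the same kind of multi-variable filter to an in-memory list. It is a simplified stand-in for an OWL class expression, not the query mechanism of the system: the `Association` record, its field names and the assumption that each record already carries its inferred trait classes are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Association:
    """Hypothetical flattened view of one SNP-trait association."""
    snp: str
    chromosome: str
    trait_classes: FrozenSet[str]  # asserted trait plus its inferred superclasses
    p_value: float
    added: date


def combined_query(
    items: Iterable[Association],
    chromosomes: Set[str],
    trait_class: str,
    max_p: float,
    start: date,
    end: date,
) -> List[str]:
    """Return SNPs matching every criterion at once (logical AND)."""
    return [
        a.snp
        for a in items
        if a.chromosome in chromosomes
        and trait_class in a.trait_classes
        and a.p_value < max_p
        and start <= a.added <= end
    ]
```

The p-value bound and the date range correspond to the values that the planned sliders would supply.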
The GWAS diagram system is implemented as a web application with a Java back-end using an Apache Tomcat server. At start-up time, the GWAS database is classified and reasoned over using the HermiT reasoner (Motik et al., 2007). HermiT was chosen as the DL expressivity of the GWAS OWL ontology is SHOIF(D), which is beyond the expressivity supported by the ELK (Kazakov et al., 2011) reasoner. There was prior experience of scaling issues with Pellet (Sirin et al., 2007), and this reasoner also did not perform as well as HermiT in some initial informal tests.
Following processing, the reasoned view of the GWAS OWL ontology is cached in memory for use by the diagram generator for the duration of the server’s lifetime. Reasoning over the entire GWAS catalogue takes approximately 24 hours on a standard desktop machine. This is a further reason for caching the inferred view as reasoning from scratch for each query would be impractical. The reasoned knowledge base can be viewed at wwwdev.ebi.ac.uk/fgpt/gwas/ontology/gwas-diagram-data.owl.
A diagram generation request is sent by the web client in the form of an AJAX request. The server converts this request into an appropriate OWL class expression which is processed by the reasoner and translated into a string of SVG. The resulting SVG is returned to the client and can be rendered by most modern web browsers, including the latest versions of Internet Explorer, Chrome, Firefox and Safari.
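The final translation step, from query results to a string of SVG, might look roughly like the Python sketch below. It is not the Java implementation used by the system: the tuple layout, coordinates, radius and colours are assumptions, and the positions are taken as given rather than derived from the chromosome paths. The SVG `<title>` child element is what most browsers display as a tooltip on mouse-over.

```python
from typing import Iterable, Tuple
from xml.sax.saxutils import escape, quoteattr

# Hypothetical placed association: (trait, snp, x, y, colour).
Placed = Tuple[str, str, int, int, str]


def association_to_svg(trait: str, snp: str, x: int, y: int, colour: str) -> str:
    """Render one association as a circle whose <title> acts as a tooltip."""
    label = escape(f"{trait} ({snp})")  # escape &, < and > for XML text
    return (
        f'<circle cx="{x}" cy="{y}" r="4" fill={quoteattr(colour)}>'
        f"<title>{label}</title></circle>"
    )


def render_svg(associations: Iterable[Placed], width: int = 800, height: int = 600) -> str:
    """Wrap all rendered associations in a single SVG document string."""
    body = "".join(association_to_svg(*a) for a in associations)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">{body}</svg>'
    )
```

The returned string can be sent as the body of the AJAX response and inserted into the page by the client.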
Most of the SVG is generated programmatically, with the exception of the chromosomes. In order to maintain as much as possible of the look and feel of the original static GWAS diagram, the chromosome ideograms were reused. The original image files used to generate the PDF were converted into SVG format and manually modified to ensure that all drawing paths are given in the same format. These paths were then used to derive the positional information of all other diagram elements.
Figure 2: Screenshot of the GWAS diagram available on the beta site for a subset of 12 studies. The diagram displays the new colour scheme and illustrates the display of trait names on mouse-over.
Chromosomes are rendered by default, whether or not an instance of each chromosome class is part of the OWL class expression. This is because there are no known trait-associated SNPs on the Y-chromosome, while other chromosomes have very few known associations. If chromosomes were rendered based on the presence of an individual of the respective chromosome class, the Y-chromosome would always be absent from the karyotype and other chromosomes would only be displayed for some queries, which would look unnatural to the biologists viewing the diagram.
All other aspects of the diagram are rendered only if the query contains an instance of the respective OWL class. The default class expression sent to the server is for the full diagram and therefore renders every individual in the ontology. If the knowledge base is generated from the complete GWAS catalogue (all p-values), this results in 20410 individuals for the current version of the database. During development, in order to avoid the large classification times, a subset of 3 to 10 studies is used for testing purposes.
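The rendering rule described above, with chromosomes always drawn and everything else drawn only when present in the query result, can be summarised in a short Python sketch. The `(class_name, identifier)` pairs are a hypothetical simplification of the individuals returned by the reasoner.

```python
from typing import Iterable, List, Tuple

# The full human karyotype is always drawn, including Y.
CHROMOSOMES: List[str] = [str(n) for n in range(1, 23)] + ["X", "Y"]

Individual = Tuple[str, str]  # (class_name, identifier), assumed representation


def elements_to_render(result: Iterable[Individual]) -> List[Individual]:
    """Chromosomes unconditionally, other elements only if the query returned them."""
    elements: List[Individual] = [("chromosome", c) for c in CHROMOSOMES]
    # Skip chromosome individuals in the result so they are not drawn twice.
    elements += [(cls, ident) for cls, ident in result if cls != "chromosome"]
    return elements
```

An empty query result therefore still produces a complete, if unannotated, karyotype.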
Once the full diagram has been generated, the user can refine it by, for example, filtering out any number of traits or trait categories or by zooming in on one chromosome. Rather than re-rendering the diagram from scratch each time, the server analyses the previously generated SVG and adds or removes elements as required to produce a faster response.
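Editing an already generated SVG instead of rebuilding it can be sketched with Python's standard XML library. The example covers only the removal case, and it assumes that each trait element carries a `data-trait` attribute; that attribute name is hypothetical and not taken from the actual diagram markup.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialise with a default namespace rather than an "ns0:" prefix.
ET.register_namespace("", SVG_NS)


def remove_trait(svg_text: str, trait: str) -> str:
    """Drop every element tagged with the given trait from an existing SVG string."""
    root = ET.fromstring(svg_text)
    # Snapshot the tree first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("data-trait") == trait:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```

Because only the matching elements are touched, the chromosome ideograms and all other associations are reused unchanged.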
A recent user experience session with internal users of the catalogue and diagram provided useful feedback on the implementation of the dynamic GWAS diagram and identified its key use cases. These include being able to filter the diagram by p-value and by trait. The current static diagram only shows SNP-trait associations with p-value < 1.0 × 10⁻⁸, but it would be useful to either visualize all associations or make p-value selection even more stringent. It is also essential that the diagram can be filtered by trait or trait categories in order to accommodate the needs of different communities of users.
Typically with semantic web technologies, scaling can be a problem. A limitation of the current implementation is the time taken to generate the knowledge base. New SNPs are added to the GWAS catalogue on a daily basis but the current classification time makes it impractical to update the knowledge base at a similar frequency. Fortnightly or monthly reclassifications are currently being considered as possible options. Whilst such time frames would not reflect 100% of the live catalogue in the diagram at all times, they still represent a substantial improvement on the quarterly release cycle of the current static diagram.
We present a novel approach for generating the iconic NHGRI GWAS diagram using OWL and SVG. A knowledge base containing the data in the GWAS catalogue is generated using a schema ontology to model the relationships between different concepts. This knowledge base can then be queried by trait, SNP, chromosome, study and any other concept usually available in the GWAS catalogue. The result of the OWL class expression is then rendered in a web application using SVG.
Reasoner speed is currently a limitation in the diagram generation process. Classification times of over 24 hours for the full catalogue mean that it is impractical to reclassify the ontology every time a new study is added to the catalogue. However, even with this limitation, the dynamic diagram still benefits from a faster release cycle than the current quarterly diagram. A prototype of the image showing a subset of about a dozen studies (see Fig. 2) is available at wwwdev.ebi.ac.uk/fgpt/gwas. We expect a full release of the GWAS diagram by June 2012.
We thank the NIH NHGRI for funding Danielle Welter (grant 3U41-HG006104), NCBO for funding Simon Jupp (grant U54-HG004028) and EMBL for funding Tony Burdett and Helen Parkinson. We would also like to thank Peggy Hall, Darryl Leja and Teri Manolio for testing and user feedback.
Bier, E., Stone, M., Pier, K., et al. (1993) Toolglass & Magic Lens: The See-Through Interface. Proc. SIGGRAPH ’93. 73-80.
Courtot, M., Gibson, F., Lister, A.L., et al. (2009) MIREOT: the Minimum Information to Reference an External Ontology Term. International Conference on Biomedical Ontology.
Gkoutos, G.V., Green, E.C., Mallon, A.M., et al. (2005) Using Ontologies to Describe Mouse Phenotypes. Genome Biol. 6, R8.
Jupp, S., Gibson, A., Malone, J., et al. (2012) Taking a view on bio-ontologies. In preparation; International Conference of Biomedical Ontology 2012.
Hindorff, L.A., Sethupathy, P., Junkins, H.A., et al. (2009) Potential Etiologic and Functional Implications of Genome-Wide Association Loci of Human Diseases and Traits. Proc. Natl. Acad. Sci. USA. 106, 9362–9367.
Kazakov, Y., Krötzsch, M., and Simančík, F. (2011) Concurrent classification of EL ontologies. Proceedings of the 10th International Semantic Web Conference (ISWC-11).
Malone, J., Holloway, E., Adamusiak, T., et al. (2010) Modeling Sample Variables with an Experimental Factor Ontology. Bioinformatics. 26, 1112–1118.
Manolio, T.A. (2010) Genomewide Association Studies and Assessment of the Risk of Disease. N. Engl. J. Med. 363, 166–176.
de Matos, P., Alcantara, R., Dekker, A., et al. (2009) Chemical Entities of Biological Interest: an Update. Nucleic Acids Res. 38, D249–D254.
Motik, B., Shearer, R., and Horrocks, I. (2007) Optimized Reasoning in Description Logics using Hypertableaux. Proc. CADE-21. 67–83.
Noy, N.F., Shah, N.H., Whetzel, P.L., et al. (2009) BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res. 37(Web Server issue), W170-3.
Rogers, F.B. (1963) Medical Subject Headings. Bull. Med. Libr. Assoc. 51, 114–116.
Sirin, E., Parsia, B., Grau, B.C., et al. (2007) Pellet: A Practical OWL-DL Reasoner. Journal of Web Semantics. 5(2).
* To whom correspondence should be addressed: firstname.lastname@example.org.