DOMEO: a web-based tool for semantic annotation of online documents
DOMEO (Document Metadata Exchange Organizer) is an extensible web application that enables users to visually and efficiently create and share ontology-based annotation metadata on HTML or XML document targets, using the Annotation Ontology (AO) RDF model. The tool supports manual, fully automated, and semi-automated annotation with complete provenance records, as well as personal or community annotation with access authorization and control. DOMEO is the user-facing component of the SWAN Annotation Framework. DOMEO creates AO RDF that links text strings within the document to term URIs in scientific – particularly biomedical – ontologies. Because this is stand-off annotation, managed separately from the annotation target, it supports full positional metadata on web documents without requiring update control of the target. AO RDF is orthogonal to any domain ontology by design, and is therefore widely applicable to ontology-driven annotation and curation tasks across many biomedical and scientific domains.
Paolo Ciccarese*, Marco Ocana**, and Tim Clark*†‡
*Harvard Medical School and Massachusetts General Hospital, Boston MA; **Balboa Systems, Newton MA
†University of Manchester, School of Computer Science, Manchester UK.
Last year, for Bio Ontologies 2010, we presented Annotation Ontology (AO), an OWL ontology providing a model for creating ‘stand-off’ annotation anchored to online resources such as documents, images and databases and their fragments [1-3]. AO provides a robust set of methods for linking online resources, for example text in scientific publications, to ontological elements, with full representation of the annotation provenance. Through AO, existing domain ontologies and vocabularies – in OWL or SKOS – can be used, out of the box, to create extremely rich stores of metadata on web resources. In the biomedical field, subjects for ontological structuring include biological processes, molecular functions, anatomical and cellular structures, tissue and cell types, chemical compounds, and biological entities such as genes and proteins. However, AO is not limited to the biomedical domain and can easily be used in other scientific and non-scientific contexts. In fact, AO is already used by other projects focusing on biodiversity and social tagging [6, 7].
AO, by linking new scientific content to computationally defined terms and entity descriptors, can help establish semantic interoperability across the diverse masses of specialist science embodied in digital media, from journals, to wikis and blogs [8, 9], to the growing world of web-based research “collaboratories”. Annotation – either marking up contributions with comments or, more importantly, with relevant concepts and entities from biomedical ontologies – provides a technological boost to “strategic reading” for members of such communities [11, 12] and can selectively breach established specialist focus boundaries and semantic barriers. In biomedicine, semantic interoperability facilitates cross-species comparisons, pathway analysis, disease modeling, and the generation of new hypotheses through data integration and machine reasoning.
While AO provides the model for encoding and sharing annotation in the convenient RDF (Resource Description Framework) format, software applications are still needed that allow users, in our case biomedical scientists, to manually or semi-automatically create, share/publish, search and utilize annotation, and to manage algorithmically created annotation. Because we strongly believe that developing actual software is required to test the exchange model against real use cases, we developed AO in parallel with the SWAN Annotation Framework, a web application suite with a rich set of features including (i) semantically annotating – manually or semi-automatically – online HTML and XML documents; (ii) sharing the annotation in RDF; and (iii) searching the annotation while leveraging semantic inference. DOMEO is the user interface component of the SWAN Annotation Framework.
We present DOMEO in the following sections.
THE DOMEO ANNOTATION TOOL
The tool is currently in alpha release with approximately 50 alpha users. It was developed upon an initial set of requirements accumulated in developing curation-intensive biomedical knowledge bases and scientific online communities. Requirements were approximately equally distributed across:
The ALZSWAN knowledge base – a customization of the Semantic Web Applications in Neuromedicine (SWAN) platform for Alzheimer Disease – developed in collaboration with the Alzheimer Research Forum (http://www.alzforum.org). It organizes 184 hypotheses and 2,214 specific scientific claims, with relevant evidence, referring to 266 gene-protein groups and 2,567 bibliographic resources (http://hypothesis.alzforum.org).
The Science Commons Antibodies Resource – an OWL model for formally representing antibodies as referred to in the scientific literature. Developed in collaboration with Science Commons and the Alzheimer Research Forum, this knowledge base required intensive curation of existing relational databases as well as of the documents provided by the antibody vendors.
PDOnline (http://pdonlineresearch.org) – a web portal and forum for the Parkinson Disease researcher community, collecting several relevant resources including extensive online discussions by scientists. The resources can be annotated according to the PDGuide controlled vocabulary, in which terms are organized as a taxonomy.
Pain Research Forum (http://painresearchforum.org) – a web portal and forum for the Pain Research Community.
After the first alpha release of DOMEO, in October 2010, we initiated an intense social process across several categories of potential users, to assure a constant flow of use-cases as well as continuous and valuable community feedback. To guarantee the desired level of coverage and flexibility of the application, we connected with a variety of collaborating partners including pharmaceutical companies, a major scientific publisher (Elsevier), a philanthropy (the Spinal Muscular Atrophy Foundation) and several academic groups with different capacities and goals. Many inputs also came from the W3C Health Care and Life Sciences (HCLS) Interest Group (http://www.w3.org/blog/hcls) and, in particular, from the HCLS Scientific Discourse sub-task. Valuable use cases have also been provided by several academic groups specializing in text-mining.
MANUALLY CREATING ANNOTATION
DOMEO was designed to blend in with the scientists’ everyday workflow. The DOMEO user loads a specified URL into the application and then sees DOMEO-specific menus in a bar just above the normally displayed document (Figure 1). The user can then manually annotate the whole document, or sections of it, by simply selecting the desired portion of text and attaching a topic or, in other words, an instance of one of the several available annotation types.
Figure 1: DOMEO is a user interface component that allows loading and annotating any HTML document.
The simplest annotation item that can be created is the semantic tag or, in AO terms, the ‘Qualifier’. The tool allows attaching ontology or vocabulary terms – this can be any term identified by a URI – to a document or document fragment. The process is enabled by a user interface that performs the search operation by connecting to an external web service. The currently deployed alpha version of the tool connects to the NCBO (National Center for Biomedical Ontology) BioPortal REST web service for ontology-driven entity identification [16, 17]. The text-hit results are presented in a linear list. Alternative ontology search and exploration tools with expanded features and improved algorithms are under development by several of our collaborators, along with web service interfaces to DOMEO for a variety of text mining algorithms.
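The stand-off character of a qualifier can be illustrated with a minimal Python sketch. It is not DOMEO's implementation: the class names, the prefix/exact/postfix selector fields, and the example URIs are assumptions made for illustration. The point is that the annotation stores only a reference to the target and enough textual context to relocate the span, so the target document itself is never modified.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TextSelector:
    """Hypothetical selector: finds a span by its text plus surrounding context."""
    prefix: str
    exact: str
    postfix: str

    def locate(self, text: str) -> Optional[Tuple[int, int]]:
        """Return (start, end) offsets of `exact` in `text`, or None if absent."""
        pos = text.find(self.prefix + self.exact + self.postfix)
        if pos < 0:
            return None
        start = pos + len(self.prefix)
        return start, start + len(self.exact)


@dataclass(frozen=True)
class Qualifier:
    """Hypothetical stand-off semantic tag linking a text span to a term URI."""
    target_url: str
    selector: TextSelector
    term_uri: str


document_text = (
    "Mutations in the APP gene are linked to early-onset disease. "
    "APP is cleaved by BACE1."
)
tag = Qualifier(
    target_url="http://example.org/paper.html",      # placeholder document URL
    selector=TextSelector(prefix="in the ", exact="APP", postfix=" gene"),
    term_uri="http://example.org/ontology/APP",      # placeholder term URI
)
span = tag.selector.locate(document_text)
```

The prefix and postfix let the selector pick the intended occurrence of "APP" even though the string appears twice in the text.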
It is important to note that the tool allows users potentially to connect any search service and therefore to customize the list of available vocabularies. By simply changing the set of vocabularies used for performing the annotation, it is possible to tackle domains other than biomedicine.
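One way to picture this pluggability is a small common interface that every search back end satisfies. The Python sketch below is an assumption for illustration only: `TermSearchService`, `InMemoryVocabulary` and `find_terms` are hypothetical names, and the toy back end stands in for a real web service such as the one DOMEO currently calls.

```python
from typing import Dict, List, NamedTuple, Protocol


class TermHit(NamedTuple):
    uri: str
    label: str


class TermSearchService(Protocol):
    """Hypothetical interface that any vocabulary search back end would satisfy."""

    def search(self, query: str) -> List[TermHit]: ...


class InMemoryVocabulary:
    """Toy back end over a URI -> label mapping; a real one would call a web service."""

    def __init__(self, terms: Dict[str, str]) -> None:
        self._terms = terms

    def search(self, query: str) -> List[TermHit]:
        q = query.lower()
        hits = [TermHit(u, label) for u, label in self._terms.items() if q in label.lower()]
        return sorted(hits, key=lambda h: h.label)


def find_terms(services: List[TermSearchService], query: str) -> List[TermHit]:
    """Query every configured service and return all hits as one linear list."""
    results: List[TermHit] = []
    for service in services:
        results.extend(service.search(query))
    return results


# Placeholder vocabularies: swapping this list changes the domain being annotated.
biomed = InMemoryVocabulary({
    "http://example.org/bio/1": "amyloid precursor protein",
    "http://example.org/bio/2": "tau protein",
})
geology = InMemoryVocabulary({"http://example.org/geo/1": "basalt"})
hits = find_terms([biomed, geology], "protein")
```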
Once the annotation – in this case a qualifier – is created, the annotated span of text in the document is visually highlighted. It is also possible to click on the span of text to inspect the annotation items associated with it through a popup (Figure 2).
Figure 2: Users click on the annotated text to inspect the associated annotation items and semantic entities.
DOMEO is extensible. Besides the qualifier, it can support several other types of annotation through the development of additional software components. Plug-ins already developed include features for modeling scientific discourse according to the model provided by the SWAN ontology and features for modeling antibody usage. The latter consists of annotating text with one of the antibody entries of antibodyregistry.org and, optionally, with the methods and species involved in the particular study reported in the document content. New annotation types can be added to the tool by developing additional plug-ins to define user interface components, semantic aspects of the new annotation topics, and connectors to external services when needed.
In many cases, the efficiency of mass-scale manual annotation can be significantly augmented by annotation algorithms. DOMEO allows implementing the RECS (Run, Encode, Curate, Share) process. Using this process, it is possible to select and run external text mining services, encode the results in the AO format, and display the results in the context of the annotated document (Figure 3) to enable the curation process.
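The Run and Encode steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary matcher is a toy stand-in for an external text mining service, and `AnnotationItem`, `run_text_miner`, `encode` and the service label `toy-matcher/0.1` are hypothetical names, not part of DOMEO or AO.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AnnotationItem:
    start: int
    end: int
    term_uri: str
    created_by: str  # name and version of the service that produced the item


def run_text_miner(text: str, dictionary: Dict[str, str]) -> List[dict]:
    """Run: a toy dictionary matcher standing in for an external service."""
    hits = []
    for label, uri in dictionary.items():
        for m in re.finditer(r"\b" + re.escape(label) + r"\b", text):
            hits.append({"start": m.start(), "end": m.end(), "uri": uri})
    return sorted(hits, key=lambda h: h["start"])


def encode(hits: List[dict], service: str) -> List[AnnotationItem]:
    """Encode: turn raw service output into uniform annotation items."""
    return [AnnotationItem(h["start"], h["end"], h["uri"], service) for h in hits]


text = "BACE1 cleaves APP; APP processing is altered."
dictionary = {
    "APP": "http://example.org/term/APP",      # placeholder term URIs
    "BACE1": "http://example.org/term/BACE1",
}
items = encode(run_text_miner(text, dictionary), service="toy-matcher/0.1")
```

Each encoded item carries the identity of the producing service, so that a curator can later judge the output of one algorithm separately from another.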
Curation is a crucial aspect of scientific publication and therefore an important aspect of both our annotation ontology and our annotation tool. We enable curation for annotation generated by both humans and text mining services. In the case of automatically generated annotation, the tool allows curators to judge each annotation item (or set of annotation items) according to a configurable set of judgment categories. By default the set of categories is: wrong, right, too broad, and unclear, where unclear means the curator is unable to judge the result. Every time a curator judges and responds to a result, s/he can also provide a motivation that can be used later for further evaluation.
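A curation record of this kind might be modeled as follows. The sketch is an assumption for illustration: the `Judgment` class and `judge` function are hypothetical, and only the four default category names are taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Optional

# Default categories as listed in the text; the set is meant to be configurable.
DEFAULT_CATEGORIES: FrozenSet[str] = frozenset({"wrong", "right", "too broad", "unclear"})


@dataclass(frozen=True)
class Judgment:
    annotation_id: str
    curator: str
    category: str
    motivation: Optional[str]
    judged_on: datetime


def judge(
    annotation_id: str,
    curator: str,
    category: str,
    motivation: Optional[str] = None,
    categories: FrozenSet[str] = DEFAULT_CATEGORIES,
) -> Judgment:
    """Record one curator's response to one annotation item."""
    if category not in categories:
        raise ValueError(f"unknown judgment category: {category!r}")
    return Judgment(annotation_id, curator, category, motivation, datetime.now(timezone.utc))


j = judge("ann-1", "curator-a", "too broad", motivation="term is a parent class")
```

Passing a different `categories` set models a deployment that has configured its own judgment vocabulary.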
Figure 3: The text-mining results are displayed on the document and the curation popup lets the user review and respond to automatically generated annotation items
As several users may produce annotation on the same document, several users or curators may therefore curate the same results. The annotation tool enables both concurrent and collaborative annotation and curation processes.
PROVENANCE, ACCESS CONTROL AND RDF SHARING
In working with online scientific communities, we are particularly aware of the importance of provenance tracking for establishing trust and properly documenting the evolution of the science. AO offers a rich set of properties for modeling provenance based on the Provenance Authoring and Versioning (PAV) ontology originally developed for the SWAN project. Our annotation tool tracks all the provenance aspects transparently while the user performs the annotation process. For every piece of annotation and annotation curation, the tool records the originating user, date, and the specific version of any software or web service involved. The annotation and curation items, together with all the provenance data, can then be serialized in RDF format according to the AO model. Serialization includes RDF representing aspects of the domain ontologies used in any annotation, as well as the AO RDF itself.
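To give a feel for what such a serialization could look like, the following Python sketch writes one annotation with its provenance as N-Triples using only the standard library. The namespace and property URIs used for AO and PAV, and all example identifiers, are assumptions for illustration and should be checked against the published ontologies; the literal escaping handles only backslashes and double quotes.

```python
from typing import List

AO = "http://purl.org/ao/"      # assumed AO namespace
PAV = "http://purl.org/pav/"    # assumed PAV namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def uri(value: str) -> str:
    return f"<{value}>"


def literal(value: str, datatype: str = "") -> str:
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{datatype}>' if datatype else f'"{escaped}"'


def annotation_triples(
    annotation: str, target: str, term: str,
    creator: str, created_on: str, tool: str,
) -> List[str]:
    """Return N-Triples lines for one annotation and its provenance."""
    subject = uri(annotation)
    triples = [
        (uri(RDF_TYPE), uri(AO + "Annotation")),            # assumed class name
        (uri(AO + "annotatesResource"), uri(target)),       # assumed property
        (uri(AO + "hasTopic"), uri(term)),                  # assumed property
        (uri(PAV + "createdBy"), uri(creator)),             # originating user
        (uri(PAV + "createdOn"), literal(created_on, XSD_DATETIME)),
        (uri(PAV + "createdWith"), literal(tool)),          # software version
    ]
    return [f"{subject} {p} {o} ." for p, o in triples]


lines = annotation_triples(
    annotation="http://example.org/ann/1",
    target="http://example.org/paper.html",
    term="http://example.org/ontology/APP",
    creator="http://example.org/user/alice",
    created_on="2011-05-01T12:00:00Z",
    tool="example-annotator 0.1",
)
```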
Figure 4: Annotation Sets access control
DOMEO also implements another feature of AO: the Annotation Set, a mechanism for grouping annotation items. The notion of an Annotation Set was included in AO to assist in annotation organization. Sets can be used, for instance, to collect items of the same type (e.g. proteins or genes), to show/hide multiple items, and to define the corresponding access policy. Using the annotation tool it is also possible to define, for each set, which users will be able to access the annotation items (Figure 4): only the creator (personal annotation), selected groups, or everybody (public annotation).
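The three access levels can be expressed as a simple check. The Python sketch below is an assumption for illustration, not DOMEO's access control code: the names are hypothetical, and it assumes that the creator of a set can always access it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable


class Visibility(Enum):
    PERSONAL = "personal"   # only the creator
    GROUPS = "groups"       # selected groups
    PUBLIC = "public"       # everybody


@dataclass(frozen=True)
class AnnotationSet:
    creator: str
    visibility: Visibility
    allowed_groups: FrozenSet[str] = frozenset()


def can_access(aset: AnnotationSet, user: str, user_groups: Iterable[str]) -> bool:
    """Return True if `user` may see the items of `aset`."""
    # Assumption: the creator always retains access to their own set.
    if user == aset.creator or aset.visibility is Visibility.PUBLIC:
        return True
    if aset.visibility is Visibility.GROUPS:
        return bool(aset.allowed_groups & set(user_groups))
    return False


personal = AnnotationSet("alice", Visibility.PERSONAL)
shared = AnnotationSet("alice", Visibility.GROUPS, frozenset({"lab-a"}))
public = AnnotationSet("alice", Visibility.PUBLIC)
```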
Fifty alpha testers currently use DOMEO. The number is planned to double shortly, when the beta release candidate becomes available. With the beta release, many additional features now in development will be brought into production. One important feature for the beta will allow integration with the Apache UIMA framework, so that text miners using that architecture will be able to display and curate the results of their text mining with our tool. With this tool, and the collaborations currently in place, we expect to be able to publish large quantities of high-quality annotation on scientific documents in AO RDF format. The published annotation will include the content of the AlzSWAN knowledge base (http://hypothesis.alzforum.org), with the discourse elements – claims, hypotheses, and questions – linked to the corresponding text in the original papers. We also note that annotation produced with our tool can be displayed on the corresponding PDF documents in the Utopia application [19, 20], as Utopia can now consume AO RDF. We are currently working with the Utopia group to enable the opposite workflow: producing annotation on a PDF of a scientific paper, and displaying it on the HTML version.
Development of the SWAN Annotation Framework, including DOMEO, has been funded by a grant from the National Institute on Drug Abuse, National Institutes of Health, as part of the Neuroscience Information Framework; by a grant from EMD Serono, Inc. as part of the MS Discovery Forum project; by a grant from Elsevier; and by a grant from Eli Lilly and Company. We are most grateful for the financial support of these organizations.
We thank Professor Maryann Martone and Anita Bandrowski of the University of California at San Diego; Anita deWaard, Bradley Allen and Antony Scerri of Elsevier B.V., Adam West and Ernest Dow of Eli Lilly and Company, and Carole Goble of the University of Manchester, for their continuing support and for many fruitful discussions and much joint work. We also thank Steve Pettifer of University of Manchester for the work toward integration of our tool with the Utopia PDF annotator.
1. Ciccarese P, Ocana M, Das S, Clark T: AO: An open annotation ontology for science on the Web. In: Bio Ontologies 2010: July 9-13, 2010; Boston MA, USA.
2. Ciccarese P, Ocana M, Garcia-Castro LJ, Das S, Clark T: An Open Annotation Ontology for Science on Web 3.0 BMC Bioinformatics in press.
3. The Annotation Ontology on Google Code [http://code.google.com/p/annotation-ontology/]
4. McGuinness D, van Harmelen F: OWL Web Ontology Language. W3C Recommendation 2004.
5. Miles A, Bechhofer S: SKOS Simple Knowledge Organization System Reference. W3C Recommendation 2009.
6. Tags4Labs [http://www.biotea.ws/node/3]
7. Garcia-Castro A, Labarga A, Garcia L, Giraldo O, Montaña C, Bateman JA: Semantic Web and Social Web heading towards Living Documents in the Life Sciences. Web Semantics: Science, Services and Agents on the World Wide Web 2010, 8(2-3):155-162.
8. Waldrop M: Big data: Wikiomics. Nature 2008, 455(7209):22-25.
9. Waldrop MM: Science 2.0. Scientific American 2008, 298(5):68-73.
10. Bos N, Zimmerman A, Olson J, Yew J, Yerkie J, Dahl E, Olson G: From shared databases to communities of practice: A taxonomy of collaboratories. Journal of Computer-Mediated Communication 2007, 12(2):article 16.
11. Renear AH, Palmer CL: Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 2009, 325(5942):828-832.
12. Shotton D, Portwin K, Klyne G, Miles A: Adventures in semantic publishing: exemplar semantic enhancements of a research article. PLoS Comput Biol 2009, 5(4):e1000361.
13. Das S, Goetz M, Girard L, Clark T: Scientific Publications on Web 3.0. In: 13th International Conference on Electronic Publishing (ELPUB 2009): 10-12 June 2009; Milan, Italy. 2009.
14. Science Commons Semantic Resources Project: Antibody Resource [http://neurocommons.org/page/Semantic_resources_project/Antibodies]
15. Das S, Rogan M, Kawadler H, Corlosquet S, Brin S, Clark T: PD Online: a case study in scientific collaboration on the Web. In: Workshop on the Future of the Web for Collaborative Science, 19th International World Wide Web Conference: April 26-30, 2010; Raleigh, NC, USA.
16. Jonquet C, Musen MA, Shah N: A system for ontology-based annotation of biomedical data. In: International Workshop on Data Integration in the Life Sciences 2008, DILS’08: 2008; Evry, France.
17. Jonquet C, Musen MA, Shah NH: Help will be provided for this task: Ontology-Based Annotator Web Service. In. Stanford, CA: Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine; 2008: 16.
18. Ciccarese P, Wu E, Wong G, Ocana M, Kinoshita J, Ruttenberg A, Clark T: The SWAN biomedical discourse ontology. J Biomed Inform 2008, 41(5):739-751.
19. Attwood TK, Kell DB, McDermott P, Marsh J, Pettifer SR, Thorne D: Calling International Rescue: knowledge lost in literature and data landslide! Biochemical Journal 2009, 424(3):317-333.
20. Attwood TK, Kell DB, McDermott P, Marsh J, Pettifer SR, Thorne D: Utopia documents: linking scholarly literature with research data. Bioinformatics 2010, 26(18):i568-i574.