Posted on June 29, 2011 by Phillip Lord

An Ontology of Annotation Content Structure and Provenance

Abstract

Motivation: Representing and understanding complex biological systems requires knowledge representations that can relate multiple concepts to each other through sets of assertions. Annotation efforts that seek to curate this information require the ability to annotate with more than single ontology concepts or identifiers. We propose an extension to the Information Artifact Ontology (IAO) for representing annotations, including single term annotations as well as annotations containing multiple statements. This extension enables tracking the provenance of annotations in terms of other annotations, as well as the provenance of individual parts of statements in multi-statement annotations.

Authors

Kevin M. Livingston, Michael Bada, Lawrence E. Hunter, Karin M. Verspoor

Center for Computational Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA

Introduction

Representing and understanding complex biological systems requires constructing knowledge representations that go beyond the selection of a single term in a single ontology to describe them. For example, fully representing an event requires capturing information about the type of event, its participants, locations, and other contextual information, into one knowledge structure. All this information together is relevant to any understanding or reasoning done by human or machine with respect to that particular event.

Annotations are one of the primary ways biological information is being curated and distributed. There are two prominent kinds of biomedical annotations: (1) associating ontology terms with genes, gene products, or other entities, such as Gene Ontology (GO) annotation, and (2) associating ontology terms or database identifiers with (typically) textual references in documents, such as the output of the NCBO Annotator (Jonquet et al. 2009).

The primary focus of most annotation work to date has been annotating with single ontology terms or identifiers. These annotations have proven useful, for example, in computing term enrichment or for indexing for search. However, single term annotations do not provide a complete understanding of the biological content they are annotating. The Entrez Gene record for Human TP53 (7157) lists 10 different phenotypes, 79 process or function annotations, and 17 component annotations. It is highly unlikely that all processes and components are associated with each of the phenotypes listed; rather there are various subsets of all of these annotations that are associated with each other in different contexts. Likewise, having a set of ontology terms associated with a document provides far more information than just the textual strings, but it is a long way from a complete understanding of the document content. For example, consider a document annotated with multiple proteins and the ontology term for calcium transport. Viewing those annotations as an unstructured set provides no information as to which protein (if any, or all) may perform that function.

While ontologies strive to be complete, it is likely that specific applications will require concepts that are not explicitly expressed in an ontology and thus require dynamic construction. Furthermore, as information needs increase and annotation efforts expand to cover more complex concepts, knowledge structures containing relations among many parts will need to be represented. Annotators also need the ability to reference existing annotations, or their content, as the provenance for more complex annotations.

We present an extension of the Information Artifact Ontology (IAO) for representing annotation content, i.e., concepts or sets of assertions denoted by annotations. This proposal focuses on annotation content and not on metadata such as author or creation date, but it is consistent with existing models for representing this information. We provide a mechanism for representing both the provenance of annotations in terms of other annotations as well as the provenance of their semantic content in terms of other semantic content. This ontology consistently applies to use cases in both entity-oriented annotation (e.g., GO annotation) and document-oriented annotation (e.g., NCBO Annotator).

Background

There are many formats for recording biomedical annotations. Most associate single ontology terms with genes, gene products, or other entities. Among those that afford more complex representations is the Gene Association Format (GAF 2.0). GAF provides the ability to add “annotation extensions” to a GO annotation that can, for example, further constrain the annotation to occur in a particular location, or to part of another component. The existence of such extensions demonstrates the community’s need for recording more structured annotations. However, they are specific to GO annotations and GAF-formatted data. Our model is generally applicable to all annotations and affords access to Semantic Web tools such as reasoners and visualizations not available to idiosyncratic formats.

There are several proposed RDF-based models for representing annotations over web resources, including but not limited to text. Most of these models, and most of the well-studied use cases for annotation (e.g., biological literature curation (Clark et al. 2011), or digital humanities applications (Hunter et al. 2010)), require only the association of an individual concept or database identifier with a given information source such as a text segment (e.g., identifying genes). In contrast, the natural language processing community has developed solutions for representing complex syntax and semantics for documents, such as full parse trees in the Penn Treebank (Marcus et al. 1993), but these representations are mostly idiosyncratic and not interoperable.

Our own work revolves around two primary use cases, although it also benefits the community in general. The first use case is publishing semantic content produced by text mining systems in a format that integrates with existing Semantic Web tools, such as the Annotation Ontology (Ciccarese et al. 2010) and viewer. Our model is consistent with these existing efforts while being capable of capturing annotations with more structure than single terms.

The second use case is enabling natural language understanding systems that can reason over RDF-based annotations, specifically Direct Memory Access Parsing (Riesbeck and Martin 1986) systems, which use patterns of lexical and semantic elements to recognize and interpret language, e.g., OpenDMAP (Hunter et al. 2008) and REDMAP (Livingston 2009). Our model would allow annotations produced by other systems to be leveraged in producing more complex semantic structures, with their provenance precisely recorded.

 

Fig. 1. Example of kiao:ResourceAnnotation denoting a single rdfs:Resource, a protein. Relevant ontology terms: gray, classes: boxes, instances: ovals, properties: no border. Standard rdf/rdfs namespaces omitted.

Annotation representation

We propose an extension to the Information Artifact Ontology (IAO) to represent the content of annotations, including annotations of single ontology terms or identifiers as well as annotations containing sets of assertions. Annotations can be composed and the provenance of that composition can be fully recorded, both for the annotations themselves and the individual elements of statements in the annotations.

We divide annotations into two primary classes:

ResourceAnnotation, which is an annotation that associates a single RDF resource with a target, and

StatementSetAnnotation, which is an annotation that associates a set of assertions with a target.

This proposal focuses on the structure of the content of annotations and is neutral with respect to annotation schema or annotation content. Details about how to record metadata, such as author and creation date, are therefore elided from this discussion. A base annotation class that is consistent with any of the existing RDF-based annotation methods (e.g., the Annotation Ontology) is assumed and can be treated as a parent class for both ResourceAnnotation and StatementSetAnnotation.

We reuse or extend existing community-curated ontologies where possible. The Information Artifact Ontology (IAO) is a good starting point for annotations. Our in-house knowledge base of biology (KaBOB) is the aggregator of our work; KaBOB extensions of an ontology are named by prefixing the ontology’s namespace with the letter ‘k’. The namespace kiao: is therefore used for our extension of the IAO. Both kiao:StatementSetAnnotation and kiao:ResourceAnnotation are rdfs:subClassOf iao:data item. The ex: namespace is used for examples.

ResourceAnnotation

A resource annotation is an annotation that associates a single rdfs:Resource with a location and is of rdf:type kiao:ResourceAnnotation. The relation kiao:denotesResource is used to associate the resource with the concept being annotated. This property is rdfs:subPropertyOf iao:denotes, which relates an information content entity (in this case an annotation) to something that it is specifically intended to refer to (in this case a rdfs:Resource). Figure 1 depicts an example annotation of an interferon gamma protein (pro:000000017 from the Protein Ontology (Natale et al. 2007)).

For the purposes of alignment with existing annotation models, kiao:denotesResource could be made rdfs:subPropertyOf a corresponding relation, e.g., ao:hasTopic. A single rdf:Statement could be used as the denoted resource of a ResourceAnnotation (since rdf:Statement is rdfs:subClassOf rdfs:Resource) or in a single-statement StatementSetAnnotation.
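As a minimal sketch of the ResourceAnnotation in Figure 1, the triples can be written out as plain tuples. The tuple-of-strings encoding and the helper name denoted_resource are assumptions made for illustration only; ex:A1 is the example annotation identifier used in the figures.

```python
# Triples are modeled as (subject, predicate, object) tuples of CURIE strings.
# This encoding is an illustrative assumption, not part of the proposal.
RESOURCE_ANNOTATION = {
    ("ex:A1", "rdf:type", "kiao:ResourceAnnotation"),
    ("ex:A1", "kiao:denotesResource", "pro:000000017"),  # interferon gamma
}


def denoted_resource(graph, annotation):
    """Return the single resource denoted by a ResourceAnnotation, or None."""
    resources = [o for s, p, o in graph
                 if s == annotation and p == "kiao:denotesResource"]
    if len(resources) > 1:
        raise ValueError(f"{annotation} denotes more than one resource")
    return resources[0] if resources else None


print(denoted_resource(RESOURCE_ANNOTATION, "ex:A1"))  # pro:000000017
```

The helper rejects an annotation with several denoted resources, mirroring the requirement that a ResourceAnnotation associates a single resource with its target.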

StatementSetAnnotation

While ResourceAnnotation instances denote a single RDF Resource, StatementSetAnnotation instances represent sets of RDF statements. These statements can correspond to any set of RDF triples desired by the annotator. We make no restriction as to which triples are allowed or what they represent; this proposal only recommends how to represent them and assign provenance to the content of the triples and their constituent members.

The triples comprising the content of a StatementSetAnnotation are represented as instances of rdf:Statement. A StatementSetAnnotation is associated with one or more of these reified statements using the property kiao:mentionsStatement, which is an rdfs:subPropertyOf iao:mentions. The IAO defines mentions as meaning that the subject of the statement has a part (ro:has_part) that iao:denotes the object. The example in Figure 2 represents an annotation for a translation (go:0006412 from the Gene Ontology) of an interferon gamma protein (pro:000000017).
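The structure in Figure 2 can be sketched as follows, assuming triples are plain tuples of strings. The helper names reify and statement_set_annotation are hypothetical, and the ex: prefix on resultsInFormationOf is an assumption, since the text does not name the namespace of that property.

```python
def reify(statement_id, triple):
    """Return the four triples that describe `triple` as an rdf:Statement."""
    subject, predicate, obj = triple
    return {
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    }


def statement_set_annotation(annotation_id, statements):
    """Build the triples for a StatementSetAnnotation.

    `statements` maps a statement identifier to the triple it reifies.
    """
    graph = {(annotation_id, "rdf:type", "kiao:StatementSetAnnotation")}
    for statement_id, triple in statements.items():
        graph.add((annotation_id, "kiao:mentionsStatement", statement_id))
        graph |= reify(statement_id, triple)
    return graph


A2 = statement_set_annotation("ex:A2", {
    "ex:S1": ("ex:T1", "rdfs:subClassOf", "go:0006412"),
    # The ex: namespace of this predicate is assumed for the example.
    "ex:S2": ("ex:T1", "ex:resultsInFormationOf", "pro:000000017"),
})
print(len(A2))  # 1 type triple + 2 mentions + 2 * 4 reification triples = 11
```

Note that the graph contains only descriptions of the two statements; the triples themselves are never asserted.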

The use of reified statements protects users of the annotation from committing to or believing the propositions represented in the annotation unless they want to. For example, one annotation could contain the triple (Earth hasShape Flat). In its reified form, a reader of this annotation is not committed to believing the Earth is flat. What has been represented is effectively “this particular annotation says, ‘the Earth is flat.’” Should a user of a StatementSetAnnotation choose to reason about the contents of an annotation, the statements that it encodes can be recomposed from their reified forms. Again, this proposal is only about the structure and provenance of the semantic content of annotations, not confidence, trust, or other epistemological or modal information, which could be modeled independently.
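Recomposition can be illustrated with a short sketch, again assuming triples are plain tuples. The identifiers ex:A9 and ex:S9, the ex: terms for the Earth example, and the function name recompose are all hypothetical.

```python
def recompose(graph, annotation):
    """Rebuild the plain triples mentioned by a StatementSetAnnotation."""
    statement_ids = {o for s, p, o in graph
                     if s == annotation and p == "kiao:mentionsStatement"}
    triples = set()
    for statement_id in statement_ids:
        parts = {p: o for s, p, o in graph if s == statement_id}
        triples.add((parts["rdf:subject"],
                     parts["rdf:predicate"],
                     parts["rdf:object"]))
    return triples


# Hypothetical annotation ex:A9 mentioning one reified statement ex:S9.
GRAPH = {
    ("ex:A9", "rdf:type", "kiao:StatementSetAnnotation"),
    ("ex:A9", "kiao:mentionsStatement", "ex:S9"),
    ("ex:S9", "rdf:type", "rdf:Statement"),
    ("ex:S9", "rdf:subject", "ex:Earth"),
    ("ex:S9", "rdf:predicate", "ex:hasShape"),
    ("ex:S9", "rdf:object", "ex:Flat"),
}

claim = ("ex:Earth", "ex:hasShape", "ex:Flat")
print(claim in GRAPH)                      # False: the claim is only described
print(claim in recompose(GRAPH, "ex:A9"))  # True: recovered on request
```

The claim becomes an ordinary triple only when a consumer explicitly asks for it, which is the protection that reification is meant to provide.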

 


Fig. 2. Example of kiao:StatementSetAnnotation mentioning two statements: (T1 subClassOf go:0006412) and (T1 resultsInFormationOf pro:000000017). See Fig. 1 for figure key.

Provenance

The ontology extension we propose provides two levels of abstraction for tracking the provenance of annotation content.

Annotation Level

The first and simplest is annotation-level provenance. If one annotation builds on another annotation in any way, it can document this relationship using the kiao:basedOnAnnotation property to associate itself with any annotation used in its construction. For example, the top arc in Figure 3 relates the two previous annotations in this way.

This property can be used both when there is a direct relation between annotations, such as one directly using elements of another, and when there is an indirect relationship, such as one annotation being used as the justification for another’s existence even though no parts were shared (e.g., the annotation of a specific gene being used to justify the annotation of the concept “gene” from the Sequence Ontology, so:0000704).
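A sketch of how a consumer might query annotation-level provenance is given here, assuming triples are plain tuples. The annotation ex:A3 and both function names are hypothetical. Following the links repeatedly is only a query convenience; the proposal does not state that kiao:basedOnAnnotation is transitive.

```python
def based_on(graph, annotation):
    """Annotations directly recorded as sources of `annotation`."""
    return {o for s, p, o in graph
            if s == annotation and p == "kiao:basedOnAnnotation"}


def provenance_closure(graph, annotation):
    """Every annotation reachable by following kiao:basedOnAnnotation links."""
    seen, stack = set(), [annotation]
    while stack:
        for source in based_on(graph, stack.pop()):
            if source not in seen:  # also guards against cycles
                seen.add(source)
                stack.append(source)
    return seen


GRAPH = {
    ("ex:A2", "kiao:basedOnAnnotation", "ex:A1"),
    ("ex:A3", "kiao:basedOnAnnotation", "ex:A2"),  # ex:A3 is hypothetical
}
print(sorted(based_on(GRAPH, "ex:A3")))            # ['ex:A2']
print(sorted(provenance_closure(GRAPH, "ex:A3")))  # ['ex:A1', 'ex:A2']
```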

Element Level

The second layer of annotation content provenance is more detailed and allows for tracking the provenance of the individual statement elements of an annotation. This is done by reifying statement elements and relating them to the elements of other annotations. There are two sets of properties used to model statement elements, one set for indicating what statement a particular element is part of, and one set to model what it is based on.

The relation kiao:mentionsStatementElement relates a StatementSetAnnotation to statement elements; the object of this relation is a reified StatementElement. A StatementElement is then associated with the statement in the set that it is part of via one of three relations:

kiao:isSubjectOf, kiao:isPredicateOf, or kiao:isObjectOf.

These relations correspond to the three positions in a reified Statement that a StatementElement could be representing. A StatementElement is then related to the content that it is based on using one of four properties:

kiao:basedOnResourceOf, kiao:basedOnSubjectOf, kiao:basedOnObjectOf, or kiao:basedOnPredicateOf.

The relation kiao:basedOnResourceOf is used to record the provenance of a StatementElement as the denoted resource of a ResourceAnnotation. Fig. 3 depicts the element-level provenance, documenting that statement element ex:E1 is the object of statement ex:S2 and is based on the resource denoted by annotation ex:A1.

Fig. 3. A partial representation highlighting provenance, depicting A2 (from Fig. 2) based on A1 (from Fig. 1). Also depicting a single kiao:StatementElement that is the object of Statement S2, and based on the denoted Resource of A1.
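The element-level links of Figure 3 can be sketched as below, assuming triples are plain tuples. The helper names are hypothetical, and the ex: prefix on resultsInFormationOf is an assumption.

```python
# Maps each "is part of" property to the statement position it refers to.
POSITION = {
    "kiao:isSubjectOf": "rdf:subject",
    "kiao:isPredicateOf": "rdf:predicate",
    "kiao:isObjectOf": "rdf:object",
}

GRAPH = {
    ("ex:A1", "kiao:denotesResource", "pro:000000017"),
    ("ex:S2", "rdf:subject", "ex:T1"),
    ("ex:S2", "rdf:predicate", "ex:resultsInFormationOf"),  # namespace assumed
    ("ex:S2", "rdf:object", "pro:000000017"),
    ("ex:A2", "kiao:mentionsStatement", "ex:S2"),
    ("ex:A2", "kiao:mentionsStatementElement", "ex:E1"),
    ("ex:E1", "rdf:type", "kiao:StatementElement"),
    ("ex:E1", "kiao:isObjectOf", "ex:S2"),
    ("ex:E1", "kiao:basedOnResourceOf", "ex:A1"),
}


def objects(graph, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def element_value(graph, element):
    """Resource in the statement position that `element` represents."""
    for predicate, position in POSITION.items():
        for statement in objects(graph, element, predicate):
            values = objects(graph, statement, position)
            if values:
                return values[0]
    return None


def element_source(graph, element):
    """Resource denoted by the annotation that `element` is based on."""
    for annotation in objects(graph, element, "kiao:basedOnResourceOf"):
        resources = objects(graph, annotation, "kiao:denotesResource")
        if resources:
            return resources[0]
    return None


print(element_value(GRAPH, "ex:E1"))   # pro:000000017
print(element_source(GRAPH, "ex:E1"))  # pro:000000017
```

In this example the element and its source happen to be the same resource, but the sketch does not enforce that, since an element may be based on a source that is related to it rather than identical.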

To record the provenance of a StatementElement as an element of another statement, normally in another StatementSetAnnotation, one of the last three “based on” properties can be used to relate it to another reified rdf:Statement, e.g., kiao:basedOnSubjectOf. For example, consider a third annotation representing the regulation of the translation from the second annotation. This annotation and its provenance could be represented as depicted in Figure 4, using “regulation of translation” from the Gene Ontology (go:0006417). The part of Figure 4 below the dashed line shows that statement element ex:E3 is the object of statement ex:S4 and is based on the subject of statement ex:S1 (i.e., the translation event ex:T1 from Figure 2).
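The lower part of Figure 4 can be sketched as follows, assuming triples are plain tuples. The function name statement_sources and the ex: prefix on regulates are assumptions made for the example.

```python
# Maps each statement-level "based on" property to the position it points at.
BASED_ON = {
    "kiao:basedOnSubjectOf": "rdf:subject",
    "kiao:basedOnPredicateOf": "rdf:predicate",
    "kiao:basedOnObjectOf": "rdf:object",
}

GRAPH = {
    # ex:S1 from the second annotation: (T1 subClassOf go:0006412)
    ("ex:S1", "rdf:subject", "ex:T1"),
    ("ex:S1", "rdf:predicate", "rdfs:subClassOf"),
    ("ex:S1", "rdf:object", "go:0006412"),
    # ex:S4 from the third annotation: (R1 regulates T1); namespace assumed
    ("ex:S4", "rdf:subject", "ex:R1"),
    ("ex:S4", "rdf:predicate", "ex:regulates"),
    ("ex:S4", "rdf:object", "ex:T1"),
    # ex:E3 stands for the object of ex:S4 and is based on the subject of ex:S1
    ("ex:E3", "rdf:type", "kiao:StatementElement"),
    ("ex:E3", "kiao:isObjectOf", "ex:S4"),
    ("ex:E3", "kiao:basedOnSubjectOf", "ex:S1"),
}


def statement_sources(graph, element):
    """List (property, statement, resource) for each statement-level source."""
    sources = []
    for s, p, statement in graph:
        if s == element and p in BASED_ON:
            position = BASED_ON[p]
            for s2, p2, resource in graph:
                if s2 == statement and p2 == position:
                    sources.append((p, statement, resource))
    return sorted(sources)


print(statement_sources(GRAPH, "ex:E3"))
# [('kiao:basedOnSubjectOf', 'ex:S1', 'ex:T1')]
```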

Just as it is not required that parts of an annotation be directly used in another annotation to make a kiao:basedOnAnnotation statement, it is not required that elements be identical to make kiao:basedOnSubjectOf, etc. statements. For example, if a specific protein is used in the representation of a complex event (e.g., a protein transport event), an annotator can create a generic protein class annotation and document the specific protein element in the statement set as its source.


 

Fig. 4. Example of kiao:StatementSetAnnotation mentioning two statements: (R1 subClassOf go:0006417) and (R1 regulates T1). A partial example of element-level provenance is shown below the dashed line documenting that T1 is based on part of Statement S1 (from Fig. 2). See Fig. 1 for figure key.

Conclusion

We have presented an ontology extension to the IAO for representing annotation content and the provenance of that semantic content. The model represents annotation content, i.e., concepts or sets of assertions denoted by annotations, and enables tracking of the provenance of content between annotations at two levels of granularity. Coarse-grained provenance is modeled by linking annotations to the annotation they were based on using a single relation. A small set of relations can further be used to provide far more detailed provenance about each element of semantic structures.

Our model is compatible with existing RDF-based annotation proposals. Because of the layered nature of our model, annotations represented in it could be understood by existing tools largely without change. We will submit this model to the IAO for consideration for inclusion. We believe adoption of this model within the Bio-ontologies community will enable standardization and interoperability of annotations both within that community and further open up these annotations for use in the broader Semantic Web.

Acknowledgements

We appreciate the support of the other members of the CCP. This work was supported by NIH grants R01LM009254, R01GM083649, and R01LM008111 to LH, R01LM010120 to KV, and 3T15 LM009451-03S1 to KL.

References

Ciccarese, P., Ocana, M., Das, S., and Clark, T. 2010. AO: An Open Annotation Ontology for Science on the Web. Paper presented at the Proceedings of the Bio-ontologies SIG at ISMB 2010, Boston, MA.

Clark, T., Ciccarese, P., Attwood, T., de Waard, A., and Pettifer, S. 2011. A Round-Trip to the Annotation Store: Open, Transferable Semantic Annotation of Biomedical Publications. Paper presented at the Beyond the PDF Workshop, University of California San Diego.

Hunter, J., Cole, T., Sanderson, R., and Van de Sompel, H. 2010. The Open Annotation Collaboration: A Data Model to Support Sharing and Interoperability of Scholarly Annotations. Paper presented at the Digital Humanities 2010.

Hunter, L., Lu, Z., Firby, J., Baumgartner Jr., W. A., Johnson, H. L., Ogren, P. V., and Cohen, K. B. 2008. OpenDMAP: An open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-specific gene expression. BMC Bioinformatics 9.

Jonquet, C., Shah, N. H., Youn, C. H., Callendar, C., Storey, M., and Musen, M. 2009. NCBO Annotator: Semantic Annotation of Biomedical Data. In 8th International Semantic Web Conference, Poster and Demonstration Session, ISWC’09.

Livingston, K. M. 2009. Language understanding by reference resolution in episodic memory. Northwestern University.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19.313-30.

Natale, D., Arighi, C., Barker, W., et al. 2007. Framework for a Protein Ontology. BMC Bioinformatics 8:S1.

Riesbeck, C. K., and Martin, C. E. 1986. Direct memory access parsing. In Experience, memory, and reasoning, ed. J. L. Kolodner and C. K. Riesbeck. Hillsdale, NJ: L. Erlbaum.
