Posted on June 28, 2011 by Phillip Lord

The State of Bio-Medical Ontologies

 

Abstract

This paper presents a logic based analysis of the bio-medical ontologies that are contained in the NCBO BioPortal Repository. In total, 218 OBO and OWL ontologies were analyzed using entailment checking and justificatory structure based analysis. It was found that approximately half of all BioPortal ontologies fit into the tractable OWL2EL profile of OWL, with the other half being built in a variety of expressive fragments that range from ALC through to the full expressivity of SROIQ, which underpins OWL2. Moreover, BioPortal contains a large number of logically rich ontologies that have large numbers of non-trivial entailments and non-trivial reasons for these entailments.

 

Authors

Matthew Horridge, Bijan Parsia and Ulrike Sattler

School of Computer Science, The University of Manchester

Introduction

In recent years the number of publicly available ontologies in the bio-medical arena has grown significantly. Many of these ontologies have been made available via the NCBO BioPortal ontology repository [1]. At the time of writing, BioPortal provides access to the imports closures of over 250 bio-medical ontologies in various formats, and is of interest to various consumers of ontologies, from ontologists to tool builders. In particular, the BioPortal corpus provides ontologies that: (1) vary greatly in size; (2) vary greatly in expressivity; (3) are real world ontologies, as opposed to reasoned test-bed ontologies; (4) were designed and built by users (domain experts) for application purposes.

This paper presents a logic based analysis of these ontologies that provides an insight into the logical richness of real world, state-of-the-art ontology construction. In contrast to many of the other published analyses, which are purely syntactic, a logic based analysis, in particular the use of justifications, makes it possible to “go under the hood” and detect the rich interplay of axioms in ontologies that cannot be determined by simple expressivity metrics alone.

The presented results should be of interest to many bio-ontology consumers, from ontology developers through to ontology tool implementers.

Preliminaries

The work presented in this paper focuses on understanding the BioPortal corpus from a logic based perspective. In particular, it focuses on (non-trivial) entailments, or inferences, and the reasons for these entailments. For this reason, the subset of ontologies written in OWL and OBO formats was used in the experiments that follow.
This section provides a brief overview of OWL, OBO and some of the terminology used later in the paper.

OWL The latest version of the Web Ontology Language, OWL 2, became a W3C recommendation in October 2009. An OWL 2 ontology may be regarded as being a set of axioms, which make statements about the domain of interest. OWL 2 is a highly expressive ontology language, and features a rich set of axiom and class constructors. In particular it allows complex class expressions to be built, so for example, it is possible to describe the class of cells that have at least one nucleus. These complex class expressions can then be used in axioms, which state the relationships between them.

OBO In the bio-ontology arena, there is another widely used language, called OBO. This language has good tool support in the form of OBO-Edit, and has a popular flat file format which is easy to read and edit in a regular text editor. Despite the fact that OBO is often described as a simple ontology language that is easy for biologists and domain experts to understand, it is in fact a highly expressive language. Indeed, there is a close relationship between OBO and OWL 2, and it is possible to faithfully translate the logical aspects of an OBO ontology into an OWL 2 ontology. Several mappings between OBO and OWL have been proposed, but the mapping that was used for the experiments described in this paper is the one described and documented by Mungall et al.

Entailments and Reasoning One of the key features of OWL (and essentially OBO), is that it is a logic based ontology language, which means that OWL (OBO) ontologies are amenable to automated reasoning. This means that it is possible to use “off the shelf” reasoners to perform various reasoning tasks such as consistency checking, checking for unsatisfiable classes, computing whether one class is a subclass of another class, and whether an individual is an instance of a class. These tasks can be seen as entailment checking tasks. In OWL, an entailment may be regarded as a statement, or more correctly an axiom, that follows from an ontology or a subset of an ontology, which is itself a set of axioms. The process of reasoning is used to compute whether or not an entailment holds in an ontology. Entailments may be asserted directly into an ontology or may be inferred from other axioms in the ontology. For example, the AminoAcid ontology does not directly state that Methionine is a subclass of LargeAliphaticAminoAcid, but, due to other axioms in the ontology, it entails this axiom.

Justifications In the OWL world, justifications are a popular form of explanation for entailments in ontologies. For any given entailment in an ontology, there will be one or more justifications that explain why the entailment holds. To be more precise, a justification is a minimal subset of an ontology that is sufficient for an entailment to hold. A justification is minimal in the sense that the entailment in question does not hold in any proper subset of the justification. An example of a real justification for an entailment in a BioPortal ontology is presented later in this paper, but as a simple abstract example consider the ontology O = {A SubClassOf B, B SubClassOf D, B SubClassOf C}, which entails A SubClassOf C. A justification for this entailment is the minimal set of axioms {A SubClassOf B, B SubClassOf C}.
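The abstract example can be reproduced with a small brute-force sketch. This is only an illustration, not the method used in the experiments: it assumes an ontology made solely of atomic SubClassOf axioms, so that entailment reduces to reachability, and the names ONTOLOGY, entails and justifications are hypothetical helpers introduced here. Real OWL justification finding needs a full reasoner.

```python
from itertools import combinations

# Each pair (sub, sup) stands for the axiom "sub SubClassOf sup".
ONTOLOGY = frozenset({("A", "B"), ("B", "D"), ("B", "C")})


def entails(axioms, goal):
    """True if `goal` follows from atomic SubClassOf axioms by transitivity."""
    sub, sup = goal
    reachable, frontier = {sub}, [sub]
    while frontier:
        current = frontier.pop()
        for lhs, rhs in axioms:
            if lhs == current and rhs not in reachable:
                reachable.add(rhs)
                frontier.append(rhs)
    return sup in reachable


def justifications(ontology, goal):
    """Enumerate every minimal subset of `ontology` that entails `goal`."""
    found = []
    axioms = sorted(ontology)
    # Visit subsets by increasing size, so any entailing subset that does not
    # contain an earlier result is guaranteed to be minimal.
    for size in range(len(axioms) + 1):
        for subset in combinations(axioms, size):
            candidate = frozenset(subset)
            if any(j <= candidate for j in found):
                continue  # contains a smaller justification, so not minimal
            if entails(candidate, goal):
                found.append(candidate)
    return found


if __name__ == "__main__":
    for j in justifications(ONTOLOGY, ("A", "C")):
        print(sorted(j))
```

Enumerating all subsets is exponential in the number of axioms, so this approach is only workable for toy inputs like the one above.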

Non-Trivial Entailments In the presentation of results that follows, the notion of a non-trivial entailment is used. In the context of this paper, a non-trivial entailment is an entailment that has at least one justification that is not itself. In other words, either a non-trivial entailment is not directly asserted into an ontology, or there is a further reason (justification) as to why the entailment holds besides it being directly asserted.
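One way to read this definition operationally: an asserted axiom is non-trivial exactly when it still follows after it is removed from the ontology, because any justification other than the axiom itself cannot contain it. The sketch below applies that reading to a toy ontology of atomic SubClassOf axioms. The helpers entails and is_non_trivial are hypothetical, and the reachability-based entailment check is a simplifying assumption standing in for a real reasoner.

```python
def entails(axioms, goal):
    """True if `goal` follows from atomic SubClassOf axioms by transitivity."""
    sub, sup = goal
    reachable, frontier = {sub}, [sub]
    while frontier:
        current = frontier.pop()
        for lhs, rhs in axioms:
            if lhs == current and rhs not in reachable:
                reachable.add(rhs)
                frontier.append(rhs)
    return sup in reachable


def is_non_trivial(ontology, entailment):
    """Classify an entailment of `ontology` as non-trivial (True) or trivial (False)."""
    ontology = frozenset(ontology)
    if not entails(ontology, entailment):
        raise ValueError("not an entailment of the ontology")
    if entailment not in ontology:
        return True  # inferred only: every justification differs from the axiom
    # Asserted: non-trivial only if it is also derivable without itself.
    return entails(ontology - {entailment}, entailment)


if __name__ == "__main__":
    toy = {("A", "B"), ("B", "C"), ("A", "C")}
    print(is_non_trivial(toy, ("A", "C")))  # asserted, but also derivable via B
    print(is_non_trivial(toy, ("A", "B")))  # asserted, with no other reason
```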

Materials and Method

Apparatus All experiments were performed on a 3.06 GHz Intel Core 2 Duo MacBook Pro, with a maximum of 4 GB allocated to the Java virtual machine. Three reasoners were used: JFaCT, which is a Java version of the FaCT++ reasoner, HermiT, and Pellet.

Corpus The BioPortal ontology repository was accessed on the 12th March 2011 using the BioPortal RESTful Service API. In total, 261 ontology documents (and their imports closures) were listed as being available. Out of these, there were 125 OWL ontology documents, and 101 OBO ontology documents, giving a total of 226 “OWL compatible” ontology documents that could theoretically be parsed into OWL ontologies.

Each listed OWL compatible ontology document was downloaded and parsed by the OWL API. Any imports statements were recursively dealt with by downloading the document at the imports statement URL and parsing it into the imports closure of the original BioPortal “root” ontology. If an imported ontology could not be accessed (for whatever reason) the import was silently ignored.

Out of the 226 OWL compatible ontology documents that were listed by the BioPortal API, 7 could not be downloaded due to HTTP 500 errors, and one ontology could not be parsed due to syntax errors. This left a total of 218 OWL and OBO ontology documents that could be downloaded and parsed into OWL ontologies. After parsing, four of the ontologies were found to violate the OWL 2 global restrictions. In all cases, the violation was caused by the use of transitive (non-simple) properties in cardinality restrictions. These ontologies were discarded and were not processed any further in the entailment checking experiments.

Procedure Each ontology was checked for consistency. Next each consistent ontology was classified and realized in order to measure reasoner performance and extract entailments for inspection. It should be noted that, due to practicalities, a time out of 30 minutes of CPU time per ontology for the tasks of consistency checking, classification and realization was imposed. Entailed (both asserted and inferred axioms) direct subsumptions between named classes (i.e. axioms of the form A SubClassOf B) were extracted, along with direct class assertions between named individuals and named classes (i.e. axioms of the form b Type A). These kinds of entailments are of interest because they are the kinds of entailments that are exposed through the user interfaces of tools such as Protégé, and are therefore the kinds of entailments that users of these tools are interested in, and typically seek justifications for. Next, the set of entailments for each ontology was filtered in order to split them into trivial and non-trivial entailments. Finally, justifications for each non-trivial entailment were computed.
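The restriction to direct subsumptions can be illustrated on a toy class hierarchy. The sketch below is not the experimental pipeline: it assumes atomic SubClassOf axioms with no equivalent classes, uses a plain transitive closure in place of reasoner classification, and the function names are hypothetical.

```python
def transitive_closure(asserted):
    """All subsumptions (sub, sup) entailed by atomic SubClassOf axioms."""
    closure = set(asserted)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def direct_subsumptions(entailed):
    """Keep (sub, sup) only if no other named class lies strictly between them."""
    names = {name for pair in entailed for name in pair}
    direct = set()
    for sub, sup in entailed:
        if sub == sup:
            continue
        has_intermediate = any(
            (sub, mid) in entailed and (mid, sup) in entailed
            for mid in names
            if mid not in (sub, sup)
        )
        if not has_intermediate:
            direct.add((sub, sup))
    return direct


if __name__ == "__main__":
    asserted = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
    entailed = transitive_closure(asserted)
    print(sorted(direct_subsumptions(entailed)))
```

In this toy input, A SubClassOf C is asserted but is not a direct subsumption, because B lies between A and C. It would therefore not appear among the extracted entailments.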

Results

Space limitations prevent the direct inclusion of results in this paper, but detailed tables, with classification times, number of non-trivial entailments, and number of justifications per entailment are available at: http://owl.cs.manchester.ac.uk/bio-ontologies.

Size and Expressivity The average number of logical axioms per ontology was 20,532 with a standard deviation of 115,163 and maximum number of 1,484,923.

 

Table 1 Class Constructor Usage

                                     Occurrences Per Ontology
Class Constructor         # Onts       Mean     StDev       Max
ObjectSomeValuesFrom         133       7841     38731    351672
ObjectAllValuesFrom           46        367      1275      7757
ObjectMinCardinality          32         14        59       340
ObjectMaxCardinality          16          3         5        24
ObjectExactCardinality        32         17        45       257
ObjectHasValue                 8          5         3         9
ObjectIntersectionOf          61       1628      7202     54038
ObjectUnionOf                 69         60       243      2024
ObjectComplementOf            23         10        20       100
ObjectOneOf                   18          6        10        48

 

Table 2 Axiom Type Usage

                                          Occurrences Per Ontology
Axiom Type                     # Onts       Mean     StDev       Max
AsymmetricObjectProperty            4          2         1         2
ClassAssertion                     48      11470     43549    232642
DataPropertyAssertion              28      35906    165907    896647
DataPropertyDomain                 51         16        23       118
DataPropertyRange                  53         26        64       449
DifferentIndividuals               14          6        10        42
DisjointClasses                    84        576      2395     20238
DisjointObjectProperties            3          7         8        19
EquivalentClasses                  71        509      1846     10757
EquivalentObjectProperties          3          3         2         5
FunctionalDataProperty             42         21        54       338
FunctionalObjectProperty           52         15        46       337
InvFunctionalObjectProperty        26         17        64       337
InverseObjectProperties            61         28        65       475
IrreflexiveObjectProperty           5          2         1         3
ObjectPropertyAssertion            24      13092     53724    268578
ObjectPropertyDomain               69         33        46       259
ObjectPropertyRange                66         35        47       268
ReflexiveObjectProperty             9          4         3         9
SameIndividual                      1          1         0         1
SubClassOf                        200      12529     56568    513246
SubDataPropertyOf                  13         46       122       466
SubObjectPropertyOf                73         47       123       958
SubPropertyChainOf                  6          3         1         4
SymmetricObjectProperty            25          4         4        18

 

Class constructor usage and axiom type usage are shown in Table 1 and Table 2, where “# Onts” is the number of ontologies that the constructor or axiom type appears in, and “Occurrences Per Ontology” is its usage in the ontologies that use it.

A large proportion, 123, of the ontologies correspond to OWL2EL ontologies. The remainder range from the moderately expressive AL family of languages through to SROIQ, which represents the full expressivity of OWL2.

Reasoner Performance There were three ontologies for which consistency checking could not be completed within 30 minutes of CPU time. These were GALEN, the Foundational Model of Anatomy and NCBI Organismal Classification. A complete listing of times per ontology per reasoner may be found at the link at the start of this section.

Inconsistent Ontologies Out of the ontologies that could be processed by one or more of the three reasoners, 5 were found to be inconsistent.

Unsatisfiable Classes Out of the remaining consistent ontologies, 1 ontology contained 9 unsatisfiable classes in its signature.

Non-Trivial Entailments A total of 72 ontologies had non-trivial entailments (direct subclass axioms and direct class assertions). Table 3 shows a summary of total entailments and non-trivial entailments. Note that the percentage of Non-Trivial Entailments (Column 4) is the mean/max percentage across the corpus rather than the percentage of Column 3 to Column 2.

 

Table 3 Average Number of Entailments Per Ontology

                       Total    Non-Trivial    Non-Trivial %
Mean                    5509           1549             30.2
StdDev                 16030           6187             26.6
Min                        7              1             0.03
Max                    89468          49537              100
25th Percentile       175.75          35.00             9.66
75th Percentile      1838.50         277.00            44.64
90th Percentile     10069.40        2392.90            71.82
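The note about Column 4 matters because a mean of per-ontology percentages generally differs from the ratio of the column means. In Table 3, 1549 / 5509 is roughly 28.1%, whereas the reported mean percentage is 30.2%. The sketch below shows the two calculations side by side. The counts are invented purely for illustration and are not drawn from the corpus.

```python
from statistics import mean


def mean_of_percentages(counts):
    """Average the per-ontology percentage of non-trivial entailments."""
    return mean(100.0 * non_trivial / total for total, non_trivial in counts)


def pooled_percentage(counts):
    """Percentage of non-trivial entailments over all ontologies combined."""
    total = sum(t for t, _ in counts)
    non_trivial = sum(n for _, n in counts)
    return 100.0 * non_trivial / total


if __name__ == "__main__":
    # Hypothetical (total entailments, non-trivial entailments) per ontology.
    counts = [(100, 50), (1000, 100)]
    print(f"mean of percentages: {mean_of_percentages(counts):.1f}%")
    print(f"pooled percentage:   {pooled_percentage(counts):.1f}%")
```

The pooled figure is dominated by large ontologies, while the mean of percentages weights every ontology equally, which is the convention Table 3 is described as using.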

 

Justification Metrics For ontologies with non-trivial entailments, Table 4 shows the average number and size of justifications per entailment per ontology.

 

Table 4 Average Number and Size of Justifications Per Ontology

                     Number      Size
Mean                   2.83      3.01
SD                     3.44      3.86
Min                       1         1
Max                     837        37
25th Percentile        1.52      1.14
75th Percentile        2.40      2.52
90th Percentile        5.00      6.61

 

Discussion

Expressivity The BioPortal corpus contains ontologies that vary greatly in expressivity and size. Interestingly, 133 ontologies, well over half of the 218 OWL and OBO ontologies in the repository, use OWL SomeValuesFrom restrictions, with an average of 7841 restrictions per ontology. Other class expression types are used by slightly fewer ontologies, but the usage of AllValuesFrom (46 onts) and UnionOf (69 onts), IntersectionOf (61 onts) and ComplementOf (23 onts) is still notable.

Roughly speaking, BioPortal ontologies can be split into two halves. One half contains OWL2EL ontologies and the other half highly expressive ontologies, some of which use the full expressivity of OWL2. With regard to the OWL2EL ontologies, it is not clear whether it was a deliberate attempt, or design goal, to remain in a lightweight, tractable fragment of OWL, whether tooling was used to impose a limit on expressivity, or whether it was accidental that these ontologies fall into this profile. In any case, the sizeable number of ontologies that fall into this profile surely vindicates the design of the profile and its inclusion in the OWL 2 specification. The remainder of the ontologies fall into various expressive fragments of OWL, and use features that were introduced into OWL 2. For example, a handful of ontologies use asymmetric and (ir)reflexive properties, property chain axioms, and qualified cardinality restrictions.

Inconsistent Ontologies On closer inspection of the five inconsistent ontologies, it became apparent that not one of the ontologies was inconsistent for trivial reasons (such as a literal not being in the range of a property). Each ontology was natively OWL, and had multiple justifications for the inconsistency with several axioms per justification. One particular example had around 366 justifications, each roughly 10 axioms in size. While it is odd that inconsistent ontologies were uploaded to the BioPortal, the fact that there are inconsistent ontologies, with non-trivial reasons, indicates the use of fairly expressive class and axiom constructors. It is unclear what (OWL) tool chain was used to construct these inconsistent ontologies, and what reasoning, if any, was used during the development of the ontologies.

Unsatisfiable Classes One BioPortal ontology contained 9 unsatisfiable classes in its signature, with each unsatisfiable class having several overlapping justifications for its unsatisfiability. It may seem strange that an ontology with unsatisfiable classes, that could be considered to be bugs, was uploaded to the repository. However, in this case the ontology is a native OBO ontology. It is therefore doubtful that full sound and complete OWL reasoning was used to detect these unsatisfiable classes.

Non-Trivial Entailments and Justifications One of the most surprising aspects of the BioPortal ontologies is that a large number of them (72/218) contain large numbers of non-trivial entailments. Recall that non-trivial entailments are entailments for which there is at least one justification that is not the entailment itself. On average, there were roughly 1500 non-trivial entailments per ontology, out of an average of 5500 entailments per ontology. One of the most striking examples is the Coriell Cell Line Ontology, which contains over 45,000 non-trivial entailments and an average of 4 justifications per entailment, peaking at 65 justifications for one particular entailment.

In terms of number of justifications, the UBERON ontology has around 4000 entailments with an average of 25 justifications per entailment (SD=60) and one entailment with a staggering 804 justifications. Moreover, this ontology falls into the lightweight OWL2EL profile. It is therefore a case in point that low expressivity does not necessarily indicate a logically impoverished ontology and a low degree of interplay between axioms in an ontology.

Another noteworthy example, that is logically rich, is the International Classification of Nursing Practice ontology (of SHIF expressivity), which has slightly over 2000 entailments, with an average of 8 justifications per entailment (SD=33.61) with one entailment that has 837 justifications. An example justification from this ontology is shown below in Figure 1. It should be noted that justifications of this ilk, in terms of type, style and length of axioms, are common for each entailment in the ontology, and indeed, for many of the entailments in the other 72 ontologies.

In terms of size, across all ontologies, most justifications were around 3 axioms (SD=3.86) and therefore not trivial single axiom justifications. The largest justification over all ontologies was a massive 37 axioms in size. All of this is even more significant given that the entailments that the justifications are for are direct subclass axioms and class assertions, which implies that the justifications are not simply “long chains” of named subclass axioms.

Summary and Conclusions

This paper has presented an analysis of OWL and OBO bio-medical ontologies that are contained in the NCBO BioPortal. Half of the ontologies use the tractable OWL2EL language, and the other half vary greatly in expressivity, up to the full expressivity of OWL 2. A logic based analysis, which involved checking consistency, computing entailments and then computing justifications for these entailments revealed that a significant proportion of the ontologies have many entailments that have large numbers of sizeable justifications. In essence, the justificatory structure of non-trivial entailments indicates a panoply of logical richness that is present throughout many of the biomedical ontologies in the BioPortal.

References

[1] Daniel L. Rubin et al. (2008) A Web Portal to BioMedical Ontologies. AAAI Spring Symposium Series.
