Posted on October 4, 2012 by Phillip Lord

Analysis of MeSH and IPC as a Prerequisite for Guided Patent Search

Abstract

Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, but they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since all patents are assigned at least one class code, it should be possible to use these assignments automatically, much as MeSH annotations are used in PubMed.

Developing a system for this task requires a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the MeSH terms and IPC classes, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms.

Our analysis shows a strong structural similarity between the hierarchies, but large differences in terms and annotations. The low number of IPC class assignments and the lack of literal occurrence of class definitions in patent texts imply that current patent search is severely limited. We propose a system for guided patent search that overcomes these limits by using class co-occurrence information and ontological resources.

Authors

D. Eisinger1,2,*, T. Wächter1, M. Bundschus2, U. Wieneke2 and M. Schroeder1

1TU Dresden, BIOTEC, Tatzberg 47/49, 01307 Dresden

2Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg

1 Introduction

The Medical Subject Headings (MeSH) are used as a document indexing system in the biomedical literature database PubMed. MeSH terms are used to automatically improve the recall of PubMed searches through query expansion: By mapping keywords from a search query to MeSH terms, relevant documents are included in the search results even if they only contain synonyms or hyponyms of the original keyword.
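The expansion step described above can be illustrated with a minimal Python sketch. The thesaurus below is a hand-made toy structure, not actual MeSH records, and the names `THESAURUS`, `find_heading` and `expand` are assumptions introduced only for this illustration; PubMed's real term mapping is considerably more involved.

```python
# Toy thesaurus standing in for MeSH. The entries are simplified examples,
# not real MeSH records: each heading lists synonyms and narrower headings.
THESAURUS = {
    "neoplasms": {"synonyms": ["tumors", "cancer"],
                  "narrower": ["carcinoma", "sarcoma"]},
    "carcinoma": {"synonyms": [], "narrower": ["adenocarcinoma"]},
    "sarcoma": {"synonyms": [], "narrower": []},
    "adenocarcinoma": {"synonyms": [], "narrower": []},
}


def find_heading(keyword):
    """Return the heading whose label or synonym equals the keyword, if any."""
    key = keyword.lower()
    for heading, entry in THESAURUS.items():
        if key == heading or key in entry["synonyms"]:
            return heading
    return None


def expand(keyword):
    """Return the matched heading, its synonyms, and all narrower headings."""
    heading = find_heading(keyword)
    if heading is None:
        return [keyword.lower()]  # unmapped keywords are searched as entered

    terms = []

    def visit(current):
        # Add the heading and its synonyms, then descend to the hyponyms.
        for term in [current, *THESAURUS[current]["synonyms"]]:
            if term not in terms:
                terms.append(term)
        for child in THESAURUS[current]["narrower"]:
            visit(child)

    visit(heading)
    return terms


print(expand("cancer"))
```

A document mentioning only "adenocarcinoma" would then match a query for "cancer", which is the recall gain the paragraph describes.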

Patents are classified into hierarchical systems of categories by patent offices. The International Patent Classification (IPC) is the most commonly used system, and its hierarchy forms the basis of other important systems such as the European ECLA. The IPC is considerably less accessible than MeSH: Alphanumeric codes are used instead of easily understandable terms, and class definitions are complicated and depend on each other. Even so, classification data is important for patent search because pure keyword search is often insufficient: Patent language is very complicated, and many companies use very unspecific vocabulary in order to make the scope of their patents as broad as possible. At the same time, it is vitally important to these same companies to find relevant competitor patents.

As a consequence of this high search complexity, large pharmaceutical companies employ patent professionals to relieve their scientists of this difficult and time-consuming task. Many researchers without access to such resources have been ignoring patents in favor of more accessible scientific literature. However, they risk missing many current research results that are not published in journals before patent protection has been granted.

Consequently, it is desirable to offer scientists an easier option to formulate patent queries that include classification information. In order to provide such assistance, it is essential to have a clear understanding of the properties of both classification systems. In this paper, we therefore investigate differences between the IPC and the established MeSH hierarchy and their implications for patent search. As a solution to problems we discovered, we propose a guided patent search system that assists the user by offering query expansion suggestions derived from class co-occurrence data or using knowledge from external ontologies.

2 Related Work

The importance of MeSH for the biomedical field has led to extensive research: There are mature approaches for automatically assigning MeSH terms to documents (Trieschnigg et al., 2009; Tsatsaronis et al., 2012), and MeSH terms are successfully used for query expansion (Lu et al., 2009). MeSH has also been used in combination with patents, e.g. for tagging diseases (Gurulingappa et al., 2009). Research on the IPC is much more limited, but scientific interest has been growing over the last decade. Most publications cover methods for the automatic assignment of classes to patents (e.g. Tikk et al., 2007) or the use of existing assignment information for prior art search (e.g. Szarvas et al., 2009).

To date, there is no in-depth analysis of either hierarchy and no systematic comparison of the two.

3 Comparative Analysis

Our analysis of MeSH and IPC covers the hierarchies and terms of the systems themselves as well as their use for document categorization. For the latter, we collected classification information from all patent applications to the European Patent Office (EPO) between 1982 and 2005 (over one million) as well as the annotations of all PubMed documents published by early 2011 (over 20 million). Because the goal of our analysis is to assist patent search, we are less interested in the reasons for any discrepancies than in their implications for search. Table 1 summarizes some core results of our analysis, and subsections 3.1 and 3.2 give more detailed reports.

Table 1. Comparative analysis MeSH vs. IPC. The hierarchical structures are similar, but MeSH terms are shorter and more likely to occur in text. The number of MeSH annotations per document far surpasses the number of classes per patent.

Property                                                 | MeSH     | IPC
---------------------------------------------------------|----------|------
Number of hierarchy entries                              | 54095    | 69487
Number of hierarchy levels                               | 13       | 14
Avg. string length labels/definitions                    | 17.9     | 50.2
Avg. number of synonyms                                  | 7.8      | 0
Literal occurrence in text                               | frequent | rare
Avg. number annotations per doc.                         | 8.93     | 1.92
Documents with multiple annotations                      | 86.4%    | 52.8%
Documents with related annotations (same hierarchy tree) | 80.7%    | 45.5%

3.1 Hierarchies and Terms

As Table 1 shows, the structural comparison of the hierarchies did not reveal any significant differences: Their sizes are in the same range (about 70000 IPC classes and 54000 entries in the MeSH tree), they have almost the same depth (14 levels for IPC, 13 for MeSH) and the node distributions are very similar (cf. Figure 1). The terms on the other hand show two major differences:

Emphasis on terms/concepts versus identifiers:

While MeSH assigns identifiers to its headings, the emphasis is clearly on the term itself. The IPC, on the other hand, is first and foremost a collection of alphanumeric codes that signify a class's place in the hierarchy rather than its meaning. The meaning is instead contained in additional class definitions that are more akin to MeSH’s scope notes.

Length of terms/occurrence in text:

As Table 1 shows, IPC definitions are usually much longer than MeSH terms. Most of them are also considerably more abstract and complicated, and many of them depend on their ancestor classes, e.g. “containing materials, or derivatives thereof, of undetermined constitution” (class A61K 8/96). Unlike MeSH, the IPC also does not include any synonyms for the class definitions. All of these differences contribute to a very low probability of literal occurrences of IPC definitions in text, while MeSH terms occur frequently.

Fig. 1. Terms/classes per hierarchy level. Both hierarchies expand in similar ways.

As a consequence, IPC classes cannot be assigned to patents by simply extracting them from text. This is one of the main reasons for the much more extensive use of automatic (pre-)annotation of PubMed documents compared to patents. One possible approach to solving this problem is the assignment of classes using machine learning methods, i.e. training a classifier on existing classification data to predict assignments for new data. The World Intellectual Property Organization (WIPO) has made a patent categorization tool available (https://www3.wipo.int/ipccat/).

3.2 Usage for Document Classification

IPC and MeSH are both used as classification/annotation systems for documents: all patent applications are assigned at least one IPC code, and all articles in participating journals are annotated with appropriate MeSH terms. As Table 1 shows, the average number of MeSH annotations is much higher than the average number of assigned IPC classes.

We also measured the diversity of IPC classes/MeSH terms assigned to the same document as follows: Given a hierarchy H (in our case either MeSH or IPC) and two entries a and b of the hierarchy, we define the distance of a and b as the length of the shortest path between them in H. For a subset A of H consisting of all annotations to a single document, we then define the minimal and maximal annotation distances as the minimum and maximum over the pairwise distances of elements of A. As Figure 2 shows, the classes which are assigned to a single patent are usually much less diverse than the MeSH annotations of a single document: Over 83% of all patents are classified into only one of the eight main trees that IPC consists of. Since the main trees correspond to extremely general domains such as “Human Necessities”, we believe that some aspects of many patents are not covered by the currently assigned classes.

Fig. 2. Hierarchical distance of multiple annotations assigned to the same document (left: PubMed/MeSH, right: patents/IPC). Both the maximal distance and the difference between maximal and minimal distances are much larger for PubMed documents.
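The annotation distance defined in this section can be sketched in a few lines of Python. The hierarchy below is a tiny made-up excerpt stored as a child-to-parent map, and the virtual `ROOT` node that joins the main trees is an assumption of this sketch; the text does not say how distances between separate main trees were handled.

```python
from itertools import combinations

# Hypothetical toy hierarchy as a child -> parent map. "ROOT" is an assumed
# virtual root that joins the main trees so that every pair is connected.
PARENT = {
    "A": "ROOT", "C": "ROOT",
    "A61": "A", "A61K": "A61", "A61P": "A61",
    "C07": "C", "C07D": "C07",
}


def path_to_root(node):
    """Return the list of nodes from node up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def distance(a, b):
    """Length of the shortest path between a and b in the tree."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a))}
    for steps_from_b, n in enumerate(path_to_root(b)):
        if n in steps_from_a:  # first common ancestor on the way up from b
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not connected")


def annotation_distances(annotations):
    """Return (minimal, maximal) pairwise distance, or None for < 2 entries."""
    pairs = combinations(sorted(set(annotations)), 2)
    dists = [distance(a, b) for a, b in pairs]
    return (min(dists), max(dists)) if dists else None


print(annotation_distances(["A61K", "A61P", "C07D"]))  # (2, 6)
```

In a tree, the shortest path between two nodes always runs through their lowest common ancestor, so walking both nodes towards the root is sufficient and no general graph search is needed.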

3.3 Problems for IPC-based Search

It could be argued that some of the described discrepancies are likely caused by differences in classification guidelines between patent offices and PubMed and may therefore be intended. However, that does not change the fact that both IPC and MeSH are used to improve search results on document corpora, and we believe that the use of IPC for that purpose comes with two serious disadvantages.

Complexity of the classification system: The complexity of the IPC terms causes serious problems for non-professional patent searchers. On the other hand, the exclusive use of keywords for searching patent databases usually leads to poor results due to the complicated language used in patents.

Sparse class assignments: The low number of class assignments in combination with the relatively close relation of co-assigned classes (see section 3.2) indicates that relevant aspects of many patents are not covered by the existing class assignments. Consequently, the recall of patent searches using the classification may be lower than expected.

Given these disadvantages, patent search engines should offer additional functionality for helping the user find the required results. Since the class definitions are needed to understand the meaning of the class codes, the system must include easy access to them. Additionally, since many definitions depend on their ancestor classes, the engine should give the user an easy overview of the relevant parts of the hierarchy. Unfortunately, many existing engines make it needlessly difficult for the user to access this basic information.

4 Guided Patent Search

We propose tackling the aforementioned problems caused by the low number of class assignments by expanding user queries to make up for the “missing” assignments. In most cases, professional patent search queries are a combination of class codes and keywords. We investigated ways to expand both components.

4.1 Proposing Additional Classes

If a user query contains a class code, it can be assumed that the user is confident of the relevance of that class. In order to find closely related classes to suggest to the user, we analyzed the class co-assignments in our patent corpus. We collected all pairs of classes that were assigned to the same patent and ranked them both by the absolute number of co-assignments and by the relative number in the form of their Jaccard index. We hypothesize that pairs of classes with high ranks in either ranking are related closely enough that many searches for one of the classes will also have additional relevant results in the second class.
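The two rankings can be sketched as follows. This is a minimal illustration rather than the pipeline used for the analysis: the function name `co_assignment_rankings` is introduced here, and the five toy patents with their class assignments are invented for the example.

```python
from collections import Counter
from itertools import combinations


def co_assignment_rankings(patents):
    """Rank class pairs by co-assignment count and by Jaccard index.

    patents: iterable of collections of class codes, one per patent.
    """
    class_counts = Counter()
    pair_counts = Counter()
    for classes in patents:
        unique = sorted(set(classes))
        class_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    # Jaccard index: patents with both classes / patents with either class.
    jaccard = {
        (a, b): n / (class_counts[a] + class_counts[b] - n)
        for (a, b), n in pair_counts.items()
    }
    by_count = sorted(pair_counts, key=lambda p: (-pair_counts[p], p))
    by_jaccard = sorted(jaccard, key=lambda p: (-jaccard[p], p))
    return pair_counts, jaccard, by_count, by_jaccard


# Invented toy corpus; the assignments are not taken from real patents.
toy_patents = [
    {"A61K", "A61P"},
    {"A61K", "A61P"},
    {"A61K", "C07D"},
    {"A61K"},
    {"C12N", "C12Q"},
]
counts, jaccard, by_count, by_jaccard = co_assignment_rankings(toy_patents)
print(by_count[0], counts[by_count[0]])        # ('A61K', 'A61P') 2
print(by_jaccard[0], jaccard[by_jaccard[0]])   # ('C12N', 'C12Q') 1.0
```

The toy data shows why both rankings are worth keeping: the absolute count favors pairs that involve frequently assigned classes, while the Jaccard index favors pairs that rarely occur apart, even if both classes are rare.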

Many resulting class suggestions are from the same hierarchical tree but not directly related, i.e. they cover patents with very similar aspects to the ones searched for. Additionally, the rankings include pairs of classes from completely separate parts of the hierarchy that are also highly related; in many cases, they can be considered to represent different points of view. Figure 3 shows one example of such a pair of classes, including their definition hierarchy. While the left class is more application-oriented than the right one, we argue that many searchers interested in patents from one class will also find relevant patents in the other one. Patent searches on a complete corpus revealed that searching for only one of the two classes misses a large share of the possible results: over 50% when only the first class is searched and over 25% when only the second is searched.
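One plausible reading of "missed possible results" is the share of the combined result set of both classes that a search on a single class fails to return. The text does not state the exact computation, so the sketch below is an assumption, and the patent identifiers are invented.

```python
def missed_fraction(results_searched, results_related):
    """Share of the combined results that the single-class search misses."""
    combined = results_searched | results_related
    if not combined:
        return 0.0
    return len(combined - results_searched) / len(combined)


# Invented result sets (patent identifiers) for two related classes.
first_class = {1, 2, 3, 4}
second_class = {3, 4, 5, 6, 7, 8, 9, 10}

print(missed_fraction(first_class, second_class))   # 0.6
print(missed_fraction(second_class, first_class))   # 0.2
```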

4.2 Proposing Additional Keywords

Classes included in the query can also be used as a source of additional keywords. We investigated several ways of doing so.

4.2.1 Extracting keywords from classes

The most straightforward way of turning class codes into keyword suggestions is by considering the corresponding IPC definitions – both by extracting keywords directly and exploiting the morphosyntactic structure of definitions where possible. For definitions containing lists of related terms, we use the system described in (Fabian et al., 2012) to find additional terms with the same relation and suggest the top-ranking ones to the user. As an example, the suggestions for the IPC definition “Orthopaedic devices […] such as splints, casts or braces” include the relevant terms “slings”, “collars”, and “crutches”. For a baseline Boolean keyword query simply connecting the terms with “OR”, the result set almost doubles in size after the inclusion of the generated sibling terms. Our system detected 3053 IPC classes (≈4%) that contain enumerations and can therefore in principle be used in this way for query expansion.
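The enumeration detection and the baseline Boolean query can be illustrated with a rough Python sketch. The regular expression is a simple stand-in assumed for this example, not the detection method used for the 3053 classes, and the sibling terms are passed in by hand here because the set expansion system of Fabian et al. is not reproduced.

```python
import re

# Assumed trigger phrases that introduce a list of examples in a definition.
ENUM_PATTERN = re.compile(r"(?:such as|e\.g\.)\s+(.+)$", re.IGNORECASE)
# Separators between list items: commas, optionally followed by "or"/"and".
ITEM_SEPARATOR = re.compile(r"\s*,\s*(?:or\s+|and\s+)?|\s+(?:or|and)\s+")


def extract_enumeration(definition):
    """Return the example terms listed in a definition, or [] if none."""
    match = ENUM_PATTERN.search(definition)
    if not match:
        return []
    items = ITEM_SEPARATOR.split(match.group(1))
    return [item.strip(" .;") for item in items if item.strip(" .;")]


def boolean_or_query(terms):
    """Join terms with OR, dropping duplicates and quoting multi-word terms."""
    unique = []
    for term in terms:
        if term not in unique:
            unique.append(term)
    return " OR ".join(f'"{t}"' if " " in t else t for t in unique)


definition = "Orthopaedic devices […] such as splints, casts or braces"
listed = extract_enumeration(definition)
siblings = ["slings", "collars", "crutches"]  # suggestions quoted in the text

print(boolean_or_query(listed))
print(boolean_or_query(listed + siblings))
```

Because the expanded query is a plain disjunction, every added sibling term can only enlarge the result set, which is consistent with the near doubling reported for the example.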

Taking this approach one step further, established NLP techniques can be used to extract keywords from the patent texts that belong to the query classes.

4.2.2 Extracting keywords from external ontologies

Existing ontologies are another possible source for additional keywords. If an ontology term can be matched to an IPC class definition, any additional information contained in the ontology about the term (e.g. its synonyms) can be used to add suggestions for the user. As a proof of concept, we used the annotation pipeline from (Doms, 2009) to map MeSH terms to an IPC subset with biomedical relevance. For that purpose, we selected all subclasses of the IPC class “A61K” with the definition “Preparation for medical, dental or toilet purposes” (981 subclasses). The annotation results provided at least one MeSH term for 865 of these classes (88%), and three or more terms for 466 classes (48%). Since our system proposes expansion terms to the user instead of automatically adding them, this high level of coverage represents a valuable addition despite the inclusion of some false positive annotations. The availability of a domain ontology also makes enhanced sibling generation possible: If an IPC definition contains a MeSH term as well as one of its child terms in the form of an example, it is reasonable to assume that all other child terms are also relevant. Following this intuition, IPC definitions of this form (e.g. “Sulfonylureas, e.g. glibenclamide, tolbutamide, chlorpropamide”) lead to term suggestions with very high precision (for the example: “Carbutamide”, “Acetohexamide”, etc.). Of the biomedical IPC subset, this was possible for 72 classes (7%).
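The enhanced sibling generation can be sketched as follows. The parent-to-children map is an illustrative excerpt built from the example in the text, not the real MeSH tree, and plain substring matching stands in for the annotation pipeline, which is an assumption of this sketch.

```python
# Illustrative parent -> children excerpt built from the example above.
# It is NOT the real MeSH tree and uses the term names as given in the text.
CHILDREN = {
    "sulfonylureas": [
        "glibenclamide", "tolbutamide", "chlorpropamide",
        "carbutamide", "acetohexamide",
    ],
}


def suggest_siblings(definition, children=CHILDREN):
    """Suggest the unmentioned children of any parent term that appears in
    the definition together with at least one of its children."""
    text = definition.lower()
    suggestions = []
    for parent, kids in children.items():
        if parent not in text:
            continue
        mentioned = [kid for kid in kids if kid in text]
        if not mentioned:
            continue  # the parent alone is not treated as an enumeration
        for kid in kids:
            if kid not in mentioned and kid not in suggestions:
                suggestions.append(kid)
    return suggestions


definition = "Sulfonylureas, e.g. glibenclamide, tolbutamide, chlorpropamide"
print(suggest_siblings(definition))  # ['carbutamide', 'acetohexamide']
```

Requiring both the parent and at least one child in the definition is what keeps precision high: the definition itself confirms that the child terms are meant as examples of the class.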

Fig. 3. Example for semantically related IPC classes without any hierarchical relation, detected using co-assignment information.

4.2.3 Repurposing class-keyword mappings for class suggestion

After keywords have been mapped to IPC classes in the proposed ways, the mapping data can also be used in the opposite direction: If the user enters a keyword that has been mapped to an IPC class, this class can be suggested to the user for expanding the query. If the class definition is displayed with the suggested class code, even users unfamiliar with the IPC can benefit from classification information. This is especially true for the biomedical domain, since the availability of detailed domain ontologies leads to very precise class suggestions. The WIPO website offers similar functionality (www.wipo.int/tacsy), but it does not make clear what the system’s class proposals are based on.
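Using the mapping in the opposite direction amounts to building an inverted index from keywords to classes. In the sketch below, the class identifiers `CLASS-1` and `CLASS-2` are placeholders rather than real IPC codes, and the function names are introduced for illustration only; the definitions are the two examples quoted earlier.

```python
from collections import defaultdict

# Placeholder class identifiers (not real IPC codes) with the keywords
# that an earlier mapping step is assumed to have produced for them.
CLASS_KEYWORDS = {
    "CLASS-1": ["splints", "casts", "braces", "slings"],
    "CLASS-2": ["sulfonylureas", "tolbutamide", "carbutamide"],
}
CLASS_DEFINITIONS = {
    "CLASS-1": "Orthopaedic devices […] such as splints, casts or braces",
    "CLASS-2": "Sulfonylureas, e.g. glibenclamide, tolbutamide, chlorpropamide",
}


def invert(class_keywords):
    """Build a keyword -> set of class identifiers index."""
    index = defaultdict(set)
    for code, keywords in class_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(code)
    return index


def suggest_classes(keyword, index, definitions):
    """Return (class identifier, definition) pairs for a user keyword."""
    codes = sorted(index.get(keyword.lower(), ()))
    return [(code, definitions.get(code, "")) for code in codes]


index = invert(CLASS_KEYWORDS)
for code, text in suggest_classes("Slings", index, CLASS_DEFINITIONS):
    print(code, "-", text)
```

Returning the definition together with the identifier reflects the point made above: the code alone carries no meaning for users who are unfamiliar with the IPC.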

5 Conclusion

We investigated possibilities for giving patent searchers access to the same advantages that are offered for PubMed through MeSH annotations. Our analysis of MeSH and IPC showed some unique characteristics of the patent domain, most importantly complex class definitions that rarely occur in text as well as a low number of class assignments. These discrepancies must be considered during the development of a patent retrieval system. We proposed ways to overcome these problems by developing specialized components for a guided patent search system. Our experiments showed that class co-occurrence data can provide valuable information to users and that existing ontologies such as MeSH can benefit the patent searcher by including existing domain knowledge.

References

Doms, A. (2009) GoPubMed: Ontology-based literature search for the life sciences. Ph.D. thesis, TU Dresden.

Fabian, G. et al. (2012) Extending ontologies by finding siblings using set expansion techniques. Bioinformatics (accepted).

Gurulingappa, H. et al. (2009) Patent Retrieval in chemistry based on semantically tagged named entities. TREC 2009.

Lu, Z. et al. (2009) Evaluation of query expansion using MeSH in PubMed. Inf Retr Boston, 12(1).

Szarvas, G. et al. (2009) Prior art search using International Patent Classification codes and all-claims-queries. CLEF 2009.

Tikk, D. et al. (2007) A hierarchical online classifier for patent categorization. Emerging Technologies of Text Mining: Techniques and Applications. Idea Group Inc.

Trieschnigg, D. et al. (2009) MeSH Up: Effective MeSH text classification for improved document retrieval. Bioinformatics, 25(11).

Tsatsaronis, G. et al. (2012) A maximum-entropy approach for accurate document annotation in the biomedical domain. JBMS (accepted).

Footnotes

* To whom correspondence should be addressed: daniel.eisinger@biotec.tu-dresden.de
