on October 4, 2012 by Phillip Lord in 2011, Comments (0)

A data-driven approach to automatic discovery of prescription drugs in cardiovascular risk management


Objectives: To evaluate a data-driven approach for automatically identifying medications used in the treatment of cardiovascular disease, and to consider how the learned rules might be applied to ontology curation and evaluation.

Methods: We mined the clinical records of a large cardiovascular patient cohort, focusing on their clinical phenotype and their prescribed medications. Machine learning algorithms from WEKA detected rules linking medications to patients’ treatment status. These rules were then compared to axioms encoded in the NDF-RT Ontology.

Results: For most medications in the dataset we were able to re-discover, with high precision, the prescriptive rules present in the NDF-RT; however, we discovered only 4/19 possible rules for medications linked to Chronic Heart Failure, and no rules for medications linked to Hypertension. We also show that, in some cases, these rules contain more detailed information than is present in the NDF-RT itself.

Conclusion: This experiment demonstrates how data-driven approaches might be used to ameliorate the knowledge acquisition problem for ontology design. We show that the learned rules could be used to evaluate and improve an existing ontology (NDF-RT). We propose that these rules could be used to automatically construct ontological axioms, thus semi-automating the process of de novo ontology construction for a given domain.


Soroush Samadian1, Benjamin M. Good2, Bruce McManus1 and Mark D. Wilkinson3,*

1 UBC James Hogg Research Center, Institute for Heart + Lung Health, Room 166 – 1081 Burrard Street, St. Paul’s Hospital Vancouver, BC, Canada, V6Z 1Y6

2 The Scripps Research Institute, La Jolla, CA 92037, USA

3 Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid, Madrid, España

1 Introduction

Ontologies have become indispensable for biological and biomedical research, and numerous biomedical ontologies have been built to assist in both clinical and research activities (for the role of biomedical ontologies in knowledge management, data integration and decision support, see Bodenreider O., 2008). Building such ontologies is labor-intensive and requires a broad range of specialist expertise, making it both time-consuming and expensive. It would therefore be desirable to automate (or at least semi-automate) the process of ontology building. Given that legacy data is abundant while formal structured knowledge systems describing it are far fewer, one obvious approach to such automation is data-driven knowledge discovery.

A wide range of algorithms and approaches have been developed for automatic knowledge discovery in ontology engineering. Most relevant to this study is the recently published CELOE (Class Expression Learning for Ontology Engineering) algorithm (Lehmann J. et al., 2011). CELOE uses information about instances in an ontology to infer new class-expression features through Inductive Logic Programming, using an approach conceptually similar to that of Good (Good B. M., 2009). We model our current approach after these studies, in that we use patterns in the properties of instance data to suggest classification rules; however, our approach differs in two meaningful ways. First, the CELOE study started with “structured” data already in an ontological/linked-data framework, whereas we begin with “unstructured”, manually entered legacy data stored in tabular format. Second, we know a priori that our data (clinical records of prescribed drugs) contains information about prescribing “preference”; i.e., a “ranking” of rules. While the CELOE system provides a ranking of discovered rules from which the user may select one to turn into a formal class-defining axiom, the rankings we are trying to detect are conceptually distinct in that all of the ranked rules are “correct”, with varying degrees of coverage. We would therefore like to preserve this ranked-rule information for the case of non-Description Logic-based classification.

Our primary questions, therefore, are (a) can we use a CELOE-like approach to bootstrap the creation of an ontological framework de novo from unstructured data, and (b) can we detect subtleties such as the preferential ranking of otherwise valid classification solutions?

In this study, we mine data from the drug component of legacy clinical datasets in an attempt to automatically discover how drugs are being used to treat diseases. We focus on rule discovery between several well-known cardiovascular risk factors and conditions including Diabetes Mellitus (DM), Hypercholesterolemia (HC), Chronic Heart Failure (HF), Stroke (ST) and Hypertension (HT) and the medications prescribed to such patients. Discovered rules are then evaluated; however, in the absence of a true “gold standard” for correct prescriptions, we compared our results against the widely used National Drug File Reference Terminology (NDF-RT1).

2 Materials and Methods

2.1 Dataset and Data Collection

The dataset used for our experiments comprised the clinical records of a cardiovascular patient cohort collected from a referral hospital in Nebraska, USA, between 1986 and 1989, covering 536 unique patients over 1523 “encounters”. Each encounter represents the status of the patient at a specific point in time. A number of clinical observations, medications, and treatment statuses were recorded for every patient. Ethical approval was obtained by our expert clinician prior to data collection. Though the data was provided to us anonymized, it cannot be made publicly available due to privacy concerns. Table 1 shows a small section of the dataset used in this study. For instance, the first patient (row 1 in Table 1) was prescribed Vasotec, Persantine, and other medications at the time the data was recorded. Additionally, the patient is considered “under treatment” for Hypertension and Diabetes; however, the reason “why” this patient is considered to be under treatment is not recorded.

Table 1. The first two rows of the dataset in its original format. In the last four columns, “1” represents “treated” and “0” represents “not treated” for the condition listed in the header.

2.2 The NDF-RT Ontology

The NDF-RT is a public resource developed by the Department of Veterans Affairs (VA) and is available in OWL (Lincoln M. J. et al., 2004). NDF-RT is a concept-oriented terminology – a collection of concepts, each of which has a single, unique “meaning”. NDF-RT’s medications are described in terms of their active ingredients (has_ingredient), mechanisms of action (has_MoA), physiologic effects (has_PE), and therapeutics (may_treat, may_prevent). Of particular interest in this study is the “may_treat” property as it links medications with clinical phenotypes they are intended to treat. This knowledgebase is used as the standard against which we evaluate our predicted rules.

2.3 Preprocessing and Data Transformation

Medication in our dataset was recorded as a combination of brand-names, brand-names with specified doses, generic names, active moieties, abbreviations, and acronyms, all of which were prone to mis-spellings and other typographical variations. Thus to maximize the efficacy of our data-mining we first needed to standardize the data elements.

We used RxNorm (Peters L. et al., 2008) and RxNav2 to harmonize the disparate medication names in our dataset. After harmonization, a unique ID was assigned to each medication corresponding to the numeric identifiers in the NDF-RT Ontology. In total, 150 unique medications were detected in our dataset. We then removed the medications that occurred fewer than three times, leaving a final total of 56 unique, standardized medications. A matrix was then constructed associating each patient with each medication, where values of “1” and “0” indicate that the patient is or is not prescribed the medication, respectively. Additional columns were then added to each patient row showing whether the clinician annotated the patient as treated or not treated for Diabetes, High Cholesterol, Hypertension, Stroke, and Heart Failure; any given patient may be simultaneously treated for several clinical disorders with a wide array of drugs.

Based on prior experience from similar studies (Good B. M., 2009) we selected the JRip algorithm from WEKA (Witten I. H. et al., 2011) to discover relationships between medications and treatment status. We expected to discover disorder-drug associations in this dataset corresponding to the “may_treat” relationships in NDF-RT. The example in Figure 1 shows how the OWL axiom “may treat some Diabetes” present in the NDF-RT ontology could be learned directly from recorded uses of a drug in patient records. This basic pattern can be used to expand class definitions, to evaluate definitions against real-world data, and to provide continuous estimates of confidence for qualitative logical definitions.
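The frequency filter and the binary patient-by-medication matrix described in this section could be sketched roughly as follows. This is an illustration only, not the code used in the study: the record identifiers and toy prescriptions are hypothetical, and counting one occurrence per record is an assumption about how "occurred" was measured.

```python
from collections import Counter

def build_matrix(prescriptions, min_count=3):
    """Build a 0/1 medication matrix from standardized prescriptions.

    prescriptions: dict mapping a record id to the set of standardized
    medication names recorded for it (hypothetical input format).
    Medications seen in fewer than `min_count` records are dropped.
    """
    counts = Counter(med for meds in prescriptions.values() for med in meds)
    kept = sorted(med for med, n in counts.items() if n >= min_count)
    matrix = {
        rid: [1 if med in meds else 0 for med in kept]
        for rid, meds in prescriptions.items()
    }
    return kept, matrix

# Hypothetical toy records; the real dataset is not publicly available.
records = {
    "r1": {"INSULIN", "WARFARIN"},
    "r2": {"INSULIN"},
    "r3": {"INSULIN", "LOVASTATIN"},
    "r4": {"WARFARIN"},
    "r5": {"WARFARIN", "PROBUCOL"},
}
columns, matrix = build_matrix(records)
print(columns)        # medications that survive the frequency filter
print(matrix["r1"])   # 0/1 row for record r1, in column order
```

The treatment-status columns (one per condition) would be appended to each row in the same way before the table is handed to a rule learner.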

Figure 1.
Inferring components of formal class definitions from instance data. The axiom “may treat some Diabetes” (DM) is learned from legacy data automatically and can be added to an existing OWL-DL Class.
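As a rough illustration of the pattern in Figure 1, a learned single-medication rule could be rendered as an OWL class frame in Manchester syntax. The function name and the use of the medication name as the class name are assumptions made for this sketch; NDF-RT itself uses numeric identifiers.

```python
def rule_to_axiom(medication, condition, prop="may_treat"):
    """Render a learned rule as a Manchester-syntax class frame.

    The class and property names are passed in as plain strings; mapping
    them to real NDF-RT identifiers is outside the scope of this sketch.
    """
    return f"Class: {medication}\n    SubClassOf: {prop} some {condition}"

print(rule_to_axiom("INSULIN", "Diabetes"))
```

The resulting text would still need review by a curator before being added to an existing OWL-DL class.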

3 Results

The rules discovered by JRip of most interest in this study are of the form “If (Medication), Then (Condition)”. Consider the first rule generated for diabetes (DM).

    (INSULIN = 1) => DM = Treated (14.0/0.0)

The above rule can be interpreted in natural language as follows: if insulin is prescribed for a patient, the patient is treated for diabetes. The numbers in parentheses at the end of the rule indicate that it covers 14 instances with no exceptions.
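To make the rule notation concrete, the sketch below parses rule lines of the form shown above into their parts and derives a simple per-rule confidence, (covered - exceptions) / covered. The regular expression and the field names are assumptions for illustration, not part of WEKA.

```python
import re

# Matches lines such as "(INSULIN = 1) => DM = Treated (14.0/0.0)".
RULE_RE = re.compile(
    r"^\s*(?P<antecedent>.*?)\s*=>\s*(?P<target>\w+)\s*=\s*(?P<label>\w+)\s*"
    r"\((?P<covered>[\d.]+)/(?P<errors>[\d.]+)\)\s*$"
)

def parse_rule(line):
    """Split a rule line into conditions, consequent, and coverage counts."""
    m = RULE_RE.match(line)
    if m is None:
        raise ValueError(f"not a recognised rule line: {line!r}")
    covered = float(m["covered"])
    errors = float(m["errors"])
    # Each "(NAME = value)" test in the antecedent becomes one tuple;
    # a default rule with an empty antecedent yields an empty list.
    conditions = re.findall(r"\((\w+)\s*=\s*(\w+)\)", m["antecedent"])
    confidence = (covered - errors) / covered if covered else 0.0
    return {
        "conditions": conditions,
        "target": m["target"],
        "label": m["label"],
        "covered": covered,
        "errors": errors,
        "confidence": confidence,
    }

rule = parse_rule("(INSULIN = 1) => DM = Treated (14.0/0.0)")
print(rule["conditions"], rule["label"], rule["confidence"])
```

Because rules are emitted in order, keeping the parsed rules in a list preserves the ranking information discussed later.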

Below we report the rules learned for each condition using the entire dataset and the results of 6-fold cross-validation experiments in terms of accuracy (acc), precision (P), and recall (R). The cross-validation metrics provide a measure to estimate the likely performance of this system on new data sets from a purely computational perspective while the rules themselves can be assessed based on clinicians’ knowledge.

Diabetes (DM) acc: 90.63 % P: 0.89 R: 0.62

(INSULIN = 1) => DM = Treated (14.0/0.0)

(CHLORPROPAMIDE = 1) => DM = Treated (4.0/0.0)

(GLIPIZIDE = 1) => DM = Treated (6.0/1.0)

(GLYBURIDE = 1) => DM = Treated (3.0/0.0)

(TOLBUTAMIDE = 1) => DM = Treated (2.0/0.0)

=> DM = Not_Treated (163.0/11.0)

Hypercholesterolemia (HC) acc: 72.4 % P: 0.83 R: 0.53

(GEMFIBROZIL = 1) => CHOL = Treated (23.0/4.0)

(LOVASTATIN = 1) => CHOL = Treated (17.0/2.0)

(CHOLESTYRAMINERESIN = 1) => CHOL = Treated (8.0/0.0)

(PROBUCOL = 1) => CHOL = Treated (3.0/0.0)

=> CHOL = Not_Treated (141.0/47.0)

Stroke (ST) acc: 91.15 % P: 0.63 R: 0.78

(WARFARIN = 1) => ST = Treated (28.0/8.0)

(PENTOXIFYLLINE = 1) => ST = Treated (5.0/2.0)

=> STROKE = Not_Treated (159.0/3.0)

Heart Failure (HF) acc: 89.58% P: 0.30 R: 0.19


=> HF = Treated (7.0/3.0)


=> HF = Treated (6.0/2.0)

=> HF = Not_Treated (179.0/8.0)

Hypertension (HT) acc: 71.3% P: 0.71 R: 1.0

=> HT=Treated (192.0/55.0)
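The relationship between a rule's counts and the reported metrics can be checked with a short calculation. The sketch below applies the standard definitions of accuracy, precision and recall to the single default Hypertension rule (192 records covered, 55 exceptions), treating "Treated" as the positive class. It evaluates the rule on the full dataset, so it is only an approximation of the cross-validated figures reported above.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# The default rule predicts "Treated" for every record, so there are
# no negative predictions: fn = tn = 0.
covered, exceptions = 192, 55
acc, p, r = metrics(tp=covered - exceptions, fp=exceptions, fn=0, tn=0)
print(f"acc={acc:.3f} P={p:.2f} R={r:.1f}")
```

A recall of 1.0 paired with accuracy equal to the class prior is the signature of a classifier that has learned only the majority class.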

4 Discussion

Diabetes: All six medications discovered by JRip are in a may_treat relationship to diabetes in the NDF-RT, and only these six drugs from our dataset have this relationship in the NDF-RT, indicating that our rule discovery was comprehensive (see Table 2). Additionally, JRip provides a ranking of medications through the order in which the rules are generated. With the exception of METFORMIN, which has become the “drug of choice” in recent years but was not present in our data (since it was not available in the United States until 1995), the medication rules were discovered in the same order as actual diabetes treatment recommendations3, indicating that we are able to detect and codify quite detailed usage patterns.

Hypercholesterolemia: As in the Diabetes case, JRip was able to discover all four medications linked to Hypercholesterolemia in the NDF-RT. However, unlike the diabetes case, the first medication discovered (gemfibrozil) is primarily used to treat hyperlipoproteinemia and is only a second-line therapy for Hypercholesterolemia according to DrugBank (Wishart S. D. et al., 2007).
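The comparison against NDF-RT amounts to a set comparison between the medications appearing in learned rules and those carrying a may_treat link to the condition. A minimal sketch is given below; the function and key names are assumptions, and the reference set is simply the four Hypercholesterolemia medications that the text above states are linked in the NDF-RT.

```python
def compare_to_reference(discovered, reference):
    """Compare rule-derived medications with a reference may_treat set."""
    discovered, reference = set(discovered), set(reference)
    matched = discovered & reference
    return {
        "matched": sorted(matched),
        "missed": sorted(reference - discovered),       # in reference only
        "unsupported": sorted(discovered - reference),  # in rules only
        "precision": len(matched) / len(discovered) if discovered else 0.0,
        "recall": len(matched) / len(reference) if reference else 0.0,
    }

learned = ["GEMFIBROZIL", "LOVASTATIN", "CHOLESTYRAMINERESIN", "PROBUCOL"]
ndfrt_may_treat = {"GEMFIBROZIL", "LOVASTATIN", "CHOLESTYRAMINERESIN", "PROBUCOL"}
result = compare_to_reference(learned, ndfrt_may_treat)
print(result["precision"], result["recall"], result["missed"])
```

The "missed" list is the part of most interest to a curator, since it separates reference axioms that the data does not support from those that were simply too rare to be learned.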

Table 2.
An abbreviated, exemplar table of medications and diseases-of-interest compared by the presence (x) / absence (-) of a may_treat association in the NDF-RT

Stroke: JRip discovered both drugs used to treat stroke, in order of their recommendation in existing guidelines. Warfarin, which is an oral anticoagulant, has been the first medication of choice for stroke treatment (and prevention) for over 50 years (Ruff C. T. et al., 2010).

Heart Failure: The rules discovered by JRip for HF have a different form from the previously described rules in that their antecedents are each composed of a logical conjunction of two constraints. This is due to JRip’s parsimony in rule selection, which finds the most comprehensive conjunctive rule. Though the translation of these rules into axioms is not as straightforward as in the previous cases, all four of the medications identified by JRip are in fact used to treat chronic heart failure. However, there are 19 medications (not all of them shown in Table 2) in our dataset that are linked to the condition in the NDF-RT. It is likely that JRip was not able to find rules for those drugs due to the scarcity of records annotated as “treated for HF” (14 records, ~7% of the dataset) compared to the large diversity of medications that are used for HF treatment.

Hypertension (HT): JRip was unable to find any interesting pattern in the data for HT. In contrast to the missing rules for HF, the missing rules for HT are almost certainly the result of a disproportionately high fraction of records annotated as hypertensive (71% of the records) together with the large number of medications used for the condition (16 medications), which makes any detected pattern more likely to occur by chance. The wide variety of drugs prescribed may be due to the maze of contra-indications, co-morbidities, and physician preferences, making it difficult to achieve “signal” over the noise of real clinical data. This problem might be resolved by using a larger dataset or by reducing the number of co-dependencies in the data; i.e., building a dataset focused specifically on the hypertension phenotype.

5 Conclusion and Future work

Data-Driven Knowledge Discovery: This pilot study demonstrates how legacy patient data can be mined to discover interesting patterns that mirror the expert knowledge encoded in modern clinical ontologies. We propose that these patterns can, therefore, facilitate semi-automatic knowledge acquisition and ontology building by automatically identifying rules that could be used to “bootstrap” formal ontological axioms where none exist. The overall consistency of our results in matching the NDF-RT may_treat property suggests that knowledge-bases constructed from these rules would have a relatively high level of accuracy compared to those constructed manually, thus providing an inexpensive and easily constructed ontological “scaffold” which can then be manually edited with tools such as those provided by the CELOE project. While in this case our results represent, in a way, a “self-fulfilling prophecy” (since the clinicians whose data we are examining were following the rules that we are trying to automatically discover), this will not always be the case. There is a wide variety of datasets, both clinical and otherwise, for which rules are not known a priori (for example, consider the extension of the analysis to non-chronic and uncommon conditions). This study demonstrates that we can, with considerable precision, automatically generate at least a set of template classification rules for such datasets.

Potential for Improvement of NDF-RT: The JRip algorithm generated rules that can be used to rank medications in terms of their relevance to specific conditions. This pilot study was not done on a large enough dataset to be conclusive; however, we provide some preliminary evidence that properties such as may_treat can be further refined into more granular formal-logic properties (e.g. “optimal_treatment_for”), or could be used to generate more statistical (non-DL) classification rules. We acknowledge, however, that rating medications is highly subjective and, given clinician-to-clinician variations and “habits”, may be difficult to determine with any accuracy4. Nevertheless, by combining rankings from a large number of diverse datasets, a better measure of a given drug’s efficacy may be obtained. This observation is important in the context of ontology maintenance and evaluation, since it provides an approach to measuring the “correctness” or “precision” of an ontology relative to any given dataset by simply comparing the automatically generated rules to the ontologically defined rules. Using such data-driven approaches, an ontology curator would be able to evaluate how accurately an ontology represents, for example, a newly acquired dataset, or to evaluate the accuracy and precision of their ontological knowledge representation as the underlying dataset grows and changes over time, a common scenario in the life sciences.


This work is part of the CardioSHARE initiative, funded through a Special Initiatives Award from the Heart and Stroke Foundation of British Columbia and Yukon, with subsequent funding from Microsoft Research and an Operating Grant from the Canadian Institutes for Health Research (CIHR). BMG is funded by the NIH Gene Wiki grant (GM-089820).


Acierno J. L. (1994) The history of cardiology: Men, ideas and contributions, Informa Health Care, 758pp.

Bodenreider O. (2008) Biomedical Ontologies in Action: Role in Knowledge Management, Data Integration and Decision Support, Yearbook of Medical Informatics, 67–79.

Buitelaar P., Magnini B. (2005) Ontology learning from text: Methods, evaluation and applications, Vol. 123, Frontiers in Artificial Intelligence and Applications, IOS Press.

Good B. M. (2009) Strategies for amassing, characterizing, and applying third-party metadata in bioinformatics, Ph.D. Thesis.

Lehmann J., Auer S., Bühmann L. et al. (2011) Class expression learning for ontology engineering, Web Semantics: Science, Services and Agents on the World Wide Web, 9:1, 71-81.

Lincoln M. J., Brown S. H., Nguyen V., et al. (2004) U.S. Department of Veterans Affairs Enterprise Reference Terminology strategic overview, Studies in Health Technology and Informatics, 107, 391-395.

Peters L. and Bodenreider O. (2008) Using the RxNorm web services API for quality assurance purposes, AMIA Annual Symposium Proceedings, 591-595.

Ruff C. T., Braunwald E. (2010) Will Warfarin ever be replaced, Cardiovascular Pharmacology and Therapeutics, 15, 210-219

Wishart S. D. et al. (2007) DrugBank: a knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Research, 36 (suppl. 1): D901-D906.

Witten I. H., Frank E., and Hall M. A. (2011) Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3rd edition.


*To whom correspondence should be addressed.

1 ftp://ftp1.nci.nih.gov/pub/cacore/EVS/NDF-RT/

2 http://rxnav.nlm.nih.gov/

3 http://www.drugs.com/condition/diabetes-mellitus-type-ii.html

4 http://www.drugs.com/members_comments_add.php?ddc_id=3280&brand_name_id=0&condition_id=432
