Automated Annotation of Abstracts for Cognitive Experiments

1 Motivation

This work explores the automatic annotation of fMRI studies with standard terms from the Cognitive Paradigm Ontology (CogPO). We apply text-mining and stochastic modeling approaches to a corpus of neuroscience literature to determine appropriate annotations for each study.

BrainMap (www.brainmap.org), one of the largest databases of neuroimaging studies, provides toolsets for meta-analyses of the brain functions underlying disorders such as schizophrenia, bipolar disorder, depression, and autism (Laird et al 2005). These analyses are facilitated by the Cognitive Paradigm Ontology (CogPO), a taxonomy-based schema that links similar studies in terms of the experimental questions addressed, imaging methods used, behavioral conditions in play, and statistical contrasts performed, despite the use of differing vocabularies (Turner & Laird 2012). The current bottleneck in incorporating a significant fraction of the existing literature into the ontology is the time and effort required to curate each paper by hand.

We present automated techniques to complement the human annotation process. Existing automated annotation methods do not attempt to reproduce an expert's application of ontological terms on a curated dataset.

Authors

Chayan Chakrabarti 1, George F. Luger 1, Angie R. Laird 2, and Jessica A. Turner 3*

1 Department of Computer Science, University of New Mexico, Albuquerque, NM.

2 Department of Radiology, University of Texas Health Sciences Center, San Antonio, TX.

3 Mind Research Network, Albuquerque, NM.

2 Text mining approach

Our corpus contains 327 paper abstracts related to fMRI attention studies, each annotated with an associated stimulus, response, and instructions. We created a 2,241-word dictionary of all discriminative words in the corpus by keeping words present in standard English-language and medical dictionaries and removing standard stop words. We used the Porter stemming algorithm to reduce each word to its root form (Porter 1980). Each abstract is then encoded as a 2,241-element vector, where each element represents the frequency of the corresponding word in the abstract, and is thus represented as a point in a 2,241-dimensional space.
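The encoding pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop list is abbreviated, and the `stem` function is a crude suffix-stripping stand-in for the full Porter stemmer.

```python
from collections import Counter

# Abbreviated stop list for illustration; the real pipeline removes all standard stop words.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "with", "on"}

def stem(word):
    # Stand-in for the Porter stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Lowercase, drop non-alphabetic tokens and stop words, then stem.
    return [stem(w) for w in text.lower().split()
            if w.isalpha() and w not in STOP_WORDS]

def build_vocabulary(abstracts):
    # Sorted list of distinct stemmed terms across the corpus (the "dictionary").
    return sorted({tok for text in abstracts for tok in tokenize(text)})

def encode(text, vocab):
    # Frequency vector over the dictionary; one element per vocabulary word.
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]
```

In the actual corpus the vocabulary has 2,241 entries, so each abstract becomes a 2,241-element frequency vector.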

We hypothesize that abstracts in close proximity have similar content and hence should be annotated with the same terms. We use k-nearest neighbor (kNN), a non-parametric lazy learning algorithm, to determine proximity in terms of Euclidean distance (Duda & Hart, 2001). A random 2:1 split served as our training and generalization sets, and we experimented with k = 1, 5, 10, 20, 50, and 100. The predicted annotations were compared against the actual annotations, and the results were averaged over 10,000 independent runs.
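The kNN prediction step reduces to a distance computation and a majority vote. A minimal sketch, assuming abstracts have already been encoded as frequency vectors and labeled with their annotation:

```python
import math
from collections import Counter

def euclidean(u, v):
    # Euclidean distance between two frequency vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(query, training, k):
    # training: list of (vector, label) pairs from the annotated corpus.
    nearest = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    # Predicted annotation is the majority label among the k nearest neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Ties in the vote are broken arbitrarily here; the original experiments may handle them differently.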

Table 1: Results of the k-nearest neighbor algorithm for automatic annotation of abstracts using 2-fold cross-validation.

Value of k | Annotation accuracy
-----------|--------------------
     1     |       29.67%
     5     |       43.61%
    10     |       52.94%
    20     |       53.11%
    50     |       47.22%
   100     |       30.06%

We restricted our experiments to the five most common stimulus types, each occurring more than 15 times in the corpus: Words, Shapes, Pictures, Letters, and Digits. The performance of kNN was better than random guessing, and hence this method can be used to guide the human expert.

3 Stochastic approach

We approximate the decision-making process of a human expert hand-curating a set of abstracts from our corpus. We conducted a live annotation session in which the expert talked us through her decision-making process. We observed that the expert looks for discriminative words that give clues to the class of stimulus and response annotations. This suggests a hierarchical decision-making process, in which observing certain keywords guides her to seek further discriminative keywords that successively narrow her search to the correct annotation. We are modeling this process with a stochastic decision tree, where the class of annotation keywords at each level of the hierarchy is determined by naïve-Bayes classification.

References

Duda, R.O., Hart, P.E. (2001) Pattern Classification. 2nd Edition. Wiley-Interscience.

Laird AR, Lancaster JL, Fox PT. (2005) BrainMap: the social evolution of a human brain mapping database. Neuroinformatics. 3(1):65-78.

Turner JA, Laird AR. (2012) The cognitive paradigm ontology: design and application. Neuroinformatics. 10(1):57-66.

Porter, M.F. (1980) An algorithm for suffix stripping. Program. 14(3):130-137.