BioSearch08: Program and Abstracts

Abstracts

Keynote Presentation 1

Dina Demner-Fushman

Information retrieval and Natural Language Processing for Clinical Decision Support

Information retrieval and natural language processing methods are instrumental in enhancing healthcare by providing clinicians, patients and other involved individuals with knowledge and person-specific information presented at appropriate times. Some of the specific challenges of Clinical Decision Support (CDS) are: using free-text information to drive CDS, representing clinical knowledge and CDS interventions in standardized formats, and leveraging the data available in Electronic Health Records (EHRs), which often contain narrative healthcare data.

This talk will present research on several aspects of the CDS challenges: developing strategies for automatic question and query formulation using information extracted from clinical narratives; finding adequate evidence and extracting answers to clinical and translational research questions; and retrieving images to illustrate evidence.

Dr. Dina Demner-Fushman is a Staff Scientist for the Communications Engineering Branch at the National Library of Medicine. She conducts research in clinical decision support, clinical question answering, the use of natural language processing in information retrieval, human-computer interaction aspects of information retrieval, and information retrieval in the biomedical domain. Her interest in biomedical language processing stems from years of clinical practice (M.D. obtained from Kazan State Medical Institute in 1980) and clinical research (Doctorate (Ph.D.) in Medical Science earned from Moscow Medical and Stomatological Institute in 1989). She earned her MS and PhD in Computer Science from the University of Maryland, College Park in 2003 and 2006, respectively.

Short Presentations - Session 1

Alexander Krumpholz

Improving age-specific PubMed search

PubMed allows age-related search via a filter, which lets the user specify age-related Medical Subject Heading (MeSH) terms. Medline abstracts often contain age-related terms that specify a narrower age range than the available MeSH terms. For example, the closest matching age-related MeSH term for a 104-year-old person is "aged, 80 and over", yet over 250 publications contain the term "centenarian" (a person between 100 and 109 years of age).

In order to retrieve publications that best match the case of a particular patient, we use the patient's age and map it to its related age-terms in analogy to fuzzy membership functions. We use the degree of membership as term weights to rank the closer matching publications higher.
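
The weighting idea can be illustrated with a minimal sketch. It is not the author's implementation: the trapezoidal membership shape, the age boundaries, the term list (including "nonagenarian") and the function names `trapezoid`, `term_weights` and `score` are all assumptions made for illustration.

```python
# Hypothetical age terms with soft boundaries (a, b, c, d) in years:
# membership rises from 0 at a to 1 at b, stays 1 until c, falls to 0 at d.
AGE_TERMS = {
    "aged": (60, 65, 79, 84),
    "aged, 80 and over": (75, 80, 150, 151),
    "nonagenarian": (85, 90, 99, 104),
    "centenarian": (95, 100, 109, 114),
}


def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership of x for the boundaries a < b <= c < d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def term_weights(age):
    """Map a patient's age to its non-zero age-term weights."""
    weights = {}
    for term, bounds in AGE_TERMS.items():
        w = trapezoid(age, *bounds)
        if w > 0:
            weights[term] = w
    return weights


def score(doc_terms, weights):
    """Sum the weights of the age terms that occur in a document."""
    return sum(weights.get(term, 0.0) for term in doc_terms)


weights = term_weights(97)
print(weights)
print(score({"centenarian"}, weights))
```

Under these assumed boundaries, a 97-year-old patient fully matches "aged, 80 and over" and "nonagenarian" and partially matches "centenarian", so documents mentioning the closer terms score higher.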

Jon Patrick

Question Answering from Clinical Information Systems

Clinical Information Systems typically have no search capability over the narrative notes that staff write about their patients. This creates the opportunity to invent useful language technologies that serve the various needs of staff for both daily operational work and research. The notes collected in the course of the care of a patient are important to the on-going care of that patient. Each day clinical staff need to ask questions of the patient record so that they can check the care administered by other staff. Resolving such questions has to run the gamut of poor spelling, poor grammar, technical terminology with variable morphology and phrasal structure, neologisms, and homespun abbreviations and mnemonics. Current research has concentrated on two related technologies. The first is information extraction for the routine process of daily care, provided by a Smart Notes tool. The second is a clinical data analytics language, CliniDAL, which is intended to allow expression of all questions that can be answered from the contents of the clinical database, and to compute the answers to those questions. CliniDAL has the following features:

  • Question formation using a controlled natural language,
  • Allowable usage of local sociolect terminology,
  • Retrieval of all components of the clinical record and search of text components in any combination,
  • Formation and evaluation of statistical hypotheses using any retrievable content.

Stephen Wan

The Information Needs of Academic Researchers in the Wild: A Preliminary Study

With the near-exponential growth in the available academic literature, staying up-to-date with the latest advances in research is a challenging task. In this paper, we describe the design and development of smarter tools to support researchers in navigating the literature and deciding whether something is relevant or not. Our work is directed by a preliminary user study which ascertains the information needs of academic researchers (primarily in the bio-medical science domain) as they read a publication. We are interested in the kinds of information needs that users self-report when engaging in tasks that require further reading of bio-medical literature. We use the results of this study to develop a next-generation Aggregated Search system which includes query- and task-focused automatic text summarisation capabilities. We present our analysis of the user study and outline the design implications for our research and development.

Wei Liu

Ontology, Text Mining and a Proposed Application in BioInformatics

Most current ontology management systems concentrate on detecting usage-driven changes and representing changes formally in order to maintain consistency. In this work, we present a semi-automatic approach for measuring and visualising data-driven changes through ontology learning. Terms are first generated by an ontology learning module using text mining techniques, and then classified automatically into clusters. The clusters are then named manually, which is the only manual process in this system. Each cluster is considered a sub-concept of the root concept, and thus one dimension of the feature space describing the root concept. The changes of terms in each cluster contribute to the change of the root concept. Web documents are collected at different time periods and fed into the system to generate a different version of the same ontology for each time period. The paper presents several ways of visualising and analysing the changes. Initial experiments on online media data have demonstrated the promising capabilities of our system. BioInformatics is a much better domain for detecting changes in concepts, as it is highly dynamic. The purpose of attending this workshop is to learn more about the BioInformatics domain and see how our work can be applied.

Student Presentations - Session 1

Pooyan Asgari

Identifying concepts in a noise-prone environment: Looking up Obesity and its 15 co-morbidities in patient discharge summaries

Developing a tool for identifying clinical terms and concepts within a noise-prone collection of clinical notes has its own requirements and issues. The specific nature of a noisy data collection raises at least two major issues. The first issue comes from a scattered matrix of evidence for a specific concept, which nevertheless shares many common, and thereby confounding, attributes with other concepts. Considering more features or patterns in the hope of covering rarer situations may lead the system to absorb more noise, which affects the identification of other major terms and concepts and therefore the overall performance of the system. The second issue comes from the nature of the data collection and the process necessary for gathering evidence about the existence or absence of a specific concept. Assuming four possible answers for a search concept, namely Exists, Not Exists, Questionable and Unmentioned, biases the decision algorithm towards the two more frequent classes, Unmentioned and Exists, which have unlike characteristics. The Unmentioned label has to be identified based on a lack of evidence for a given search concept, while an Exists label should only be assigned in the presence of a clear indication of a given concept. Adding more features to the feature list in the machine learner leads the system to a better and more confident classification for the Exists class, but at the same time may lead to inaccurate results for the Unmentioned answer due to an increase in the level of noise. We designed a customized system to address the challenge common to both issues, which is noise reduction. Using a mixture of rules, different language processing techniques, a decision tree classifier and some innovative solutions, a system was developed specifically for these types of noise-prone corpora.

We kept the number of features to monitor as low as possible, based on the proposition that concepts are best defined by a few features and that many features would add noise to the classifier. In a second stage, a noise reduction algorithm that filters suspicious noisy features was applied to the dataset to suppress possible noise. The primary goal was to evaluate the proposed approach on a noise-prone collection of 724 discharge summaries. Evaluation was performed against human performance as a gold standard, with precision and recall of 0.969 and 0.969 respectively.

Andrew MacKinlay

Information Extraction over Diverse Domains using Deep Parsing Techniques

We present a preliminary system for evaluating semantic similarity of documents using machine learning techniques over diverse genres -- specifically online technical forums and biomedical abstracts. The gold-standard similarity judgements are in some cases hand-annotated but we also present an automated method for determining semantic similarity over the GENIA event annotation document set. The present system uses fairly naive feature vectors based on applying transformations on the bag-of-words statistics for the documents inspired by well-known metrics such as TF-IDF and skew divergence, but we plan to add more sophisticated features based on domain-specific named entity recognition as well as the outputs of shallow and deep parsing.
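
As an illustration of one of the metrics named above, the following minimal sketch computes the skew divergence between the bag-of-words distributions of two documents. It is not the authors' feature extraction: whitespace tokenisation, the function names `unigram_dist` and `skew_divergence`, and the default `alpha=0.99` are assumptions. The skew divergence is taken here as KL(r || alpha*q + (1 - alpha)*r), which stays finite even when a word of r is missing from q.

```python
import math
from collections import Counter


def unigram_dist(tokens):
    """Relative word frequencies (a bag-of-words distribution)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def skew_divergence(q, r, alpha=0.99):
    """KL divergence of r from the mixture alpha*q + (1 - alpha)*r."""
    total = 0.0
    for word, r_w in r.items():
        # The mixture is strictly positive wherever r is, because alpha < 1.
        mix = alpha * q.get(word, 0.0) + (1 - alpha) * r_w
        total += r_w * math.log(r_w / mix)
    return total


abstract = unigram_dist("the protein binds the receptor".split())
forum_post = unigram_dist("the printer driver crashes on startup".split())
print(skew_divergence(abstract, forum_post))
print(skew_divergence(abstract, abstract))
```

A lower value indicates more similar word distributions; the value for a document compared with itself is zero up to rounding error.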

Juana Maria Ruiz-Martinez

Learning non-taxonomic relationships in Biomedical Domain

Semantic technologies are becoming more and more important in biomedical domains. Ontologies provide vocabulary standardization, allow for reasoning mechanisms and support semantic interoperability between computer systems or between experts and computer systems, all of which are essential for tackling the problem of information overload in biomedicine. However, the construction and updating of biomedical ontologies is problematic, since it is a time- and resource-consuming task. Textual knowledge acquisition from electronically accessible bio-literature has therefore become an important application area for creating and managing biomedical ontologies automatically (ontology learning). However, an important drawback of most existing approaches is that they are only capable of extracting taxonomies or a very reduced set of relations. With the aim of overcoming these limitations, a set of semantic relationships compatible with OBO (Open Biomedical Ontologies) has been proposed. By means of information extraction techniques, noun phrase candidates which can form part of a relationship are identified. Verbs, which this approach considers the key to identifying non-taxonomic relationships between concepts, are also identified. This can be combined with an MCRDR (Multiple Classification Ripple Down Rules) module by which new relationships are proposed automatically according to the stored relationships. This module could serve as a semi-automatic aid for an expert validating the acquired relationships and candidates.

Mojtaba Sabbagh-Jafari

Automated De-identification of the Clinical Documents

Removing protected health information (PHI) from clinical documents is a required task that should be done before the documents can be used for research or by other text processing systems. If this process is performed manually, it is tedious and prone to error, so computer support is valuable.

The purpose of this system is to find PHI objects in free-text clinical documents and replace them with proper surrogate information, so that the documents retain their interpretability and usefulness for research. The de-identification approach uses several gazetteers, regular expressions for pattern matching, heuristic rules and local context features. The gazetteers used to find PHI are lists of proper names, medical terminology, common English words and locations. In many cases there is ambiguity between PHI and non-PHI elements, and some foreign names or misspelt words cannot be recognized. In these cases local context features and heuristic rules help the system to classify correctly. POS and syntactic bigrams are extracted as local context features from words located within a window of one or two words of the target word.
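
The combination of a gazetteer, a regular expression and a local context rule can be sketched as follows. This is a toy illustration, not the system described: the gazetteer entries, the title list, the date pattern, the fixed surrogate values and the function name `deidentify` are all assumptions, and the context rule looks only at the preceding token rather than at POS or syntactic bigrams.

```python
import re

# Hypothetical resources; a real system would load much larger lists.
NAME_GAZETTEER = {"smith", "jones", "may"}   # surnames, some ambiguous
TITLES = {"dr", "mr", "mrs", "ms"}           # local context cue for names
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
SURROGATE_NAME = "Doe"
SURROGATE_DATE = "01/01/2000"


def deidentify(text):
    """Replace dates and title-preceded gazetteer names with surrogates.

    Whitespace is normalised to single spaces as a side effect.
    """
    text = DATE_RE.sub(SURROGATE_DATE, text)
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        core = tok.strip(".,;:")
        prev = tokens[i - 1].strip(".,;:").lower() if i > 0 else ""
        in_gazetteer = core.lower() in NAME_GAZETTEER
        # Heuristic: an ambiguous word such as "May" counts as a name only
        # when it is capitalised and directly follows a title.
        if in_gazetteer and core[:1].isupper() and prev in TITLES:
            out.append(tok.replace(core, SURROGATE_NAME))
        else:
            out.append(tok)
    return " ".join(out)


print(deidentify("Dr. May saw the patient on 12/03/2008; she may return."))
```

In this sketch the capitalised "May" after "Dr." is replaced while the modal verb "may" is left alone, which mirrors the PHI versus non-PHI ambiguity described above.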

Keynote Presentation 2

Limsoon Wong

Guilt by Association as a Search Principle

The exploitation of fundamental invariants is among the most elegant solutions to many computational problems in a wide variety of domains. One of the more powerful approaches to exploit invariants is the principle of "guilt by association". In particular, the principle of guilt by association is the foundation of remote homolog detection, protein function prediction, disease subtype diagnosis, treatment plan prognosis, and other challenges in computational biology. The principle suggests that two entities are in a specific relationship if they exhibit invariant properties underlying that relationship. For example, a protein is predicted to have a particular biological function if it exhibits the underlying invariant properties of that functional group --- viz., guilty by association to other members of that functional group through the shared invariant properties.

In my talk, I plan to present several facets of guilt by association in the computational prediction of protein function and draw parallels to these facets in information retrieval. Specifically, I plan to touch on the following facets: (a) the issue of chance associations; (b) novel generalizable forms of association; (c) fusion of multiple heterogeneous sources of evidence; (d) the dichotomy of knowing to a high degree of reliability that two entities are in some relationship and yet not knowing what that relationship is. I hope this talk will be, for the information retrieval community, a window to the opportunities in computational biology that may benefit from the depth and variety of solutions information retrieval has to offer.

Limsoon Wong is Professor and Head of Computer Science and Professor of Pathology at the National University of Singapore. He currently works mostly on knowledge discovery technologies and is especially interested in their application to biomedicine. He has written about 150 research papers, a few of which are among the best cited of their respective fields. He serves on the editorial boards of Journal of Bioinformatics and Computational Biology (ICP), Bioinformatics (OUP), and Drug Discovery Today (Elsevier). He is chairman of Molecular Connections and scientific advisor to CellSafe International. Limsoon received his BSc(Eng) from Imperial College London and his PhD from University of Pennsylvania.

Short Presentations - Session 2

Peter Ansell

Bio2RDF: Providing named entity based search with a common biological database naming scheme

The Bio2RDF project provides effective cross-database biomedical search functionality through the use of a common representation format, RDF, and a common query mechanism, HTTP. Although we provide mostly biological databases, we also provide DBpedia, the RDF form of Wikipedia, and, in the future, WordNet, to provide complete throughput from vocabularies to biological databases. We focus on the linked database aspects, although basic text searches are supported. Text mining on data sources included in Bio2RDF, such as PubMed, can be used together with current knowledge about the links between biological databases both to enrich the text-mining process and to make the text search results applicable in a larger context. In addition, the results of dynamic searches can be stored and tagged, to be included as part of a dynamic data source for future Bio2RDF users.

Peter Budd

A Taxonomy of Terminology Server Desiderata

The use of terminology servers in the health domain will lead to the standardisation of the organisation of medical thesauri, terminologies, ontologies and classifications (TTOCs). By using mappings between TTOCs, users will be able to search the semantic content of medical files using their own TTOC, and still have the meaning of their search terms preserved across TTOCs and by implication across the clinical information systems that use the TTOCs.

The actual use of terminology servers in the field, however, is sporadic at best, and the functionality is usually implemented within an individual clinical information system, leading to inconsistency in record keeping and data representations. This research canvasses the literature from more than a decade of research into the problem. Our research defines the role of a terminology server and details the desiderata for the use of a terminology server. The limits of these desiderata are discussed and a functional taxonomy is produced that specifies the features a terminology server must possess to provide for indexation, storage and retrieval of medical concepts based on semantic rather than lexical features. A prototype implementation of a terminology server built on these desiderata has been produced by the Health Information Technology Research Laboratory at the University of Sydney and currently serves numerous applications, including the GCIMS project, a generic ontology viewer, a ward round information system, a clinical data analytics engine, and an automated medical concept identification engine for use on text.

Sarvnaz Karimi

Ranked Search for Medical Systematic Reviews

Searching for and selecting articles to be included in systematic reviews is a real challenge for healthcare agencies responsible for publishing these reviews. The current practice of manually reviewing all papers returned by complex hand-crafted boolean queries is labour-intensive and difficult to maintain. We demonstrate a search system that takes advantage of ranked queries to assist in the retrieval of relevant articles, and to restrict results to higher-quality documents.

David Martinez

Using Ranked Search Strategies in Combination with Supervised Text Classification

One of the goals of the BioTALA (Biomedical Text And Language Applications) project is to address the search needs for building systematic clinical reviews for medicine, a growing area that can benefit the way medical treatments are applied throughout the world. This problem is especially difficult to solve with standard search strategies, because of the very high recall required for the medical research questions. This results in complex boolean queries that are time-consuming to produce and difficult to maintain. Our approach is to rely on ranked search strategies in combination with supervised text classification. We present our initial results over systematic reviews from the Agency for Healthcare Research and Quality (AHRQ), showing that our system can significantly contribute to the state of the art.

Student Presentations - Session 2

M. Asif Khawaja and Fang Chen

Analysis of Bushfire Personnel’s Speech Transcriptions for Linguistic Cues of Cognitive Load

In complex, time-critical and data-intense situations, users of a system can experience extremely high cognitive demands on their limited working memory, which can interfere with their ability to perform and complete the task at hand efficiently. Intelligent adaptive user interface systems which are aware of the users’ current level of cognitive load could, in fact, alleviate these problems by implementing strategies to adjust the behavior, support, user interaction material, and resources needed according to the users’ current cognitive burden, to help them complete the task effectively.

Our study presents a speech content analysis approach to the measurement of cognitive load which employs users’ linguistic features of speech to determine their experienced level of cognitive load. We present the detailed analyses of several linguistic features extracted from the live speech data collected from the subjects, the members of a bushfire incident management team, involved in highly time-critical and data-intense bushfire management tasks around Australia. We discuss the results for nine selected linguistic features showing significant differences between the speech from the low load tasks and the high load tasks.

Despite the fact that the study focuses on bushfire operators’ speech transcriptions, we believe that the proposed method can be used with any clinical or medical transcriptions of patients’ speech for the purpose of cognitive load measurement of those patients in order to help the clinicians and/or doctors better understand the mental state of the patients.

Stefan Pohl

Query Processing in Biomedical Search

Query processing becomes costly when large collections are involved or when long, complex queries are to be answered. Biomedical search has to deal with both, because high recall requirements lead to long, expanded queries and medical publication archives are ever growing. Recent trends in computer architecture are ambivalent. In 64-bit architectures, large amounts of memory become available, so that more data can readily be held in main memory. This shifts query processing costs from being dominated by disk accesses to being dominated by memory accesses and computation. At the same time, processors have stopped becoming faster. Instead, more of them are now available, and new ways have to be found to use them efficiently in order to reduce query processing times.

Willy Yap

Relation extraction for biomedical text

Relation extraction is a sub-task of Information Extraction (IE) that is concerned with extracting semantic relations between word pairs based on corpus data. Past work on relation extraction has concentrated on creating a small set of patterns that are good indicators of whether a given word pair contains a semantic relation. In recent years, there has been work on using machine learning to automatically learn these patterns from English corpus text. We build on this research by applying a generic relation extraction algorithm to the biomedical domain. However, instead of extracting word pairs with semantic relations, as has already been done for English corpora, we are interested in extracting the interactions between proteins in biomedical documents.

Sun Xiaoxun

Toward Privacy Preserving Microdata Publication

High-quality and useful knowledge is to be found in the integrated data from various organizations, and the discovered knowledge is essential for building intelligent systems such as business analysis and health surveillance. However, concern about breaching privacy is a major obstacle to this process. This project aims to develop new efficient and effective techniques for privacy protection in data sharing and data mining by combining techniques from data mining and security research. We focus primarily on notions of anonymity that are defined with respect to individual identity, or with respect to the value of a sensitive attribute. Our goal is to propose a variety of techniques to anonymize original data sets while preserving the utility of the input data. We use extensive evaluations to show that it is possible to distribute high-quality data that respects several meaningful notions of privacy. Further, it is possible to do this efficiently for large transactional data sets. The techniques developed will advance and facilitate data mining within many organizations and businesses and lead to better utilization of information.

Posters Only

Stefan Schaefer

Context Analysis in Clinical Environments using Natural Language Processing

Medical records comprise a variety of detailed documents written in natural language, such as clinical case studies, patient profiles and treatment reports. Extracting the clinical information from these documents is crucial, as this makes clinical data fit for automated processing. This poster introduces a new concept of clinical contexts and proposes a new approach to context analysis in clinical environments which allows information retrieval from clinical documents.

Yeondae Kwon

A Proposal of a Ranking Method Based on Specificity of Biological Terms

There is considerable interest in extracting associations between diseases and genes from literature such as MEDLINE abstracts. For a given disease, a search engine returns a ranked list of candidate genes according to some criterion. In this research, we propose a ranking algorithm that focuses on specificity to a particular disease. A specificity-based ranking method should rank at the top a gene that causes a given disease but does not cause other diseases. This is important for drug development because users can quickly find relevant genes that do not have side effects. We describe a specificity-based baseline algorithm using term dictionaries and co-occurrence data of terms in MEDLINE abstracts, and discuss future directions.
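
One simple way to make the notion of specificity concrete is sketched below. It is not the authors' baseline algorithm: the scoring formula (the share of a gene's co-occurrences that involve the query disease), the function name `specificity_ranking` and the gene names and counts are assumptions chosen for illustration.

```python
from collections import defaultdict


def specificity_ranking(cooccurrence, disease):
    """Rank genes by the share of their co-occurrences that involve `disease`.

    `cooccurrence` maps (gene, disease) pairs to abstract counts.
    """
    gene_totals = defaultdict(int)
    for (gene, _), count in cooccurrence.items():
        gene_totals[gene] += count
    scores = {
        gene: count / gene_totals[gene]
        for (gene, d), count in cooccurrence.items()
        if d == disease and count > 0
    }
    # Highest specificity first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up co-occurrence counts for two hypothetical genes.
counts = {
    ("GENE_A", "asthma"): 10,
    ("GENE_A", "diabetes"): 30,
    ("GENE_B", "asthma"): 4,
}
print(specificity_ranking(counts, "asthma"))
```

With these made-up counts, GENE_B ranks above GENE_A for asthma even though it co-occurs with asthma less often, because all of its co-occurrences are with that one disease.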

Anthony Nguyen

Cancer Stage Classification from Free Text Medical Reports using Ontologies and Machine Learning

Cancer staging is the process of classifying the extent of the primary tumour and metastatic spread to other parts of the body using the TNM (Tumour-Nodes-Metastasis) standard. This process is conducted through a multidisciplinary team (MDT) conference, which is time- and resource-intensive. As a result, stage data is not routinely collected. Tools to retrospectively collect stage data are therefore needed to fill the gaps in cancer stage data collection.

This poster presents the use of SNOMED CT or UMLS SPECIALIST Lexicon ontologies and Support Vector Machines (SVM) for the automatic classification of cancer stages from free text medical reports. Preliminary experiments on the classification of a clinical M (Metastasis) stage for lung cancer patients by analysing their free text radiology reports have achieved promising results with sensitivity-specificity (SS) break-even points of approximately 0.89, area under the SS curves of 0.95, and precisions of approximately 0.70.

Yefeng Wang

Extracting and representing clinical knowledge using SNOMED CT

Automatic indexing of clinical concepts in free text patient records using a standard medical terminology will enhance semantic retrieval, which can then be used for important applications such as decision support and disease outbreak detection. SNOMED CT is a rich terminology that provides standardisation of knowledge and language in the clinical domain. Two important challenges are identifying the concepts in clinical reports and then using the identified concepts to construct an integrated representation of the patient case. Although most clinical words found in the patient notes are present in the terminology, the rich set of relationships between the words and concepts cannot be fully represented. Lexical and concept verification is error-prone due to the variance of the clinical language used in different hospital departments and the ungrammatical nature of the narrative reports. To integrate the recognised concepts, extensions need to be made to the ontology, or it needs to be placed in a wider ontological model, to fully represent all matters relevant to the patient case.

This research aims to address the concept extraction and concept representation issues by classifying medical concepts into the 17 SNOMED CT semantic categories and representing their relationships using the 60+ SNOMED CT relationship categories. The experiments will be conducted on a subset of a 44-million-token Intensive Care corpus from the Royal Prince Alfred Hospital, Sydney. Through the classification, an extended ontology will be built to represent the ICU terms for use in data retrieval activities.