Textual Entailment Recognition is the task of deciding, given two text
fragments, whether the meaning of one is entailed by the other.
It thus tests the extent to which NLP systems can claim to
"understand" language, by requiring systems that can cope with textual
inference. My current interest in the field, a subtask of the work I am
doing, is the identification and classification of the inference mechanisms
that underlie proofs of the entailment relation between text fragments.
The main goal is to build a system that handles textual inference at a
level exceeding the current state of the art.
Textual entailment is believed to be relevant for Question Answering,
Information Retrieval, and Information Extraction tasks in general. My
intuition is that it is relevant for any application that aims to find
particular types of information in a restricted amount of resources, or to
match pieces of information to each other. It can be used in QA systems when
the assumption that the answer will be presented in a simple form somewhere
in the text does not hold and a more complicated answer search has to be
done. It might be useful in dialog or tutoring systems where there is no
easy way of checking whether a user's answer to a particular question
matches the expected answer.
Research interests pertinent to the workshop theme:
My primary research areas are data clustering and clustering validation. My recent paper (accepted at the Australian AI Conference) discusses a novel technique which provides more intuitive clustering similarity values than traditional methods. Moreover, the technique is suitable for stream data clustering, where clusters may not have overlapping regions.
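For readers unfamiliar with clustering similarity values, the following is a minimal sketch of one traditional measure, the Rand index, which counts the fraction of item pairs on which two clusterings agree. It is offered only as background on the kind of measure being compared against; it is not the technique from the paper mentioned above, and the function name is an assumption made for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs that two clusterings treat the same way.

    A pair counts as an agreement when both clusterings put the two items
    together, or both put them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree about
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Same grouping under different label names counts as identical.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that cuts across the first one agrees on few pairs.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the measure only looks at pairs, it is insensitive to how clusters are named, which is why the first comparison is treated as a perfect match.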
Another main focus of my research is retrieving an alternate clustering given an original clustering. This arises from the fact that there could be several clusterings present within data and current clustering techniques ignore this fact.
Document clustering has been an important component in IR systems over the years and has proved to be effective in helping users to find relevant information more efficiently. My research could add more valuable information to the current state of document clustering and help build the next generation IR system.
Thoughts on issues deemed to be important and potential points of
interaction with other disciplines:
I believe the next generation IR system must incorporate more accurate organization of information instead of a plain ranked list of documents. This makes cluster analysis an extremely valuable tool. However, because clustering suffers from the subjective notion of "similarity" and is highly dependent on the clustering criterion, investigating novel methods to generate highly informative clusters of documents is a critical issue.
Consequently, this requires tighter involvement of NLP, IR and data
mining, where techniques from NLP help IR systems to
retrieve more relevant documents and subsequently help
data mining tasks (i.e. clustering) to generate more
effective sets of clusters.
Research interests: language technology, web and text mining, machine learning
Issues: cross-document indexing, language identification/segmentation in web
documents, information extraction from web documents
Bodo Billerbeck (Sensis), Andrew Turpin and Falk Scholer (RMIT)
Web search engines typically present search results based on the user
query, taking into consideration statistics such as document and term
frequencies. Other statistics that give an indication of the popularity
and content of documents, including anchor text and page rank, are also
of importance.
More explicitly collaborative mechanisms are now emerging.
Assuming no other clues about the user intent are available, ideally an
editor would rank all pages in response to the user query, judging the
relevance of each document.
Similarly useful would be the judgements of other users who have issued
the same query previously; however, users are typically reluctant to
provide a judgement of relevance.
Implicitly, this information can be deduced from the clicking behaviour
of earlier users, collected during their interaction with the search
engine. By now, most engines have collected several years' worth of query
logs and click-through information.
How can feedback of this type be used to attribute
general document quality scores, or to re-rank results for a
particular query?
Many factors other than relevance contribute to a user clicking on a
particular result, for instance a misleading document summary and the
general habit of clicking on higher ranked results rather than those
lower in the rankings.
How can we reliably determine from the clicks whether a page is of high
quality (either in response to a query, or just by itself)?
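One way to make the position-bias concern concrete is sketched below. It assumes a hypothetical click log of (query, document, rank, clicked) records and scores each result by its observed clicks divided by the clicks expected from its rank alone. The log format and function names are illustrative assumptions, not a description of any engine's actual method, and the sketch ignores other confounds such as misleading summaries.

```python
from collections import defaultdict

def position_click_rates(log):
    """Estimate the background click rate at each rank, over all queries."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for _query, _doc, rank, was_clicked in log:
        shown[rank] += 1
        clicked[rank] += int(was_clicked)
    return {rank: clicked[rank] / shown[rank] for rank in shown}

def clicks_over_expected(log):
    """Score each (query, doc) as observed clicks / clicks expected from rank."""
    rates = position_click_rates(log)
    observed = defaultdict(int)
    expected = defaultdict(float)
    for query, doc, rank, was_clicked in log:
        observed[(query, doc)] += int(was_clicked)
        expected[(query, doc)] += rates[rank]
    # Results whose ranks never attract clicks have no usable expectation.
    return {key: observed[key] / expected[key]
            for key in observed if expected[key] > 0}

# Hypothetical toy log: (query, document, rank shown at, clicked?)
log = [
    ("coffee", "a", 1, True), ("coffee", "b", 2, False),
    ("coffee", "a", 1, True), ("coffee", "b", 2, True),
    ("tea", "c", 1, True), ("tea", "d", 2, False),
]
scores = clicks_over_expected(log)
```

In this toy log, document "b" is clicked less often than "a" in absolute terms, but scores higher once the lower click rate at rank 2 is taken into account.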
Next Generation Search from an Information Ecological Perspective
At the outset, our position repudiates the view that Google has solved the search problem. In the not too distant future, our information environment will be made even more complex by all sorts of information processing and display devices, combined with the ongoing information explosion. In parallel, the object of search will transcend individual documents as the distinction between structured information (databases) and unstructured information blurs even further. For example, the object of search may be the discovery of a meaningful connection between concepts, or a collection of online services to close an agenda, e.g., to open a coffee shop in Brisbane. In this environment, queries, the triggers for search, will be increasingly tacit, in contrast to the explicit queries commonly formulated today. The context of the search will need to be taken into account, mainly because context provides the means to effectively filter relevant information from that which is not relevant. As a consequence, current search technology, which essentially only matches documents with query representations, will need to be endowed with an inferential capability, for example to infer tacit queries from a given situation and to draw appropriate context-sensitive associations in order to support exploratory search behaviour, which comprises a mixture of serendipity, learning and investigation.
It is important to note that the inferred context sensitive associations should accord to a large degree with those we would make. This suggests operational socio-cognitive semantics - the technology should manipulate meanings that accord with those we harbour.
The term "ecological" suggests a holistic approach to how people, information and context interact in relation to the search task, whether explicit or tacit. This stance is in direct contrast to the current bifurcation between user and information space, with context, including user and task models, largely ignored. A serious candidate for furnishing operational forms of socio-cognitive knowledge representation is semantic space models. These have an encouraging track record of replicating human performance across a variety of information processing tasks. Thus far, such models have mainly been investigated within cognitive science. The next generation of search should look carefully at these models and see how they can be exploited. A highly speculative line of investigation would pursue the recently discovered connection between semantic space models and quantum mechanics (QM). This offers the possibility of building on some interesting new theory relating information retrieval (IR) and QM by the leading IR theorist Keith van Rijsbergen. Pursuing the QM line of investigation not only involves the discovery of a radically new class of models for search, but also allows us to question our philosophical positioning, which is almost exclusively realist.
Intelligent Text Processing at Macquarie's Centre for Language Technology
Macquarie's CLT engages in a range of activities in areas relevant to Next
Generation Search, and others will report on some of these. In my
presentation I'll give an overview of some activities we have been
carrying out in the general area of intelligent text processing: in
particular, our view is that next generation search systems will benefit
from indexing that is based on linguistically-motivated information
extraction, rather than treating documents simply as bags of words.
I'll say something about each of the following projects:
in collaboration with the Capital Markets Cooperative Research Centre,
we have been pursuing the development of a suite of information extraction
and text summarisation tools in the GainSpring project;
in work for the DSTO, we have been exploring temporal expression
recognition and normalisation, and cross-document entity tracking;
and
in a new collaborative pilot project with UNSW, we are focussing on
sentiment analysis in newspaper headlines.
We're keen to find ways of integrating what we are doing with the work of
others, particularly those working in information retrieval.
The wide acceptance and rapidly growing use of XML as a standard
storage and retrieval data format blurs the historical divide that
exists between collections that are used in Information Retrieval
on one side and in Database Retrieval on the other.
While most information retrieval systems operate at the document or
passage retrieval level, it has become possible with XML marked up
collections to take advantage of the rich semantic information
that is embedded in the documents themselves. It is possible to
specify in queries which elements are of interest and it is possible
to return XML elements. However, traditional IR is still working with
query models that do not utilize structural retrieval cues that users
are able to provide. We propose to extend the natural language query
model (NLQ) and to support natural language queries that not only specify
content requirements with respect to an information need, but also
specify structural requirements. This extended NLQ model lends itself to immediate
application of NLP techniques. We describe the results of early attempts
to support NLQ for XML information retrieval.
Relevance, salience and emergence: perspectives on information
retrieval in pragmatics
My main research interests lie in the field of pragmatics and intercultural communication. Drawing from conversation and discursive analysis, I take a broadly social constructivist view in analyzing pragmatic phenomena.
In relation to this workshop, three of my interests may have some relevance to the development of more efficient information retrieval systems: applications of relevance theory; cross-cultural rhetoric and information flow; and the co-constitution of intentions through interaction. In other words, three issues to which research in pragmatics might contribute are: (1) defining what is deemed "relevant" in retrieving information; (2) analyzing how the flow of information in texts is structured to give prominence or salience to certain elements; (3) understanding how the intentions underlying information searches may be interactively achieved (i.e. "emergent").
We envisage a future in which search is characterised not only by new
and improved technologies but by the ability to conduct
context-sensitive searches over multiple heterogeneous collections,
including private, corporate, public and subscription sources.
Standard IR evaluation questions become deeply challenging in this
context: How much benefit is actually derived from this new technique?
Does this great idea actually make any difference in practice? Is
system A better than system B? Well known evaluation techniques based
on test collections and human experimentation in the laboratory are
inadequate or limited in various ways.
In response, we have proposed a simple two-panel evaluation tool which
takes the place of a person's ordinary search interface and presents
two alternative sets of results side-by-side, randomised for left and
right. The person is invited to indicate "prefer left", "prefer right"
or "no difference" and may be asked for additional information through
unobtrusive questioning. We can demonstrate the tool, report results
obtained during validation of the approach and discuss its strengths
and weaknesses.
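The core bookkeeping of such a side-by-side comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool itself: the function names, the labels "A" and "B" for the two systems, and the tally structure are all hypothetical. The point it shows is that the side assignment is randomised per query and each stated preference is mapped back to the underlying system before counting.

```python
import random
from collections import Counter

CHOICES = ("prefer left", "prefer right", "no difference")

def present(results_a, results_b, rng):
    """Randomly assign systems A and B to the left and right panels."""
    if rng.random() < 0.5:
        return {"left": ("A", results_a), "right": ("B", results_b)}
    return {"left": ("B", results_b), "right": ("A", results_a)}

def record(panels, choice):
    """Map the person's choice back to the system that produced that panel."""
    if choice not in CHOICES:
        raise ValueError(f"unknown choice: {choice!r}")
    if choice == "no difference":
        return "tie"
    side = "left" if choice == "prefer left" else "right"
    return panels[side][0]

rng = random.Random(0)  # seeded only so that the sketch is repeatable
tally = Counter()
for choice in ["prefer left", "prefer right", "no difference", "prefer left"]:
    panels = present(["doc1", "doc2"], ["doc3", "doc4"], rng)
    tally[record(panels, choice)] += 1
```

Because the person never sees which system is on which side, a consistent habit of preferring one side does not systematically favour either system.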
My broad research interests of relevance are in the following areas:
Digital libraries: particularly search services for distributed
data environments and metadata standards
Web data mining: particularly large scale semi-structured data
acquisition and information retrieval engines
In particular, I am interested in the intersection of generic search
technologies and domain-specific, highly structured data, such as linguistic
data on the web; and in the application of broad coverage
classification techniques (e.g. for language identification of
documents) to web data as a catalyst for higher order domain-specific
search applications.
Points of Intersection / Debate:
Despite decades of research in information retrieval, and the
emergence of more generally accessible interfaces for information
discovery such as web search engines, a reasonably standard output
display method still dominates: ranked lists. It is clear that a wide
variety of alternative display types for engaging in information
discovery tasks are available, but these are not actively deployed.
Evidence from cognitive science research shows that different modes of
information engagement result in different information outcomes, yet
there is comparatively little research in how knowledge about human
communication preferences can be brought to bear in the web search
context, particularly in results display and user interaction.
In the linguistic domain, fine grained semantic distinctions can
have profound effects on interpretation of communicative intent; a
single utterance or piece of text can have many different
interpretations depending on the context it is presented in, and the
perspective from which the consumer approaches it. This semantic
ambiguity contrasts markedly with the granularities typically adopted
in information retrieval when considering such pseudo tasks as "Is
this document relevant to a given query? Yes or No". While some
research has been conducted into graduated assessments which allow for
greater human interpretation of the concept of 'relevance' between a
query and a document set, there remain a large number of open
questions, in both science and engineering terms, as to how to
effectively provision for semantic ambiguity in determining
relevance.
Machine transliteration, which deals with out-of-vocabulary (OOV) terms including proper names and technical terms, is the main focus of this research. We are interested in transliterating terms between English and Persian in particular, and may extend the work to languages with similar scripts, such as Urdu. The work therefore covers transliteration methods and the parameters affecting the process, namely issues in corpus construction and difficulties in character alignment, both of which affect effectiveness. In addition to generative transliteration, which generates terms based on previously seen word pairs, the discovery of transliteration pairs in existing parallel documents is a helpful method that will be covered in this research.
Statistical methods are widely used in transliteration, although the task could be viewed more from a linguistic perspective, since it depends directly on the habits of speakers of the languages involved. In addition, any multilingual information retrieval or question answering system, and particularly any machine translation application that may need to handle OOV terms, can benefit from adapting automatic transliteration techniques.
Multiword expressions (MWEs) are lexical items that can be decomposed into
multiple simplex words and display lexical, syntactic and semantic
idiosyncrasies (e.g. apple pie, hand in, make a mistake, a piece of cake).
MWEs are used in many NLP applications such as machine translation and IR.
For search engines, MWEs are an interesting issue because they occur as
multiple-word queries. The compositionality or non-compositionality of MWEs
can determine the possible variations of a given query. For instance, with a
compositional MWE as a query, we have to consider the components of the MWE
as possible sub-queries, while a non-compositional MWE cannot be split into
sub-queries. Also, the semantic relations in MWEs can narrow down the scope
of the search. From the perspective of query handling, it is worthwhile to
study how MWEs can provide efficient queries to a search engine.
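The sub-query idea can be illustrated with a small sketch. It assumes a hypothetical lexicon that records, for each known MWE, whether it is compositional. The lexicon entries, including the judgement that "apple pie" is compositional, and the function name are assumptions made for illustration only.

```python
# Hypothetical lexicon: MWE -> True if compositional, False if not.
MWE_LEXICON = {
    "apple pie": True,         # assumed: meaning follows from "apple" + "pie"
    "a piece of cake": False,  # idiomatic reading: must be kept whole
}

def query_variants(query, lexicon=MWE_LEXICON):
    """Return the query itself, plus its words as sub-queries when the
    query is a compositional MWE.

    Queries that are not in the lexicon are conservatively kept whole.
    """
    variants = [query]
    if lexicon.get(query, False):
        for word in query.split():
            if word not in variants:
                variants.append(word)
    return variants

compositional = query_variants("apple pie")
idiomatic = query_variants("a piece of cake")
```

A search engine could then issue the sub-queries as fallbacks for the compositional case, while treating the idiomatic expression strictly as a phrase.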
My research focuses on the intelligent coordination of information retrieval,
aggregation and delivery, based on reasoning about the context of a user's
interaction. In particular, I'm very interested in communication patterns in
enterprise email as a means of inducing structure in such data. My current
research is focussed on applying speech act theory to email communication, as a
means to automatically identify the discourse structure within email
conversations. More broadly, I'm interested in the problem of efficiently
acquiring information that can be used to characterize conversation structure
in any form of textual human-to-human discourse. This structure can then
hopefully be exploited to allow more sophisticated (and context-sensitive)
search, presentation and summarization of such data.
There are clear points of interaction with many other disciplines, including:
Human-Computer Interaction: How can we usefully present information
about discourse structure to end-users?
Conversational and Discourse Analysis: What analyses can be applied to
understand the structure and patterns in text-based computer-mediated
communication? How can such analysis help facilitate more sophisticated
search and presentation of such data?
Information Retrieval: How might discourse structure be used to guide or
adapt the retrieval of relevant information?
Data Mining: What data mining tools and approaches can be used to
analyse and learn about discourse structures within large data sets?
Natural Language Processing: What NLP tools and techniques can be used
to help process textual human-to-human discourse in order to apply
discourse and conversational analysis techniques?
Multimedia Information Retrieval by Artificial Languages
This work deals with semantic retrieval of multimedia. In addition to
verbal (natural) language, information in a multimedia document is also
communicated by using non-verbal languages. For instance, to make sense
of a solitaire game video, the audience will need to know the set of
cards and the rules of how the game is played.
In this workshop, we show how retrieval of non-verbally expressed
information in a multimedia document is better performed by indexing
documents with the vocabulary elements of the non-verbal language and
formulating queries using those vocabulary elements. In so doing, a great
number of semantic queries can be supported through post-indexing
coordination of the vocabulary elements.
My research involves response automation where a user expresses a
problem or inquiry that is longer and more involved than a typical
search-engine query, and therefore the required technology goes beyond
conventional question-answering systems. The particular application I
am currently investigating is an email-based helpdesk for
computer-related problems. Although the underlying problems that
trigger the inquiries to the helpdesk revolve around a relatively
small set of issues, there is a high textual variability in the emails
due to the way that customers express themselves. Further, it is
often very difficult to pinpoint what the actual query or question
is, due to the fact that customers provide background information that
obscures the question, and sometimes customers even omit the question
and expect it to be inferred from the background information. To make
things even more difficult, customers' emails are often ungrammatical
and poorly structured. Therefore, domains such as this one face an
important challenge of coming up with sophisticated representations
with which to build models of users' inquiries that lead to useful
models of response generation.
The SRI call should encourage researchers to investigate "user
queries" that go beyond the traditional ones. People are now well and
truly reliant on search technologies for addressing their
informational needs, and will start to expect such technologies to
deal with more complex inquiries, such as the email-based ones seen in
helpdesks. Our research is showing that such technologies certainly
need to have a deep understanding of the users' inquiries and, further,
of the users' intent. This may involve building user models that
are based on demographic factors as well as cognitive ones such as
interest and expectation. This kind of research can therefore benefit
from interaction with fields such as cognitive science and HCI.
My main interest in this area lies in the processing of multi-document discourses (e.g. newsgroup-style data streams) for information delivery. These information sources present important challenges for simple term matching methods, and require new approaches that take into account the structure of the data. The new tools would perform a linguistic analysis of the texts and obtain a conceptual representation of the segmented data streams, linking them to a factoid-based summary that can be easily accessed by the user.
I am also interested in the application of semantic similarity measures in IR for improved recall. For this we would rely on Language Technology tools that provide paraphrase identification and word similarity scores. This knowledge could be integrated in a query-document similarity formula for better ranking of the documents.
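As a rough illustration of how word similarity scores might enter a query-document similarity formula, the sketch below replaces exact term matching with the best similarity between each query term and any document term. The similarity table, its values and the function names are hypothetical; a real system would obtain the scores from Language Technology tools rather than a hand-written dictionary.

```python
# Hypothetical word-similarity scores in [0, 1], keyed by unordered pair.
WORD_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "truck")): 0.5,
}

def word_similarity(a, b):
    """Similarity of two words: 1.0 if identical, else a table lookup."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset((a, b)), 0.0)

def soft_match_score(query_terms, doc_terms):
    """Average, over query terms, of the best similarity to any document term."""
    if not query_terms or not doc_terms:
        return 0.0
    best_matches = (
        max(word_similarity(q, d) for d in doc_terms) for q in query_terms
    )
    return sum(best_matches) / len(query_terms)

doc = ["automobile", "rental", "sydney"]
soft = soft_match_score(["car", "rental"], doc)
exact = sum(term in doc for term in ["car", "rental"]) / 2
```

Here the document does not contain "car" at all, so exact matching credits only half of the query, whereas the soft score also credits the near-synonym "automobile". This is the sense in which such measures could improve recall.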
A trend in information retrieval is towards moulding the search in
view of the individual user. Information about the individual (the user
context) is needed to do this. I have been working in the area of the
geometric representation of meaning derived from the analysis of
the textual communication of individuals. This interdisciplinary area
involves sociology, cognitive psychology, information science and
linguistics; the underlying theory is pragmatic socio-cognitive
semantics.
The user context derives from modelling context-sensitive associations
and inferences that humans easily perform, and suggests associations in
context that we would make were we not epistemically challenged. These
associations and inferences may be generic, as well as specific to an
individual or community.
Such associations are necessary for exploratory search, which emphasises
serendipity, learning and investigation, as context about the user
is critical in this more uncertain search paradigm. The research has
potential within the new areas of "social computing" and the "science of
identity".
Question answering is about accepting questions written in plain
English and finding the answer by searching through unedited text
documents. Our project, AnswerFinder, combines various technologies to
identify the sentence containing the answer and extract the exact
answer. Currently we are studying the use of document retrieval,
question classification, and finding the answer to fact-based
questions, definition questions, and questions that require a summary
extracted from several documents.
A principal technique used in AnswerFinder is the representation of
questions and text sentences in a graph with concepts interrelated and
the use of machine-learning methods combined with Graph Theory to
determine if a sentence contains the answer and extract the answer. By
applying machine learning the system can be ported to other domains
and other languages. Furthermore, I am exploring the application of
these machine learning techniques at various levels of sentence
representation ranging from semantic networks to syntactic
dependencies to word neighbourhood.
As for a burning issue to handle in the future, I would mention
question answering for restricted domains. These domains may be rich
in lexical or ontological resources but they may not have enough data
to warrant the use of general (redundancy-based) methods.
My background is in cognitive science and natural language, and I am
moving into more concrete language technologies. My PhD thesis concerned
a study of language use in online journals (personal weblogs) as it
relates to the personality and gender of the author. This involved
working with language beyond mere function words or syntactic categories
and I used dictionaries of statistically-derived psychologically-related
definitions to explore individual relationships with, for example,
Extroversion. I am interested in how we can use information about an
author to develop our understanding of a text and vice versa. In my
current work, I will be looking more specifically at summarisation and
information extraction, using business reports as a data source. This may
be a far less personal genre of text, but similar principles can be
applied when looking to understand text.
Content analysis is a core approach within information extraction, but to
look beyond specific content into the language used can enrich work in the
area. Specifically, by looking at categories of words such as those
described above, by bringing more psychological meaning to language, we
can bring analysis a step closer to the way humans may approach language
processing. I believe my experience in this area, coupled with my desire
to become more involved in the community surrounding my current field,
gives me much to both bring to and take away from this workshop.
Search has evolved considerably in recent years, from finding all
relevant information, through sorting by relevance, removing duplicates
and so on, until we now have an incredible amount of data to search through.
Searching by content is no longer enough to reduce the search space. The
semantic web and tagging are one approach. Another, the one I am most
interested in, is search by style, not just content. One application of
my thesis work is to be able to search documents by the personality of the
author. When searching for film reviews, it might be nice to find out
what people with similar dispositions to yourself thought of a movie.
Market researchers looking for opinions of products might, for example, be
more focused if they concentrated only on the weblogs of their key
demographics.
In our work, we are looking into developing information systems capable of providing users with information appropriate for their needs and delivering it so that it is both understandable and useful to them. Clearly, search is an integral part of such systems, and we in fact believe that these systems are the next-generation search technology.
Our approach so far to achieve this aim has been to combine information retrieval and natural language generation technologies. We now believe that it is not enough to simply put together a search engine and a discourse generation engine. Instead, these technologies must be truly integrated and enhanced with notions of context.
This raises a number of issues, including: how do we contextualise search behaviour, which implies asking questions of context definition and acquisition; how do we identify what aspect of contextual information is most useful to guide or constrain search and delivery (can we design various experiments to do this, and how?); which aspects of the resulting approach and system do we evaluate and how?
We believe these questions would benefit from being approached from perspectives outside our own (NLP and LT), in particular: information retrieval, human-computer interaction, data mining, and conversational and discourse analysis.
Luiz Augusto Pizzato
My PhD research focuses on defining a framework for information
retrieval that incorporates relational information in text
such as bigrams and syntactic dependencies, and the
relations between arguments and predicates. Since the
framework is able to incorporate many linguistic-oriented
features, its validation will be conducted using an IR-dependent
task which is linguistically demanding, such as
Question Answering.
My primary research interest is the extraction and computational analysis of citation information from corpora of academic papers. Analysis of citations can potentially tell us a variety of useful information about papers and the relationships between them. There are three levels on which we can analyse citations. Analysis of an individual citation and its context can tell us the reason for the citation, and therefore the relationship between the citing and cited work. Analysis of collections of citations from a particular work can provide an overview of how that work places itself with respect to other works. Analysis of collections of citations to a paper (in other words, what other researchers say about that paper) can provide useful information about how other researchers use and view the work. Analysis of inter-document relationships using citations is a potential source of useful information on which to base tools for search and navigation of corpora of academic documents. In particular, analysis of what other people say about a document may provide more useful search terms than anything in the document itself.
The issues important to my research centre on using the sentences containing citations, and their context, to classify citation function, and on how to use citing sentences for applications such as document summarization and search. I am particularly interested in how citing sentences can be viewed as part of a discourse, and how discourse analysis can offer insights into how we can determine the context and meaning of a citation. Similarly, I am interested in exploring the relationship between formal semantics and computational linguistics, and how we can feasibly extract and represent semantic information from citing sentences. More broadly, I would be interested in exploring how inter-document relationships could improve document navigation and search, and how citations and other reporting sentences could be the basis for such relationships.
Lexxe Search Engine - A 3rd Generation Internet Search Engine Powered by Advanced Natural Language Technology
The philosophy upon which the Lexxe search engine is built differs from that of traditional 2nd generation search engines, in that the object of information processing is language rather than symbols.
Given this different starting point, which marks the change of generation, a 3rd generation search engine has to understand natural language to a certain extent. Even in the case of "Keyword" searches, the "one formula fits all" method adopted by most current search engines, including Google, has become very questionable. What is even worse is the introduction of "popular link" (PageRank) by Google, which simply proved the incompetence of such a "one formula" method. The "popular link" factor itself is a very speculative approach towards information retrieval. However, the fundamental problem with the 2nd generation search method is the general approach of "Symbolic Computing", while the 3rd generation search technology employs a "Linguistic Computing" approach. Since webpages are mostly made up of language texts, we want to argue that it is most natural to assume that search is a linguistic computing activity.
The Lexxe search engine has four major features using natural language technology: 1) keyword-based search with phrase recognition; 2) short question answering, which tries to find exact answers directly in the webpages; 3) clusters generated on the fly, offering themes and categorization of the result pages; and 4) screening of irrelevant pages.
Although Lexxe is still in its Alpha Version development, it has already attracted a considerable number of users through the showcase of its innovative technology. Many comments have been made on the Internet about Lexxe. They offer many insightful suggestions, which Lexxe will take very seriously in its further development.
Research interests: query log analysis; retrieval models; interactive
searching; evaluation of IR systems; query performance prediction.
Issues/points of interaction:
Evaluation of search systems: Ongoing research has demonstrated that
there is no correlation between the most widely-used IR performance
metrics and actual user search performance. New metrics that reflect
users and their search behaviour should therefore be developed. To do
this, the underlying assumptions of current IR evaluation need to be
re-visited. This could include the direct observation and analysis of
information-seeking behaviour in an online search environment, exploring
how users learn while conducting searches, identifying the different
types of search tasks that users actually engage in, and accounting for
factors such as interface design. This type of work would benefit from
the involvement of many different research areas, including the
cognitive sciences, HCI and IR.
Query performance prediction: If the effectiveness of a query can be
estimated in advance, a search system could respond dynamically to
situations where it is likely that a poor set of answers will be
returned. Most predictors use statistical information about the
distribution of terms within documents and a collection. Alternative
sources of evidence that could help to improve prediction might include
natural language and semantic features of query terms.
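As a rough illustration of a predictor built only on collection term statistics, the sketch below computes the average inverse document frequency (IDF) of the query terms over a toy collection. The function name, the smoothing, and the sample documents are assumptions made for this example; it is not claimed to be the predictor studied in this work.

```python
import math
from collections import Counter

def average_idf(query_terms, documents):
    """Mean smoothed IDF of the query terms over a tokenised collection.

    documents is a list of token lists. Add-one smoothing keeps the value
    finite for terms that do not occur in the collection at all.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    idfs = [math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in query_terms]
    return sum(idfs) / len(idfs) if idfs else 0.0

# Hypothetical three-document collection.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum entanglement of the photon".split(),
]
specific = average_idf(["quantum", "photon"], docs)  # rare, discriminating terms
vague = average_idf(["the", "cat"], docs)            # common terms
print(specific > vague)
```

A query made of rare terms receives a higher value than one made of common terms, which a system could use as a weak signal of how well the query discriminates between documents.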
Collection and Document Summarization In Distributed Text Retrieval
Resources
In distributed information retrieval (DIR) systems, users search multiple collections
simultaneously by submitting their query to a single interface known as the broker.
In a typical distributed retrieval scenario, first the broker compares the entered
queries with available collection summaries. Then a few top collections with the most
similar summaries are selected by the broker. Next, the query is passed to the selected
collections and they return their results usually with a short snippet for each answer.
Finally the returned documents from the selected collections are merged and presented to the user.
Text-summarization techniques can be helpful in two of the discussed steps. To provide a
summary for each collection, current techniques use a few sampled documents or a short
description that has to be provided manually. Effective collection summaries improve
the collection selection performance. In addition, document summaries (snippets) returned
by the selected collections can be used for merging.
The question is how linguistic techniques can be used for producing
effective summaries for both collections and documents.
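The broker steps described above (compare the query with collection summaries, select the top collections, forward the query, merge the returned snippets) can be sketched as follows. This is a minimal toy under stated assumptions: term overlap stands in for summary similarity and for snippet-based merging, and every name here (overlap_score, select_collections, broker_search, make_collection) is hypothetical rather than part of any existing DIR system.

```python
from collections import Counter

def overlap_score(query_terms, terms):
    """Toy similarity: how many times the query terms occur in a term list."""
    counts = Counter(terms)
    return sum(counts[t] for t in query_terms)

def select_collections(query_terms, summaries, k=2):
    """Rank collections by the similarity of their summaries; keep the top k."""
    ranked = sorted(
        summaries,
        key=lambda name: overlap_score(query_terms, summaries[name]),
        reverse=True,
    )
    return ranked[:k]

def broker_search(query, summaries, collections, k=2):
    """Select collections, forward the query, and merge results by snippet score."""
    query_terms = query.lower().split()
    merged = []
    for name in select_collections(query_terms, summaries, k):
        for doc_id, snippet in collections[name](query_terms):
            score = overlap_score(query_terms, snippet.lower().split())
            merged.append((score, name, doc_id, snippet))
    merged.sort(key=lambda result: result[0], reverse=True)
    return merged

def make_collection(docs):
    """Build a search function over {doc_id: text} returning (doc_id, snippet)."""
    def search(query_terms):
        return [
            (doc_id, text)
            for doc_id, text in docs.items()
            if any(t in text.lower().split() for t in query_terms)
        ]
    return search

# Hypothetical collections and their summaries (sampled terms).
collections = {
    "medical": make_collection({"m1": "clinical notes on heart disease",
                                "m2": "heart surgery outcomes"}),
    "sports": make_collection({"s1": "football match report"}),
    "news": make_collection({"n1": "heart disease rates rise",
                             "n2": "election results"}),
}
summaries = {
    "medical": "clinical heart disease surgery patient heart".split(),
    "sports": "football match goal league".split(),
    "news": "election heart economy rates".split(),
}
results = broker_search("heart disease", summaries, collections, k=2)
for score, name, doc_id, snippet in results:
    print(score, name, doc_id, snippet)
```

In this sketch the quality of both the collection summaries and the returned snippets directly determines selection and merging, which is where better summarization would be expected to help.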
Bridging the Gap between What is Known and What is Unknown
Consider a hypothetical universe of possible facts which are derivable from some text data set, and a hypothetical information seeker. Some of these facts may be known by the seeker, other facts may be known only by other individuals who have contributed to the data set, and other facts are not known by any individual contributor.
Current search capability often focusses on retrieving a document which approximately matches a statement of some fact provided by the seeker - the query. If the seeker simply wants to retrieve a document they already believe exists, then the task reduces down to human recall: trying to remember what distinctive elements of the item would be best to use in the query (1).
Beyond plain retrieval, the seeker may wish to find other facts which are unknown to them, but which are explicitly stated in some documents. This task is normally performed by formulating the best known query and browsing through retrieved documents which approximately match that query. The success of this depends on the initial level of knowledge of the seeker and their ability to formulate a query with appropriate vocabulary and specificity (2). Query refinement is an important method for the seeker here.
If the seeker wishes to go further and discover facts which are not known by any text contributor, but are derivable from the data, the only option currently is for the seeker to absorb the retrieved documents and synthesize the information using their own cognitive capabilities (3).
Our research addresses problems (1), (2), and (3) above, by the provision of an Analyst Support System which employs unsupervised machine learning to construct a semantic map of the text data set. This mapping system employs both distributed and symbolic representations, and objects, attributes, and relationships in its model of the knowledge. This system, called Leximancer, addresses these three problems as follows: concept building by seeded thesaurus discovery to enhance recall while maintaining precision, concept profiling to reveal the semantic context of a target concept or a document set, and abductive reasoning in semantic space to discover hypotheses not present in any one document.
Starkie Enterprises - Next Generation Search Research
Research interests pertinent to workshop theme:
Natural Language Processing and Language Technology, Information Retrieval, Formal Semantics, Formal Syntax and Morphology.
This presentation will briefly describe the question-answering product being developed at Starkie Enterprises. The presentation will describe the underlying philosophy of the research, the parsing technique & the knowledge representation used. Starkie Enterprises have developed a new method of inferring ultra-fast robust parsers from corpora ideally suited to parsing billions of web pages. A very brief introduction to the knowledge representation being investigated will be presented, along with an explanation of how it is ideally suited to the task of information retrieval, and how the information can be used by automated reasoning systems.
Research Interests:
I'm interested in the development of robust linguistic analysis techniques (e.g. Lexical Cohesion Analysis, Textual Entailment and Paraphrase Identification, Toponym Resolution) for use in NLP and IR applications such as Ad hoc Retrieval, Text Summarisation, Question Answering and Text Classification. I am currently involved in the NICTA I2D2 project which is investigating the use of NLP techniques to enhance Geospatial Information Retrieval, i.e. improving retrieval results for queries containing references to place names.
Abstract Title: A plea for more analysis and less metric tuning
Recent results from research initiatives such as TREC and DUC suggest that as the required quality of a user response increases, IR techniques greatly benefit from NLP, e.g. changing a ranked list of passages to a factoid based answer. In contrast, little or no progress has been made in the application of NLP techniques to general ad hoc retrieval tasks. However, NLP may have a "niche" role to play in improving retrieval results for specific query types such as geospatial queries. Hence, it is the responsibility of the IR community to provide a more detailed failure analysis of their TREC-style experiments, i.e. which query types are they performing poorly on and why.
Similarly, the NLP community can be criticised for focusing too much on automatic metric scores and not enough on detailed analysis. As more and more open source NLP components are made available, researchers are building large pipeline architectures. Many of these systems are credited with improving QA and Summarisation performance over baseline approaches such as bag-of-words. However, as these systems usually comprise a conglomerate of NLP components, it is unclear which of the NLP techniques contributed most to the resultant improvement. If NLP researchers knew which components needed further development then additional gains may be possible. This suggests that we need to augment current application-based evaluation methodologies with component level analysis at different points in the NLP system pipeline.
Future search tools will have to work with the tremendous range of
online information available to users, including the entire Web but
also corporate sources such as subscription services or databases and
personal sources such as calendars or email archives. The standard
centrally-indexed model of search will not work in these situations,
which argues for a "metasearch" or distributed model; however this
raises many challenges. How can we determine which of the possible
data sources, as different as email archives or online databases,
might have information the user needs? How can we rank, cluster, or
otherwise display results from such different sources? Can we learn
something about individual users, and if so can we generalise from
this to other users? And how can we evaluate tools we might build?
We have a prototype personal metasearch tool, which provides a testbed
for experiments in these areas, and can report some preliminary
results in using simple language models to select data sources and
rank results.
Yohannes Tsegay (RMIT), Andrew Turpin (RMIT), and Dave Hawking (CSIRO)
Snippet Extraction for Web Search Engine Result Lists
Internet search engines attempt to present small summaries, or
snippets, of retrieved documents in their result lists. Though
considerable work has been done to improve the effectiveness and
efficiency of retrieving relevant documents, little work has been done
on snippet extraction.
Existing search engines perform snippet generation by processing each retrieved document in its entirety from beginning to end, extracting and scoring sentences that contain query words. By re-ordering sentences in documents so that sentences more likely to appear in snippets are at the front of the document, the process can be made more efficient. Moreover, "bad" sentences can be omitted from the locally stored version of documents altogether. Our research focuses on this, and other techniques, for making snippet generation more efficient.
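The baseline process described above, scanning a document and scoring the sentences that contain query words, can be sketched as below. The sentence splitter, the scoring rule (number of distinct query terms matched), and the function names are simplifying assumptions for illustration, not the scoring used by any particular search engine or by this research.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def snippet(text, query, max_sentences=2):
    """Return the best-scoring sentences for the query, in document order."""
    query_terms = set(query.lower().split())
    scored = []
    for position, sentence in enumerate(split_sentences(text)):
        words = set(re.findall(r"\w+", sentence.lower()))
        score = len(query_terms & words)  # distinct query terms in the sentence
        if score > 0:
            scored.append((score, position, sentence))
    # Highest score first; ties go to the sentence that appears earlier.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:max_sentences]
    return " ".join(s[2] for s in sorted(top, key=lambda s: s[1]))

doc = ("Search engines index the web. Snippets summarise each result. "
       "A good snippet contains the query words. Unrelated text is skipped.")
print(snippet(doc, "snippet query words"))
```

Because every sentence is visited, the cost grows with document length; storing likely snippet sentences first would let a loop like this stop early once enough candidates have been found.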
Dynamic Relevance Criteria During Search
Andrew Turpin (RMIT)
Most current information retrieval (IR) experiments follow the Cranfield model, where there is a fixed collection of documents and queries, and a fixed set of relevance judgments for each query. That is, for each query there is a list of documents that have been judged on a (typically binary) scale between relevant and irrelevant to that query. However, in reality users change their relevance criteria as they read documents and learn more about a topic on which they are searching.
In particular, when people read a typical results list of the type returned by Google, Yahoo, etc, their idea about what is relevant and what is not may change. This in turn will affect which link they select to view. This undermines the traditional IR evaluation methodology which assumes a fixed relevance judgment for each document a priori. I am interested in designing simple experiments to begin to characterise this behaviour, and in turn designing a methodology for comparing IR systems that will allow for dynamic relevance judgements. I will need help from other people at this forum.
Most text retrieval research is based on the assumption that
a user is looking for documents that are "about" a topic, or
are a "known" website. However, not all searches are of
that nature. A small body of research examines the problem
of searching according to linguistic criteria. The typical
user is assumed to be someone who is operating in or learning a second or
foreign language.
Our current work in the above field has examined the idea of
retrieving text based on its readability as a foreign language.
Most readability measures use simple statistics based on sentence
length and estimates of vocabulary difficulty. We have found that
the frequency of cognates has a small effect on perceived
readability for foreign language learners. We are exploring
techniques for automatically identifying cognates in text.
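To make the kinds of statistics mentioned above concrete, the sketch below extracts three simple features from a text: average sentence length, the proportion of words outside a list of easy words, and the proportion of words found in a cognate list. The word lists are hand-supplied and hypothetical, the function name is an assumption, and no weighting of the features is implied; automatic cognate identification is not attempted here.

```python
import re

def readability_features(text, easy_words, cognates):
    """Compute simple surface features often used by readability measures."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not sentences or not words:
        return {"avg_sentence_length": 0.0,
                "hard_word_ratio": 0.0,
                "cognate_ratio": 0.0}
    return {
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Share of words that are not in the easy-word list.
        "hard_word_ratio": sum(w not in easy_words for w in words) / len(words),
        # Share of words that appear in the cognate list.
        "cognate_ratio": sum(w in cognates for w in words) / len(words),
    }

# Hypothetical word lists for a learner whose first language shares these cognates.
features = readability_features(
    "The hotel is near the station. The restaurant is expensive.",
    easy_words={"the", "is", "near"},
    cognates={"hotel", "station", "restaurant"},
)
print(features)
```

Here three of the four words counted as hard are cognates, the kind of case where a vocabulary-based estimate could overstate difficulty for a particular learner.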
I'm working on projects that use natural language processing to process
medical notes, such as pathology reports and clinical notes. Medical
reports and clinical notes contain a lot of noise, such as weak grammar and
term variations or misspellings in core medical terms. These poor-quality
texts are difficult for general information systems to understand. However,
most analyses of medical and clinical notes require deep understanding of
the text. We are interested in studying such noisy and minimally grammatical
texts. The project aims to use NLP techniques to extract useful information
from unstructured clinical notes. As a first step, we focused on studying
medical terminologies. We built a system that indexes medical concepts in
free text into a medical terminology (SNOMED CT) and transforms them into a
knowledge representation. We developed a lexically based mapping algorithm
and enhanced it with NLP techniques. The system is able to identify negation
and qualification of concepts in medical notes.
Medical terms in text convey a large amount of knowledge used in
communication between clinicians. However medical terminology has a lot of
variations, for example different preferred terms and synonyms. It is
believed that coded information can improve information retrieval. Using a
standard terminology / ontology to support encoding clinical information is
critical. Indexing user queries and medical concepts in text is a challenging
problem. Another important issue is the standardisation of the terminology.
Terminologies used in different countries, hospitals, and sections within
hospitals differ, which can cause problems in indexing medical concepts.
There is a need to map different terminologies into a standardised
terminology that supports information systems. However, it is impossible to
express all possible terms in one terminology. Term composition and
decomposition is a way to construct meaningful terms and relationships for
absent terms.
"Bang for Buck" -- what context improves retrieval, and is it worth it?
A key paper by Ragnar Nordlie in SIGIR '99, "'User revealment' -- a comparison of initial queries and ensuing question development in online searching and in human reference interactions", demonstrated that context was discovered by reference librarians, and that this was the key to their success. At SIGIR 05, a workshop on context revealed just how many elements of context could be discovered. But which should be captured? The cost of capturing them all is beyond humans and machines, and the likely marginal benefit gradually reduces as we capture more context.
Language technologies have been most typically applied to content, rather than context, and yet context can be "milked" in quite simple ways, quite powerfully. For example, a query of the form "When..." rather than "What..." can trigger quite different retrieval responses and has been routinely used to considerable effect in QA systems. This simple example shows that language technologies can be useful tools for inferring context that is not typically available to bag-of-words approaches, where the contextual clues may well be treated as stop words in a query. Going further, when we look at a dialogue rather than a single query, each step in the dialogue can be used as a context revealment opportunity. Context can also improve the delivery of information -- context is used heavily in language generation, and in other work we look at improving retrieval delivery in the same way.
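The "When..." versus "What..." example can be illustrated with a very small sketch that reads the leading question word as a contextual clue about the expected answer. The mapping table and the function name are hypothetical and deliberately tiny; real QA systems use far richer question classification.

```python
# Hypothetical mapping from leading question word to expected answer type.
EXPECTED_ANSWER = {
    "when": "date or time",
    "where": "location",
    "who": "person or organisation",
    "what": "definition or entity",
}

def answer_type(query):
    """Guess the expected answer type from the first word of the query."""
    tokens = query.lower().split()
    if not tokens:
        return None
    first = tokens[0].strip("?.,!\"'")
    return EXPECTED_ANSWER.get(first)  # None when there is no question word

print(answer_type("When was SIGIR founded?"))
print(answer_type("What is SIGIR?"))
print(answer_type("sigir history"))
```

A bag-of-words pipeline that drops "when" and "what" as stop words would treat the first two queries almost identically, which is exactly the contextual clue this sketch keeps.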
We are interested in experiments that compare the cost of capture of context versus the benefit of that context -- there is typically a very high cost to the user if they have to provide explicit contextual disambiguation -- can we use language technologies to reduce the cost and increase the benefit of knowing the context?
This project aims to help clinical decision making by providing the most up-to-date and relevant knowledge from fast-growing on-line repositories of clinical findings like MEDLINE and BioMed Central. We are trying to retrieve the most similar instances of case reports according to the unique clinical profile of the current patient, including symptoms, laboratory test results, treatments, and health histories. There are two important issues to be addressed in the project. Firstly, an information extraction system is required to identify text segments that contain information about the clinical profiles of patients. The extracted text can be further normalised into template-like forms which provide an intermediate level of knowledge of clinical data. Secondly, retrieval of the most relevant cases relies on a combination of detailed clinical profiles rather than on simple keyword-based searching.
We are trying to build a system that exploits full research articles rather than only abstracts, which generally omit detailed clinical histories of the diagnostic and treatment process, such as laboratory test readings and descriptions of patients' symptoms. As a preliminary attempt, a novel mark-up tag set has been proposed to cover a wide variety of semantics in patient case reports. We have created a manually annotated corpus with 75 journal articles and 5,117 sentences as the starting point for training our information extraction system. The next step of our project is to normalise the extracted text and to find a better way of matching relevant cases with similar clinical profiles.
With the emergence of social websites such as Flickr and del.icio.us,
average Internet users are becoming more and more active in contributing and
sharing information on the web. Various social networks of Internet users
form and evolve around those social websites. Investigating the
effects of social networks on Internet users' information retrieval and
contributing behaviour may shed light on the current search technology.
Many social websites provide information such as the most popular tags, the
most recent news, and recommendation lists based on friends' preferences, yet
there is no clear evidence on how such information may influence users' search
behaviour.
I am planning to conduct a series of controlled laboratory experiments to
observe and study users' information retrieval behaviour and the effect
of explicit social network information. The research needs to draw on
theories from individual decision making and social cognitive science to
understand the decision process of choosing the keywords and ranking the
results. The rapid adoption of web publishing tools like wikis and social
software enables many Internet users to play both the information provider
and consumer roles. Understanding the mental process of those users and the
social influence on the process may provide guidelines for managing and
searching the information contributed mainly by average Internet users.
Directions for Search with Rich Language Processing
Web-based search engines solve a narrow problem extremely well. With
diverse sources of evidence (vast quantities of data, anchor text and
link evidence, past queries, and implicit user judgements), the great
majority of searches are accurately resolved using bag-of-words
methods. Such success now extends to tasks such as image search and
cross-lingual retrieval.
However, there are many search-related tasks that could well benefit
from richer approaches to text processing. These include question
answering; query-based fact extraction; authorial support; searching
tailored to reading level; directed clustering; identity
disambiguation; and querying of data streams such as speech
transcripts. Such investigations could concern data from the open web,
but more profitably might focus on better-managed data such as that
from a digital library, institutional website, or other special-purpose
resource. Challenges include how to undertake such search tasks and
how, algorithmically, to provide them efficiently on a large scale.
The ongoing theme of our research has been algorithmic mechanisms for
efficient, effective search. In this context we have over the last
decade investigated a wide range of search enhancements such as use of
phrases for collection browsing, spelling correction, bioinformatic
search, cross-lingual information retrieval, speech retrieval, and
authorship attribution. Techniques based on rich language processing
are a natural extension to this work.