HCSNet Next Generation Search Technology - Presentations

Presenters

  1. Elena Akhmatova, elena@ics.mq.edu.au, Macquarie University (student), Sydney
  2. Eric Bae, kheb@csse.unimelb.edu.au, University of Melbourne (student), Melbourne
  3. Timothy Baldwin, tim@csse.unimelb.edu.au, Melbourne
  4. Bodo Billerbeck, Bodo.VonBillerbeck@sensis.com.au, Sensis, Melbourne
  5. Peter Bruza, p.bruza@qut.edu.au, Queensland University of Technology, Brisbane
  6. Robert Dale, rdale@ics.mq.edu.au, Macquarie University, Sydney
  7. Shlomo Geva, s.geva@qut.edu.au, Queensland University of Technology, Brisbane
  8. Michael Haugh, m.haugh@griffith.edu.au, Griffith University, Nathan, Queensland
  9. David Hawking, David.Hawking@csiro.au, CSIRO ICT Centre, Canberra
  10. Baden Hughes, badenh@csse.unimelb.edu.au, University of Melbourne, Melbourne
  11. Sarvnaz Karimi, sarvnaz@cs.rmit.edu.au, RMIT (student), Melbourne
  12. Su Nam Kim, snkim@csse.unimelb.edu.au, CSSE/University of Melbourne, Melbourne
  13. Andrew Lampert, Andrew.Lampert@csiro.au, CSIRO ICT Centre (student), Sydney
  14. Jose Lay, Jose.Lay@csiro.au, CSIRO ICT Centre, Sydney
  15. Yuval Marom, Yuval.Marom@infotech.monash.edu.au, Monash University, Clayton, Victoria
  16. David Martinez, davidm@csse.unimelb.edu.au, University of Melbourne, Melbourne
  17. Robert McArthur, Robert.McArthur@csiro.au, CSIRO ICT Centre, Canberra
  18. Diego Molla-Aliod, diego@ics.mq.edu.au, Macquarie University, Sydney
  19. Scott Nowson, snowson@ics.mq.edu.au, Macquarie University, Sydney
  20. Cecile Paris, Cecile.Paris@csiro.au, CSIRO ICT Centre, Sydney
  21. Luiz Augusto Pizzato, pizzato@ics.mq.edu.au, Macquarie University (student), Sydney
  22. Brett Powley, bpowley@ics.mq.edu.au, Macquarie University (student), Sydney
  23. Hong Liang Qiao, hong.qiao@lexxe.com, Lexxe Pty, Sydney
  24. Falk Nicolas Scholer, fscholer@cs.rmit.edu.au, RMIT, Melbourne
  25. Milad Shokouhi, milad@cs.rmit.edu.au, RMIT (student), Melbourne
  26. Andrew Smith, asmith@humanfactors.uq.edu.au, University of Queensland, Brisbane
  27. Brad Starkie, bstarkie@starkieenterprises.com, Starkie Enterprises, Melbourne
  28. Nicola Stokes, nstokes@csse.unimelb.edu.au, NICTA, Melbourne
  29. Paul Thomas, paul.thomas@anu.edu.au, Australian National University (student), Canberra
  30. Yohannes Tsegay, ̀1Ǧ940@student.rmit.edu.au, RMIT, Melbourne
  31. Andrew Turpin, aht@cs.rmit.edu.au, RMIT, Melbourne
  32. Sandra Uitdenbogerd, sandrau@rmit.edu.au, RMIT, Melbourne
  33. Yefeng Wang, ywang1@it.usyd.edu.au, University of Sydney (student), Sydney
  34. Ross Wilkinson, Ross.Wilkinson@csiro.au, CSIRO ICT Centre, Melbourne
  35. Yitao Zhang, yitao@it.usyd.edu.au, University of Sydney (student), Sydney
  36. Ying Zhou, zhouy@it.usyd.edu.au, University of Sydney, Sydney
  37. Justin Zobel, jz@cs.rmit.edu.au, RMIT, Melbourne

Abstracts

  1. Elena Akhmatova [presentation slides]

    Textual Entailment Recognition is the task of deciding, given two text fragments, whether the meaning of one text is entailed by the other. By designing systems that can cope with textual inference, we test the extent to which NLP systems can claim to "understand" language. My current interest in the field, a subtask of the work I am doing, is the identification and classification of the inference mechanisms that underlie proofs of the entailment relation between text fragments. The main goal is to build a system that handles textual inference at a level exceeding the current state of the art.

    Textual entailment is believed to be relevant for Question Answering, Information Retrieval, and Information Extraction tasks in general. My intuition is that it is relevant for any application that aims to find particular types of information in a restricted amount of resources, or to match pieces of information to each other. It can be used in QA systems when the assumption that the answer will be presented in a simple form somewhere in the text does not hold and a more complicated answer search has to be done. It might also be useful in dialog or tutoring systems, where there is no easy way to check whether a user's answer to a particular question matches the expected answer.

  2. Eric Bae [presentation slides]

    Research interests pertinent to the workshop theme:

    The primary research areas I am interested in are data clustering and clustering validation. My recent paper (accepted at the Australian AI Conference) discusses a novel technique that provides more intuitive clustering similarity values than traditional methods. Moreover, the technique is suitable for stream data clustering, where clusters may not have overlapping regions.

    Another main focus of my research is retrieving an alternate clustering given an original clustering. This arises from the fact that there could be several clusterings present within data and current clustering techniques ignore this fact.

    Document clustering has been an important component of IR systems over the years and has proved effective in helping users to find relevant information more efficiently. My research could add valuable information to the current state of document clustering and help build the next generation IR system.

    Thoughts on issues deemed to be important and potential points of interaction with other disciplines:

    I believe the next generation IR system must incorporate more accurate organization of information instead of a plain ranked list of documents. This makes cluster analysis an extremely valuable tool. However, because clustering suffers from the subjective notion of "similarity" and is highly dependent on the clustering criterion, investigating novel methods to generate highly informative clusters of documents is a critical issue.

    Consequently, this requires tighter involvement of NLP, IR and data mining, where techniques from NLP help IR systems to retrieve more relevant documents, which in turn helps data mining tasks (i.e. clustering) to generate more effective sets of clusters.

  3. Timothy Baldwin [presentation slides]

    Research interests: language technology, web and text mining, machine learning

    Issues: cross-document indexing, language identification/segmentation in web documents, information extraction from web documents

  4. Bodo Billerbeck (Sensis), Andrew Turpin and Falk Scholer (RMIT) [presentation slides]

    Web search engines typically present search results based on the user query, taking into consideration statistics such as document and term frequencies. Other statistics that give an indication of the popularity and content of documents, including anchor text and page rank, are also of importance.

    More expressly collaborative mechanisms are now emerging. Assuming no other clues about the user's intent are available, ideally an editor would rank all pages in response to the user query, judging the relevance of each document. Similarly useful would be the judgements of other users who have issued the same query previously; however, users are typically reluctant to provide a judgement of relevance. Implicitly, this information can be deduced from the clicking behaviour of earlier users, collected during their interaction with the search engine. By now, most engines have collected several years' worth of query logs and click-through information.

    How can feedback of this type be used to attribute general document quality scores, or to re-rank results for a particular query? Many factors other than relevance contribute to a user clicking on a particular result, for instance a misleading document summary and the general habit of clicking on higher-ranked results rather than those lower in the rankings. How can it be reliably determined from the clicks whether a page is of high quality (either in response to a query, or just by itself)?
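
    As one illustration of the position-bias problem raised above, the sketch below divides each document's observed clicks by the clicks its rank positions would be expected to attract. This is a common normalisation idea and not a method proposed by the presenters; the function name, the log format and the baseline click rates in `rank_ctr` are all assumptions made for the example.

```python
from collections import defaultdict

def position_adjusted_scores(log, rank_ctr):
    """Score documents by observed clicks divided by expected clicks.

    log: iterable of (doc_id, rank, clicked) impressions (hypothetical format).
    rank_ctr: assumed baseline click rate for each rank position.
    """
    clicks = defaultdict(float)
    expected = defaultdict(float)
    for doc_id, rank, clicked in log:
        clicks[doc_id] += 1.0 if clicked else 0.0
        # Each impression at this rank "expects" rank_ctr[rank] clicks.
        expected[doc_id] += rank_ctr[rank]
    return {d: clicks[d] / expected[d] for d in expected if expected[d] > 0}

# Illustrative data only: d2 is shown lower but clicked just as often as d1.
rank_ctr = {1: 0.5, 2: 0.25}
log = [("d1", 1, True), ("d1", 1, False), ("d2", 2, True), ("d2", 2, False)]
scores = position_adjusted_scores(log, rank_ctr)
```

    A score above 1 means a document attracted more clicks than its position alone would predict. The sketch does not address the other factor mentioned above, misleading summaries.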

  5. Peter Bruza [presentation slides]

    Next Generation Search from an Information Ecological Perspective

    At the outset, our position repudiates the view that Google has solved the search problem. In the not too distant future, our information environment will be made even more complex by all sorts of information processing and display devices combined with the ongoing information explosion. In parallel, the object of search will transcend individual documents as the distinction between structured (databases) and unstructured information blurs even further. For example, the object of search may be the discovery of a meaningful connection between concepts, or a collection of online services to close an agenda, e.g., to open a coffee shop in Brisbane. In this environment, queries, the triggers for search, will be increasingly tacit, in contrast to the explicit queries commonly formulated today. The context of the search will need to be taken into account, mainly because context provides the means to effectively filter relevant information from that which is not. As a consequence, current search technology, which essentially only matches documents with query representations, will need to be endowed with an inferential capability, for example to infer tacit queries from a given situation and to draw appropriate context-sensitive associations in order to support exploratory search behaviour, which comprises a mixture of serendipity, learning and investigation. It is important to note that the inferred context-sensitive associations should accord to a large degree with those we would make. This suggests operational socio-cognitive semantics: the technology should manipulate meanings that accord with those we harbour.

    The term "ecological" suggests a holistic approach to how people, information and context interact in relation to the search task, whether explicit or tacit. This stance is in direct contrast to the current bifurcation between user and information space, with context, including user and task models, largely ignored. Semantic space models are a serious candidate for furnishing operational forms of socio-cognitive knowledge representation. These have an encouraging track record of replicating human performance across a variety of information processing tasks. Thus far, such models have mainly been investigated within cognitive science. The next generation of search should look carefully at these models and see how they can be exploited. A highly speculative line of investigation would pursue the recent discovery connecting semantic space models and quantum mechanics (QM). This offers the possibility of building on some interesting new theory relating information retrieval (IR) and QM by a leading IR theorist, Keith van Rijsbergen. Pursuing the QM line of investigation not only involves the discovery of a radically new class of models for search, but also allows us to question our philosophical positioning, which is almost exclusively realist.

  6. Robert Dale [presentation slides]

    Intelligent Text Processing at Macquarie's Centre for Language Technology

    Macquarie's CLT engages in a range of activities in areas relevant to Next Generation Search, and others will report on some of these. In my presentation I'll give an overview of some activities we have been carrying out in the general area of intelligent text processing: in particular, our view is that next generation search systems will benefit from indexing that is based on linguistically-motivated information extraction, rather than treating documents simply as bags of words.

    I'll say something about each of the following projects:

    • in collaboration with the Capital Markets Cooperative Research Centre, we have been pursuing the development of a suite of information extraction and text summarisation tools in the GainSpring project;
    • in work for the DSTO, we have been exploring temporal expression recognition and normalisation, and cross-document entity tracking; and
    • in a new collaborative pilot project with UNSW, we are focussing on sentiment analysis in newspaper headlines.

    We're keen to find ways of integrating what we are doing with the work of others, particularly those working in information retrieval.

  7. Shlomo Geva [presentation slides]

    Natural Language Queries for XML IR

    The wide acceptance and rapidly growing use of XML as a standard storage and retrieval data format blurs the historical divide between collections used in Information Retrieval on one side and in Database Retrieval on the other. While most information retrieval systems operate at the document or passage retrieval level, it has become possible with XML marked-up collections to take advantage of the rich semantic information that is embedded in the documents themselves. It is possible to specify in queries which elements are of interest, and it is possible to return XML elements. However, traditional IR is still working with query models that do not utilize the structural retrieval cues that users are able to provide. We propose to extend the natural language query (NLQ) model to support natural language queries that not only specify content requirements with respect to an information need, but also specify structural requirements. This extended NLQ model lends itself to immediate application of NLP techniques. We describe the results of early attempts to support NLQ for XML information retrieval.

  8. Michael Haugh [presentation slides]

    Relevance, salience and emergence: perspectives on information retrieval in pragmatics

    My main research interests lie in the field of pragmatics and intercultural communication. Drawing from conversation and discursive analysis, I take a broadly social constructivist view in analyzing pragmatic phenomena. In relation to this workshop, my interests in applications of relevance theory, cross-cultural rhetoric and information flow, and the co-constitution of intentions through interaction are three areas which may potentially have some relevance to the issue of developing more efficient information retrieval systems. In other words, three issues about which research in pragmatics might possibly have something to contribute are: (1) defining what is deemed "relevant" in retrieving information; (2) analyzing how the flow of information in texts is structured to give prominence or salience to certain elements; (3) understanding how the intentions underlying information searches may be interactively achieved (i.e. "emergent").

  9. David Hawking and Paul Thomas [presentation slides]

    We envisage a future in which search is characterised not only by new and improved technologies but by the ability to conduct context-sensitive searches over multiple heterogeneous collections, including private, corporate, public and subscription sources. Standard IR evaluation questions become deeply challenging in this context: How much benefit is actually derived from this new technique? Does this great idea actually make any difference in practice? Is system A better than system B? Well-known evaluation techniques based on test collections and human experimentation in the laboratory are inadequate or limited in various ways.

    In response, we have proposed a simple two-panel evaluation tool which takes the place of a person's ordinary search interface and presents two alternative sets of results side-by-side, randomised for left and right. The person is invited to indicate "prefer left", "prefer right" or "no difference" and may be asked for additional information through unobtrusive questioning. We can demonstrate the tool, report results obtained during validation of the approach and discuss its strengths and weaknesses.
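
    The core mechanics of such a two-panel comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' tool: the function names, the "A"/"B" system labels and the tally structure are all hypothetical.

```python
import random
from collections import Counter

def assign_panels(results_a, results_b, rng):
    """Randomly place the two systems' results on the left and right panels."""
    if rng.random() < 0.5:
        return {"left": ("A", results_a), "right": ("B", results_b)}
    return {"left": ("B", results_b), "right": ("A", results_a)}

def record_preference(panels, choice, tally):
    """Map a 'left', 'right' or 'none' choice back to the underlying system."""
    if choice == "none":
        tally["no difference"] += 1
    else:
        system, _results = panels[choice]
        tally[system] += 1

# Illustrative session: the person sees two anonymous panels and picks one.
rng = random.Random(0)
tally = Counter()
panels = assign_panels(["a1", "a2"], ["b1", "b2"], rng)
record_preference(panels, "left", tally)
```

    Because the side is randomised for each query, a habitual preference for one panel should not systematically favour either system.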

  10. Baden Hughes [presentation slides]

    My broad research interests of relevance are in the following areas:

    • Digital libraries: particularly search services for distributed data environments and metadata standards
    • Web data mining: particularly large scale semi-structured data acquisition and information retrieval engines

    In particular, I am interested in the intersection of generic search technologies and domain-specific, highly structured data, such as linguistic data on the web; and in the application of broad coverage classification techniques (e.g. for language identification of documents) to web data as a catalyst for higher order domain-specific search applications.

    Points of Intersection / Debate:

    • Despite decades of research in information retrieval, and the emergence of more generally accessible interfaces for information discovery such as web search engines, a reasonably standard output display method still dominates: ranked lists. It is clear that a wide variety of alternative display types for engaging in information discovery tasks are available, but these are not actively deployed. Evidence from cognitive science research shows that different modes of information engagement result in different information outcomes, yet there is comparatively little research into how knowledge about human communication preferences can be brought to bear in the web search context, particularly in results display and user interaction.
    • In the linguistic domain, fine-grained semantic distinctions can have profound effects on the interpretation of communicative intent; a single utterance or piece of text can have many different interpretations depending on the context it is presented in, and the perspective from which the consumer approaches it. This semantic ambiguity contrasts markedly with the granularities typically adopted in information retrieval when considering such pseudo tasks as "Is this document relevant to a given query? Yes or No". While some research has been conducted into graduated assessments which allow for greater human interpretation of the concept of 'relevance' between a query and a document set, there remain a large number of open questions, in both science and engineering terms, as to how to effectively provide for semantic ambiguity in determining relevance.

  11. Sarvnaz Karimi, Falk Scholer, and Andrew Turpin [presentation slides]

    Machine transliteration, which deals with out-of-vocabulary (OOV) terms including proper names and technical terms, is the main focus of this research. We are interested in transliterating terms between English and Persian in particular, and may extend that to languages with similar scripts, such as Urdu. This leads to work on transliteration methods and on the parameters affecting the process, namely corpus construction issues and character alignment difficulties, both of which affect effectiveness. Apart from generative transliteration, which generates terms based on previously seen word pairs, the discovery of transliteration pairs in existing parallel documents is a helpful method that will also be covered in this research.

    Statistical methods are widely used in transliteration, although the task could also be viewed from linguistic perspectives, which are directly relevant to it because of the habits of speakers of the languages involved. In addition, any multilingual information retrieval or question answering system, and particularly machine translation applications that may need to handle OOV terms, can benefit from adopting automatic transliteration techniques.

  12. Su Nam Kim [presentation slides]

    Multiword expressions (MWEs) are lexical items that can be decomposed into multiple simplex words and display lexical, syntactic and semantic idiosyncrasies (e.g. apple pie, hand in, make a mistake, a piece of cake). MWEs are used in many NLP applications such as machine translation and IR. For search engines, MWEs are an interesting issue because they occur as multiple-word queries. The compositionality or decompositionality of MWEs can provide variations of a given query. For instance, with a compositional MWE as a query, we have to consider the components of the MWE as possible sub-queries, while a non-compositional MWE cannot be split into sub-queries. Also, the semantic relations in MWEs can narrow down the search boundary. From the perspective of query handling, it is worthwhile to study how MWEs can provide efficient queries to search engines.
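
    The sub-query idea can be illustrated with a small sketch. The lexicon below and its compositionality labels are assumptions made for the example (the abstract lists the expressions but does not label them), and `sub_queries` is a hypothetical helper, not part of any existing system.

```python
# Hypothetical lexicon: True marks an expression assumed to be compositional.
MWE_COMPOSITIONAL = {
    "apple pie": True,         # meaning assumed to follow from "apple" and "pie"
    "a piece of cake": False,  # idiomatic, so it is kept as a single unit
}

def sub_queries(query):
    """Return the query itself, plus its component words if it is compositional."""
    if MWE_COMPOSITIONAL.get(query.lower(), False):
        return [query] + query.split()
    # Non-compositional or unknown expressions are not split.
    return [query]

expanded = sub_queries("apple pie")
kept_whole = sub_queries("a piece of cake")
```

    Here the compositional query also yields its component words as candidate sub-queries, while the idiomatic one is searched only as a whole.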

  13. Andrew Lampert [presentation slides]

    My research focuses on the intelligent coordination of information retrieval, aggregation and delivery, based on reasoning about the context of a user's interaction. In particular, I'm very interested in communication patterns in enterprise email as a means of inducing structure in such data. My current research is focussed on applying speech act theory to email communication, as a means to automatically identify the discourse structure within email conversations. More broadly, I'm interested in the problem of efficiently acquiring information that can be used to characterize conversation structure in any form of textual human-to-human discourse. This structure can then hopefully be exploited to allow more sophisticated (and context-sensitive) search, presentation and summarization of such data.

    There are clear points of interaction with many other disciplines, including:

    • Human-Computer Interaction: How can we usefully present information about discourse structure to end-users?
    • Conversational and Discourse Analysis: What analyses can be applied to understand the structure and patterns in text-based computer-mediated communication? How can such analysis help facilitate more sophisticated search and presentation of such data?
    • Information Retrieval: How might discourse structure be used to guide or adapt the retrieval of relevant information?
    • Data Mining: What data mining tools and approaches can be used to analyse and learn about discourse structures within large data sets?
    • Natural Language Processing: What NLP tools and techniques can be used to help process textual human-to-human discourse in order to apply discourse and conversational analysis techniques?

  14. Jose Lay [presentation slides]

    Multimedia Information Retrieval by Artificial Languages

    This work deals with semantic retrieval of multimedia. In addition to verbal (natural) language, information in a multimedia document is also communicated by using non-verbal languages. For instance, to make sense of a solitaire game video, the audience will need to know the set of cards and the rules of how the game is played.

    In this workshop, we show how retrieval of non-verbally expressed information in a multimedia document is better performed by indexing documents with the vocabulary elements of the non-verbal language and by expressing queries in terms of those vocabulary elements. In so doing, a great number of semantic queries can be supported through post-indexing coordination of the vocabulary elements.

  15. Yuval Marom [presentation slides]

    My research involves response automation where a user expresses a problem or inquiry that is longer and more involved than a typical search-engine query, and therefore the required technology goes beyond conventional question-answering systems. The particular application I am currently investigating is an email-based helpdesk for computer-related problems. Although the underlying problems that trigger the inquiries to the help-desk revolve around a relatively small set of issues, there is a high textual variability in the emails due to the way that customers express themselves. Further, it is often very difficult to pin-point what the actual query or question is, due to the fact that customers provide background information that obscures the question, and sometimes customers even omit the question and expect it to be inferred from the background information. To make things even more difficult, customers' emails are often ungrammatical and poorly structured. Therefore, domains such as this one face an important challenge of coming up with sophisticated representations with which to build models of users' inquiries that lead to useful models of response generation.

    The SRI call should encourage researchers to investigate "user queries" that go beyond the traditional ones. People are now well and truly reliant on search technologies for addressing their informational needs, and will start to expect such technologies to deal with more complex inquiries, such as the email-based ones seen in help-desks. Our research is showing that such technologies certainly need to have a deep understanding of the users' inquiries and, further, of the users' intent. This may involve building user models that are based on demographic factors as well as cognitive ones such as interest and expectation. This kind of research can therefore benefit from interaction with fields such as cognitive science and HCI.

  16. David Martinez [presentation slides]

    My main interest in this area lies in the processing of multi-document discourses (e.g. newsgroup-style data streams) for information delivery. These information sources present important challenges for simple term matching methods, and require new approaches that take into account the structure of the data. The new tools would perform a linguistic analysis of the texts and obtain a conceptual representation of the segmented data streams, linking them to a factoid-based summary that can be easily accessed by the user.

    I am also interested in the application of semantic similarity measures in IR for improved recall. For this we would rely on Language Technology tools that provide paraphrase identification and word similarity scores. This knowledge could be integrated in a query-document similarity formula for better ranking of the documents.

  17. Robert McArthur [presentation slides]

    A trend in information retrieval is towards moulding the search in view of the individual user. Information about the individual, the user context, is needed to do this. I have been working in the area of the geometric representation of meaning derived from the analysis of the textual communication of individuals. This interdisciplinary area involves sociology, cognitive psychology, information science and linguistics; the underlying theory is pragmatic socio-cognitive semantics.

    The user context derives from modelling context-sensitive associations and inferences that humans easily perform, and suggests associations in context that we would make were we not epistemically challenged. These associations and inferences may be generic, as well as specific to an individual or community. Such associations are necessary for exploratory search, which emphasises serendipity, learning and investigation, as context about the user is critical in this more uncertain search paradigm. The research has potential within the new areas of "social computing" and the "science of identity".

  18. Diego Molla-Aliod [presentation slides]

    Question answering is about accepting questions written in plain English and finding the answer by searching through unedited text documents. Our project, AnswerFinder, combines various technologies to identify the sentence containing the answer and extract the exact answer. Currently we are studying the use of document retrieval, question classification, and finding the answer to fact-based questions, definition questions, and questions that require a summary extracted from several documents.

    A principal technique used in AnswerFinder is the representation of questions and text sentences in a graph with concepts interrelated and the use of machine-learning methods combined with Graph Theory to determine if a sentence contains the answer and extract the answer. By applying machine learning the system can be ported to other domains and other languages. Furthermore, I am exploring the application of these machine learning techniques at various levels of sentence representation ranging from semantic networks to syntactic dependencies to word neighbourhood.

    As for a burning issue to handle in the future, I would mention question answering for restricted domains. These domains may be rich in lexical or ontological resources but they may not have enough data to warrant the use of general (redundancy-based) methods.

  19. Scott Nowson [presentation slides]

    My background is in cognitive science and natural language, and I am moving into more concrete language technologies. My PhD thesis concerned a study of language use in online journals (personal weblogs) as it relates to the personality and gender of the author. This involved working with language beyond mere function words or syntactic categories, and I used dictionaries of statistically-derived, psychologically-related definitions to explore individual relationships with, for example, Extroversion. I am interested in how we can use information about an author to develop our understanding of a text and vice versa. In my current work, I will be looking more specifically at summarisation and information extraction, using business reports as a data source. This may be a far less personal genre of text, but similar principles can be applied when looking to understand text.

    Content analysis is a core approach within information extraction, but looking beyond specific content into the language used can enrich work in the area. Specifically, by looking at categories of words such as those described above, and by bringing more psychological meaning to language, we can bring analysis a step closer to the way humans may approach language processing. I believe my experience in this area, coupled with my desire to become more involved in the community surrounding my current field, provides me with much to both bring to and take away from this workshop.

    Search has evolved considerably in recent years: from finding all relevant information, through sorting by relevance, removing duplicates and so on, until we now have an incredible amount of data to search through. Searching by content is no longer enough to reduce the search space. The semantic web and tagging are one approach. Another, the one I am most interested in, is search by style, not just content. One application of my thesis work is to be able to search documents by the personality of the author. When searching for film reviews, it might be nice to find out what people with similar dispositions to yourself thought of a movie. Market research looking for opinions of products might be more focused if concentrated only on the weblogs of the key demographics, for example.

  20. Cecile Paris, Nathalie Colineau, and Ross Wilkinson [presentation slides]

    In our work, we are looking into developing information systems capable of providing users with information appropriate for their needs and delivering it so that it is both understandable and useful to them. Clearly, search is an integral part of such systems, and we in fact believe that these systems are the next-generation search technology.

    Our approach so far to achieve this aim has been to combine information retrieval and natural language generation technologies. We now believe that it is not enough to simply put together a search engine and a discourse generation engine. Instead, these technologies must be truly integrated and enhanced with notions of context.

    This raises a number of issues, including: how do we contextualise search behaviour, which implies asking questions of context definition and acquisition; how do we identify which aspects of contextual information are most useful to guide or constrain search and delivery (can we design various experiments to do this, and how?); which aspects of the resulting approach and system do we evaluate, and how?

    We believe these questions would benefit from being approached from perspectives outside our own (NLP and LT), in particular: information retrieval, human-computer interaction, data mining, and conversational and discourse analysis.

  21. Luiz Augusto Pizzato

    My PhD research focuses on defining a framework for information retrieval that incorporates relational information in text such as bigrams and syntactic dependencies, and the relations between arguments and predicates. Since the framework is able to incorporate many linguistically-oriented features, its validation will be conducted using an IR-dependent task which is linguistically demanding, such as Question Answering.

  22. Brett Powley [presentation slides]

    Citation Analysis and Next-Generation Search

    My primary research interest is extraction and computational analysis of citation information from corpora of academic papers. Analysis of citations can potentially tell us a variety of useful information about papers and the relationships between them. There are three levels on which we can analyse citations. Analysis of an individual citation and its context can tell us the reason for the citation, and therefore the relationship between the citing and cited work. Analysis of collections of citations from a particular work can provide an overview of how that work places itself with respect to other works. Analysis of collections of citations to a paper -- or in other words, what other researchers say about that paper -- can provide useful information about how other researchers use and view the work. Analysis of interdocument relationships using citations is a potential source of useful information on which to base tools for search and navigation of corpora of academic documents. In particular, analysis of what other people say about a document may provide more useful search terms than anything in the document itself.

    The important issues in my research centre on analysing the sentences containing citations and their context in order to classify citation function, and on how to use citing sentences for applications such as document summarization and search. I am particularly interested in how citing sentences can be viewed as part of a discourse, and how discourse analysis can offer insights into how we can determine the context and meaning of a citation. Similarly, I am interested in exploring the relationship between formal semantics and computational linguistics, and how we can feasibly extract and represent semantic information from citing sentences. More broadly, I would be interested in exploring how inter-document relationships could improve document navigation and search, and how citations and other reporting sentences could be the basis for such relationships.

  23. Hong Liang Qiao [presentation slides]

    Lexxe Search Engine - A 3rd Generation Internet Search Engine Powered by Advanced Natural Language Technology

    The philosophy upon which Lexxe search engine is built is different to the traditional 2nd generation search engines, in that the object of information processing is language rather than symbols.

    Given this different starting point, which marks the change of generation, a 3rd generation search engine has to understand natural language to a certain extent. Even in the case of "Keyword" searches, a "one formula fits all" method adopted by most of the current search engines, including Google, has become very questionable. What is even worse is the introduction of "popular link" (PageRank) by Google, which simply proved the incompetence of such "one formula". The "popular link" factor itself is a very speculative approach towards information retrieval. However, the fundamental problem with the 2nd generation search method is the general approach of "Symbolic Computing", while the 3rd generation search technology employs a "Linguistic Computing" approach. Since webpages are mostly made up of language texts, we want to argue that it is most natural to assume that search is a linguistic computing activity.

    Lexxe search engine has four major features using Natural Language technology: 1) Keyword-based search with phrase recognition; 2) Short Question Answering trying to find exact answers directly in the webpages; 3) Clusters generated on the fly offering themes and categorization of the result pages, and 4) Irrelevant pages screening.

    Although Lexxe is still in its Alpha Version development, it has already attracted a considerable number of users through the showcase of its innovative technology. Many comments have been made on the Internet about Lexxe. They offer many insightful suggestions, which Lexxe will take very seriously in its further development.

  24. Falk Nicolas Scholer [presentation slides]

    Research interests: query log analysis; retrieval models; interactive searching; evaluation of IR systems; query performance prediction.

    Issues/points of interaction:

    Evaluation of search systems: Ongoing research has demonstrated that there is no correlation between the most widely-used IR performance metrics and actual user search performance. New metrics that reflect users and their search behaviour should therefore be developed. To do this, the underlying assumptions of current IR evaluation need to be re-visited. This could include the direct observation and analysis of information-seeking behaviour in an online search environment, exploring how users learn while conducting searches, identifying the different types of search tasks that users actually engage in, and accounting for factors such as interface design. This type of work would benefit from the involvement of many different research areas, including the cognitive sciences, HCI and IR.

    Query performance prediction: If the effectiveness of a query can be estimated in advance, a search system could respond dynamically to situations where it is likely that a poor set of answers will be returned. Most predictors use statistical information about the distribution of terms within documents and a collection. Alternative sources of evidence that could help to improve prediction might include natural language and semantic features of query terms.
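
    As an illustration of a predictor built on term-distribution statistics, the following minimal sketch computes the average inverse document frequency (IDF) of the query terms. This particular predictor, the smoothing used, and the toy collection are assumptions chosen for illustration; they are not taken from the presentation.

```python
import math
from collections import Counter


def average_idf(query_terms, documents):
    """Pre-retrieval predictor: mean smoothed IDF of the query terms.

    A higher value suggests more specific query terms. This is an
    illustrative choice of predictor, not the presenter's method.
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    # Add-one smoothing keeps the logarithm defined for unseen terms.
    idfs = [
        math.log((n_docs + 1) / (doc_freq[term.lower()] + 1))
        for term in query_terms
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0


# Hypothetical toy collection.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell sharply",
]
specific = average_idf(["stock", "market"], docs)
vague = average_idf(["the", "cat"], docs)
```

    On this toy collection the query made of rarer terms scores higher than the one made of common terms; whether such a score actually tracks retrieval effectiveness is the empirical question the abstract raises.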

  25. Milad Shokouhi and Justin Zobel [presentation slides]

    Collection and Document Summarization In Distributed Text Retrieval Resources

    In distributed information retrieval (DIR) systems, users search multiple collections simultaneously by submitting their query to a single interface known as the broker. In a typical distributed retrieval scenario, first the broker compares the entered queries with available collection summaries. Then a few top collections with the most similar summaries are selected by the broker. Next, the query is passed to the selected collections and they return their results usually with a short snippet for each answer. Finally the returned documents from the selected collections are merged and presented to the user.
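
    The four broker steps described above can be sketched as follows. This is a toy illustration only: the term-overlap similarity, the function names, and the example collections are all assumptions, and real brokers use far more refined selection and merging methods.

```python
from collections import Counter


def overlap(query, text):
    """Count query-term occurrences in text (a deliberately crude similarity)."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def broker_search(query, collections, summaries, top_k=2):
    # Step 1: compare the query with each collection summary.
    ranked = sorted(
        summaries,
        key=lambda name: overlap(query, summaries[name]),
        reverse=True,
    )
    # Step 2: select the few collections with the most similar summaries.
    selected = ranked[:top_k]
    # Step 3: each selected collection returns its matching documents.
    results = []
    for name in selected:
        for doc in collections[name]:
            score = overlap(query, doc)
            if score > 0:
                results.append((score, name, doc))
    # Step 4: merge the per-collection results into a single ranked list.
    results.sort(key=lambda item: item[0], reverse=True)
    return selected, results


# Hypothetical collections and hand-written summaries.
collections = {
    "news": ["election results announced", "market rally continues"],
    "medicine": ["clinical trial results for asthma", "asthma treatment guidelines"],
    "sport": ["cricket match results"],
}
summaries = {
    "news": "election market politics",
    "medicine": "asthma clinical treatment trial",
    "sport": "cricket football match",
}
selected, merged = broker_search("asthma treatment", collections, summaries, top_k=1)
```

    Here the summaries are written by hand; producing them automatically, and merging on snippets rather than full documents, are the places where summarization techniques would come in.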

    Text-summarization techniques can be helpful in two of the discussed steps. To provide a summary for each collection, current techniques use a few sampled documents or a short description that has to be provided manually. Effective collection summaries improve the collection selection performance. In addition, document summaries (snippets) returned by the selected collections can be used for merging.

    The question is how linguistic techniques can be used to produce effective summaries for both collections and documents.

  26. Andrew Smith [presentation slides]

    Bridging the Gap between What is Known and What is Unknown

    Consider a hypothetical universe of possible facts which are derivable from some text data set, and a hypothetical information seeker. Some of these facts may be known by the seeker, other facts may be known only by other individuals who have contributed to the data set, and other facts are not known by any individual contributor.

    Current search capability often focusses on retrieving a document which approximately matches a statement of some fact provided by the seeker - the query. If the seeker simply wants to retrieve a document they already believe exists, then the task reduces down to human recall: trying to remember what distinctive elements of the item would be best to use in the query (1).

    Beyond plain retrieval, the seeker may wish to find other facts which are unknown to them, but which are explicitly stated in some documents. This task is normally performed by formulating the best known query and browsing through retrieved documents which approximately match that query. Obviously, the success of this depends on the initial level of knowledge of the seeker and their ability to formulate a query with appropriate vocabulary and specificity (2). Query refinement is an important method for the seeker here.

    If the seeker wishes to go further and discover facts which are not known by any text contributor, but are derivable from the data, the only option currently is for the seeker to absorb the retrieved documents and synthesize the information using their own cognitive capabilities (3).

    Our research addresses problems (1), (2), and (3) above, by the provision of an Analyst Support System which employs unsupervised machine learning to construct a semantic map of the text data set. This mapping system employs both distributed and symbolic representations, and objects, attributes, and relationships in its model of the knowledge. This system, called Leximancer, addresses these three problems as follows: concept building by seeded thesaurus discovery to enhance recall while maintaining precision, concept profiling to reveal the semantic context of a target concept or a document set, and abductive reasoning in semantic space to discover hypotheses not present in any one document.

  27. Brad Starkie [presentation slides]

    Starkie Enterprises - Next Generation Search Research

    Research interests pertinent to workshop theme: Natural Language Processing and Language Technology, Information Retrieval, Formal Semantics, Formal Syntax and Morphology.

    This presentation will briefly describe the question-answering product being developed at Starkie Enterprises. The presentation will describe the underlying philosophy of the research, the parsing technique & the knowledge representation used. Starkie Enterprises have developed a new method of inferring ultra-fast robust parsers from corpora ideally suited to parsing billions of web pages. A very brief introduction to the knowledge representation being investigated will be presented, along with an explanation of how it is ideally suited to the task of information retrieval, and how the information can be used by automated reasoning systems.

  28. Nicola Stokes [presentation slides]

    Research Interests:
    I'm interested in the development of robust linguistic analysis techniques (e.g. Lexical Cohesion Analysis, Textual Entailment and Paraphrase Identification, Toponym Resolution) for use in NLP and IR applications such as Ad hoc Retrieval, Text Summarisation, Question Answering and Text Classification. I am currently involved in the NICTA I2D2 project which is investigating the use of NLP techniques to enhance Geospatial Information Retrieval, i.e. improving retrieval results for queries containing references to place names.

    Abstract Title: A plea for more analysis and less metric tuning
    Recent results from research initiatives such as TREC and DUC suggest that as the required quality of a user response increases, IR techniques greatly benefit from NLP, e.g. changing a ranked list of passages to a factoid based answer. In contrast, little or no progress has been made in the application of NLP techniques to general ad hoc retrieval tasks. However, NLP may have a "niche" role to play in improving retrieval results for specific query types such as geospatial queries. Hence, it is the responsibility of the IR community to provide a more detailed failure analysis of their TREC-style experiments, i.e. which query types are they performing poorly on and why.

    Similarly, the NLP community can be criticised for focusing too much on automatic metric scores and not enough on detailed analysis. As more and more open source NLP components are made available, researchers are building large pipeline architectures. Many of these systems are credited with improving QA and Summarisation performance over baseline approaches such as bag-of-words. However, as these systems usually comprise a conglomerate of NLP components, it is unclear which of the NLP techniques contributed most to the resultant improvement. If NLP researchers knew which components needed further development then additional gains may be possible. This suggests that we need to augment current application-based evaluation methodologies with component level analysis at different points in the NLP system pipeline.

  29. Paul Thomas [presentation slides]

    Future search tools will have to work with the tremendous range of online information available to users, including the entire Web but also corporate sources such as subscription services or databases and personal sources such as calendars or email archives. The standard centrally-indexed model of search will not work in these situations, which argues for a "metasearch" or distributed model; however this raises many challenges. How can we determine which of the possible data sources, as different as email archives or online databases, might have information the user needs? How can we rank, cluster, or otherwise display results from such different sources? Can we learn something about individual users, and if so can we generalise from this to other users? And how can we evaluate tools we might build?

    We have a prototype personal metasearch tool, which provides a testbed for experiments in these areas, and can report some preliminary results in using simple language models to select data sources and rank results.

  30. Yohannes Tsegay (RMIT), Andrew Turpin (RMIT), and Dave Hawking (CSIRO) [presentation slides]

    Snippet Extraction for Web Search Engine Result Lists

    Internet search engines attempt to present small summaries, or snippets, of retrieved documents in their result lists. Though considerable work has been done to improve the effectiveness and efficiency of retrieving relevant documents, little work has been done on snippet extraction.

    Existing search engines perform snippet generation by processing each retrieved document in its entirety from beginning to end, extracting and scoring sentences that contain query words. By re-ordering sentences in documents so that sentences more likely to appear in snippets are at the front of the document, the process can be made more efficient. Moreover, "bad" sentences can be omitted from the locally stored version of documents altogether. Our research focuses on this, and other techniques, for making snippet generation more efficient.
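
    The baseline process described above, scanning a whole document and scoring the sentences that contain query words, might be sketched as below. The sentence splitter, the scoring rule (number of distinct query words matched), and the example text are assumptions made for illustration; this is not the authors' implementation, and it does not include their sentence re-ordering.

```python
import re


def extract_snippet(document, query, max_sentences=2):
    """Return a snippet made of the sentences that best match the query."""
    query_words = set(query.lower().split())
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()
    ]
    scored = []
    for position, sentence in enumerate(sentences):
        words = set(re.findall(r"\w+", sentence.lower()))
        score = len(query_words & words)
        if score > 0:
            scored.append((score, position, sentence))
    # Highest score first; earlier sentences win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Present the chosen sentences in their original document order.
    best = sorted(scored[:max_sentences], key=lambda item: item[1])
    return " ... ".join(sentence for _, _, sentence in best)


# Hypothetical document text.
doc = (
    "Koalas are marsupials native to Australia. "
    "They sleep for most of the day. "
    "Koalas eat eucalyptus leaves. "
    "Contact us for more information."
)
snippet = extract_snippet(doc, "koalas eat")
```

    Every sentence is visited on every query in this sketch, which is the cost that re-ordering sentences and dropping "bad" ones from the stored copy aims to reduce.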

  31. Andrew Turpin [presentation slides]

    Dynamic Relevance Criteria During Search

    Andrew Turpin (RMIT)

    Most current information retrieval (IR) experiments follow the Cranfield model where there is a fixed collection of documents and queries, and a fixed set of relevance judgments for each query. That is, for each query there is a list of documents that have been judged on a (typically binary) scale between relevant and irrelevant to that query. However, in reality users change their relevance criteria as they read documents and learn more about a topic on which they are searching.

    In particular, when people read a typical results list of the type returned by Google, Yahoo, etc, their idea about what is relevant and what is not may change. This in turn will affect which link they select to view. This undermines the traditional IR evaluation methodology which assumes a fixed relevance judgment for each document a priori. I am interested in designing simple experiments to begin to characterise this behaviour, and in turn designing a methodology for comparing IR systems that will allow for dynamic relevance judgements. I will need help from other people at this forum.

  32. Sandra Uitdenbogerd [presentation slides]

    Readability-based search

    Most text retrieval research is based on the assumption that a user is looking for documents that are "about" a topic, or are a "known" website. However, not all searches are of that nature. A small body of research examines the problem of searching according to linguistic criteria. The typical user is assumed to be someone who is operating in or learning a second or foreign language.

    Our current work in the above field has examined the idea of retrieving text based on its readability as a foreign language. Most readability measures use simple statistics based on sentence length and estimates of vocabulary difficulty. We have found that the frequency of cognates has a small effect on perceived readability for foreign language learners. We are exploring techniques for automatically identifying cognates in text.
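
    To make the "simple statistics" concrete, the sketch below computes two such features: average sentence length and the proportion of words outside a list of known vocabulary. The word list, the English-only tokenisation, and the function name are assumptions for illustration; the sketch does not model cognates and is not the measure used in our work.

```python
import re


def readability_features(text, known_words):
    """Compute two simple readability statistics for a text.

    known_words is a hypothetical set of vocabulary the reader is assumed
    to know; words outside it stand in for "difficult" vocabulary.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "unknown_ratio": 0.0}
    unknown = [w for w in words if w not in known_words]
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "unknown_ratio": len(unknown) / len(words),
    }


# Hypothetical learner vocabulary and example texts.
known = {"the", "cat", "sat", "on", "mat", "a", "dog", "ran"}
easy = readability_features("The cat sat on the mat. A dog ran.", known)
hard = readability_features("The feline reposed upon the ornamental tapestry.", known)
```

    A cognate-aware measure could, for example, treat cognates as known vocabulary when computing the second feature, given a method for identifying them.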

  33. Yefeng Wang [presentation slides]

    I'm working on projects that use natural language processing to process medical notes, such as pathology reports and clinical notes. Medical reports and clinical notes contain a lot of noise, such as weak grammar and term variations or misspellings in core medical terms. These poor-quality texts are difficult for general information systems to understand. However, most analyses of medical and clinical notes require deep understanding of the text. We are interested in studying such noisy and minimally grammatical texts. The project aims to use NLP techniques to extract useful information from unstructured clinical notes. As a first step, we focused on studying medical terminologies. We built a system that indexes medical concepts in free text into a medical terminology (SNOMED CT) and transforms them into a knowledge representation. We developed a lexical-based mapping algorithm and enhanced it with NLP techniques. The system is able to identify negation and qualification of concepts in medical notes.

    Medical terms in text convey a large amount of knowledge used in communication between clinicians. However, medical terminology has a lot of variation, for example different preferred terms and synonyms. It is believed that coded information can improve information retrieval. Using a standard terminology or ontology to support the encoding of clinical information is critical. Indexing user queries and medical concepts in text is a challenging problem. Another important issue is the standardisation of the terminology. The terminologies used in different countries, hospitals, and sections within hospitals differ, which can cause problems in indexing medical concepts. There is a need to map different terminologies into a standardised terminology that supports information systems. However, it is impossible to express all possible terms in one terminology. Term composition and decomposition is a way to construct meaningful terms and relationships for absent terms.

  34. Cecile Paris, Dave Hawking and Ross Wilkinson [presentation slides]

    "Bang for Buck" -- what context improves retrieval, and is it worth it?

    A key paper by Ragnar Nordlie in SIGIR '99, '"User revealment" -- a comparison of initial queries and ensuing question development in online searching and in human reference interactions', demonstrated that context was discovered by reference librarians, and that this was the key to their success. At SIGIR '05, a workshop on context revealed just how many elements of context could be discovered. But which should be captured? The cost of capturing them all is beyond humans and machines, and the likely marginal benefit gradually reduces as we capture more context.

    Language technologies have been most typically applied to content, rather than context, and yet context can be "milked" in quite simple ways, quite powerfully. For example, a query of the form "When..." rather than "What..." can trigger quite different retrieval responses and has been routinely used to considerable effect in QA systems. This simple example shows that language technologies can be useful tools for inferring context that is not typically available to bag-of-words approaches, where the contextual clues may well be treated as stop words in a query. Proceeding more deeply, and looking not at a single query but at a dialogue, each step in the dialogue can be used as a context revealment opportunity. Context can also improve the delivery of information -- context is used heavily in language generation, and in other work we look at improving retrieval delivery in the same way.
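
    The "When..." versus "What..." example can be sketched very simply: the leading question word, which a bag-of-words system might discard as a stop word, is mapped to an expected answer type. The mapping table and its type labels are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from a leading question word to an expected answer type.
ANSWER_TYPES = {
    "when": "date",
    "where": "location",
    "who": "person",
    "what": "thing",
}


def expected_answer_type(query):
    """Infer a coarse answer type from the first word of the query."""
    words = query.strip().lower().split()
    first = words[0] if words else ""
    return ANSWER_TYPES.get(first, "unknown")


when_type = expected_answer_type("When was the workshop held?")
what_type = expected_answer_type("What is a metasearch engine?")
```

    A retrieval system could use the inferred type to favour, say, passages containing dates for the first query; the cost to the user of revealing this piece of context is zero, since it is already in the query.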

    We are interested in experiments that compare the cost of capture of context versus the benefit of that context -- there is typically a very high cost to the user if they have to provide explicit contextual disambiguation -- can we use language technologies to reduce the cost and increase the benefit of knowing the context?

  35. Yitao Zhang [presentation slides]

    This project aims to help clinical decision making by providing the most up-to-date and relevant knowledge from fast-growing on-line repositories of clinical findings like MEDLINE and BioMed Central. We are trying to retrieve the most similar instances of case reports according to the unique clinical profile of the current patient, including symptoms, laboratory test results, treatments, and health histories. There are two important issues to be addressed in the project. Firstly, an information extraction system is required to identify text segments that contain information about the clinical profiles of patients. The extracted text can be further normalised into template-like forms which provide an intermediate level of knowledge of clinical data. Secondly, retrieval of the most relevant cases is based more on a combination of detailed clinical profiles than on simple keyword-based searching.

    We are trying to build a system that exploits full research articles rather than only abstracts, which generally ignore detailed clinical histories of the diagnostic and treatment process, such as readings of laboratory tests and descriptions of the patient's symptoms. As a preliminary attempt, a novel mark-up tag set has been proposed to cover a wide variety of semantics in patient case reports. We have created a manually annotated corpus of 75 journal articles and 5,117 sentences as the starting point for training our information extraction system. The next step of our project is to normalise the extracted text and to find a better way of matching relevant cases with similar clinical profiles.

  36. Ying Zhou [presentation slides]

    With the emergence of social websites such as Flickr and del.icio.us, average Internet users are becoming more and more active in contributing and sharing information on the web. Various social networks of Internet users form and evolve around those social websites. Investigating the effects of social networks on Internet users' information retrieval and contributing behaviour may shed light on the current search technology. Many social websites provide information regarding most popular tags, most recent news, and recommendation lists based on friends' preferences, yet there is no clear evidence on how such information may influence users' search behaviour.

    I am planning to conduct a series of controlled laboratory experiments to observe and study users' information retrieval behaviour and the effect of explicit social network information. The research needs to draw on theories from individual decision making and social cognitive science to understand the decision process of choosing the keywords and ranking the results. The rapid adoption of web publishing tools like wikis and social software enables many Internet users to play both the information provider and consumer roles. Understanding the mental process of those users and the social influence on the process may provide guidelines for managing and searching the information contributed mainly by average Internet users.

  37. Justin Zobel [presentation slides]

    Directions for Search with Rich Language Processing

    Web-based search engines solve a narrow problem extremely well. With diverse sources of evidence (vast quantities of data, anchor text and link evidence, past queries, and implicit user judgements), the great majority of searches are accurately resolved using bag-of-words methods. Such success now extends to tasks such as image search and cross-lingual retrieval.

    However, there are many search-related tasks that could well benefit from richer approaches to text processing. These include question answering; query-based fact extraction; authorial support; searching tailored to reading level; directed clustering; identity disambiguation; and querying of data streams such as speech transcripts. Such investigations could concern data from the open web, but more profitably might focus on better-managed data such as that from a digital library, institutional website, or other special-purpose resource. Challenges include how to undertake such search tasks and how, algorithmically, to provide them efficiently on a large scale.

    The ongoing theme of our research has been algorithmic mechanisms for efficient, effective search. In this context we have over the last decade investigated a wide range of search enhancements such as use of phrases for collection browsing, spelling correction, bioinformatic search, cross-lingual information retrieval, speech retrieval, and authorship attribution. Techniques based on rich language processing are a natural extension to this work.