Prof Johan Sundberg
Department of Speech, Music and Hearing, KTH Stockholm
jsu@csc.kth.se
Time: Wed 9.00am
Venue: Common Room
Music can be regarded as acoustic communication with aesthetic and emotional qualities. One of its most remarkable properties is its universal impact; almost everyone in the world seeks out opportunities to listen to music. This suggests that, for some reason, the acoustic code used in musical communication is commonly understood. This presentation will focus on observations on singers’ communication based on experiences from attempts to synthesise singing. If merely the nominal information given by the score is converted into singing, a performance emerges that evokes the disturbing impression that the singer is not interested at all in the song performed. This indicates the importance of the performers’ contributions to music communication. Some examples will be given of what is being communicated in music and how this communication is achieved.
One example of singers’ communication concerns phrasing. Like speech, traditional music has a hierarchical structure. Individual notes belong together, forming motifs, which belong together, forming subphrases, which belong together, forming phrases. Performers mark this structure by speeding up the tempo at the beginning of a phrase and slowing it down at the end. In this way they enhance the structure, and the code used to mark phrases in music performances is similar to the code used for the same purpose in speech. Another example of what singers communicate concerns the phenomenon of emphasis, which also occurs in both music and speech. In instrumental music it is closely related to expectedness, such that unexpected tones are emphasised. Singers also tend to emphasise semantically important words in the lyrics. Again, striking code similarities can be found between singing and speech. Similarity is another factor relevant to singers’ communication. For example, it seems important that a singer’s voice shows similarities with the speech of a person feeling tenderness when the lyrics have an ambiance of tenderness. The importance of such similarities to singers’ communication is demonstrated with a case where the similarity goes in the wrong direction.
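The phrasing pattern described above (faster into the phrase, slower out of it) can be illustrated with a toy tempo curve. This is only a sketch under stated assumptions, not the rule used in the synthesis work: the function name `phrase_arch_tempo`, the parabolic shape, and the `depth` parameter are all hypothetical choices made for illustration.

```python
def phrase_arch_tempo(n_notes, base_tempo=100.0, depth=0.1):
    """Return a per-note tempo (beats per minute) for one phrase.

    Assumed shape (hypothetical): a parabola that is slowest at the
    phrase boundaries and fastest in the middle, so the tempo rises
    through the start of the phrase and falls toward its end.
    """
    if n_notes < 2:
        return [base_tempo] * n_notes
    tempi = []
    for i in range(n_notes):
        x = i / (n_notes - 1)            # position in the phrase, 0..1
        arch = 4.0 * x * (1.0 - x)       # 0 at both ends, 1 at the midpoint
        # Scale between base*(1 - depth) at the ends and base*(1 + depth) mid-phrase.
        tempi.append(base_tempo * (1.0 + depth * (2.0 * arch - 1.0)))
    return tempi


if __name__ == "__main__":
    for note_index, bpm in enumerate(phrase_arch_tempo(5)):
        print(note_index, round(bpm, 1))
```

A hierarchical version could, in principle, add such arches at the subphrase and phrase levels, but the shape and size of each arch would have to come from performance data.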
Faster than the speed of sight: Temporal constraints on perception and cognition
Dr Geoff Stuart
Department of Psychology
University of Melbourne; Defence Science and Technology Organisation (DSTO)
Geoff.Stuart@dsto.defence.gov.au
Time: Thurs 9.10am
Venue: McDonald Wing
A remarkable property of human perception and cognition is that they appear to happen in “real time”. Modern neuroscience tells us that an enormous amount of processing occurs between the sense organs and the higher centres in the brain that allow us to perceive visual objects, or the meanings conveyed by spoken and written language. Yet subjectively this process is seemingly effortless and instantaneous. Compared to modern digital computers, individual neurons are quite slow in their rate of information transmission, represented as spike trains. According to some theorists, this means that computation in nervous systems must be largely feedforward, and even then it cannot rely on rate codes. Conversely, it has been argued that perception and cognition in many instances rely on top-down or feedback-based processing, in order to solve problems of object binding and segmentation. However, there does not appear to be time for this. Processes of spatial and temporal attention may be crucial to the resolution of this apparent dilemma.
Human-Computer Interaction based on Acoustic Signals, Muscle Movements, and Brainwaves
Dr Tanja Schultz
Advanced Communication Technologies, Language Technologies Institute
Carnegie Mellon University
tanja@cs.cmu.edu
Time: Fri 9.00am
Venue: McDonald Wing
Our modern information society increasingly demands ubiquitous mobile computing systems that allow their users seamless interaction with everybody, everything, everywhere, at any time. Speech-driven interfaces have a lot to offer in this arena since speech is the most natural and efficient human communication modality. We can speak even if our hands and eyes are busy with other tasks and there is no need for bulky keyboards, making speech the ideal input mode for small handheld devices. Unfortunately, so far the effectiveness of currently available mobile digital devices is rather discouraging. It is my belief that this results from two major shortcomings: (1) privacy and robustness issues when it comes to speech recognition in public spaces and under noisy conditions and (2) the lack of situationally aware devices which sense the environment and the users' needs.
In my talk I will present our ongoing research to overcome these challenges. In the first part I will talk about speech-driven interfaces that allow for noise-robust speech input using bone-conducting signal transmission, as well as input methods that capture articulatory muscle movements from electrodes attached to the face of the speaker. I will demonstrate our prototype system that recognizes speech by interpreting the electrical currents resulting from facial muscle contractions. Since this approach does not rely on acoustic signals, we can recognize speech even if it is only mouthed rather than spoken audibly. In the second part of the talk I will present computer interfaces that use brain activity to determine users' needs. Currently, our research in this area focuses on the users' attention, task activity, and cognitive workload. Initial experiments indicate that brain signals can also be applied to recognize speech, even if it is only formulated in our thoughts rather than being spoken. We hope that these research and development efforts will lead to a new generation of human-centered systems, which are completely aware of the users' situation and provide robust input mechanisms, preserving the users' privacy while silently interacting with machines and other humans.
Robust Multimodal Understanding for Interactive Systems
Michael Johnston
AT&T Labs Research, New Jersey
johnston@research.att.com
Time: Thurs 4.30pm
Venue: Common Room
The ongoing convergence of the web with telephony, driven by technologies such as voice over IP, high-speed mobile data networks, and hand-held computers and smartphones, enables widespread deployment of multimodal interfaces which combine graphical user interfaces with natural human input modalities such as speech and pen. In order to support effective multimodal interaction, natural language processing techniques, which have typically been applied to linear sequences of speech or text, need to be extended to support integration and understanding of multimodal language distributed over multiple different simultaneous input modes.
Multimodal grammars (Johnston and Bangalore 2000) combine speech and gesture parsing, integration, and understanding all within a single formalism. Their finite-state implementation enables efficient processing of lattice input from speech and gesture recognition and mutual compensation for errors and ambiguities. However, like other approaches based on hand-crafted rules, multimodal grammars can be brittle with respect to unexpected, erroneous, or disfluent input.
In this talk, I will illustrate and evaluate the use of multimodal grammars to support spoken input combined with complex freehand pen input in the context of a multimodal conversational system, and explore a range of methods for improving their robustness. These include techniques for building effective language models for speech recognition when little or no training data is available and techniques for robust multimodal understanding that draw on classification, machine translation, and sequence edit methods.
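As a rough illustration of what a sequence edit method can contribute to robustness, the sketch below maps an out-of-grammar utterance to the closest sentence a grammar accepts, using word-level edit distance. It is not the method presented in the talk, which works with finite-state representations of lattice input. The helper names `edit_distance` and `closest_in_grammar` and the three example sentences are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_in_grammar(utterance, grammar_sentences):
    """Return the in-grammar sentence with the fewest word edits from the utterance."""
    tokens = utterance.lower().split()
    return min(grammar_sentences,
               key=lambda s: edit_distance(tokens, s.lower().split()))


if __name__ == "__main__":
    # Hypothetical sentences standing in for what a hand-crafted grammar accepts.
    grammar = [
        "show cheap italian restaurants here",
        "phone this restaurant",
        "zoom in here",
    ]
    print(closest_in_grammar("uh show me cheap italian places here", grammar))
```

Enumerating sentences is only feasible for a toy grammar; a practical system would have to compute the closest match against the grammar itself rather than a list.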
On the continuity of perception and action: Evidence from masked priming and kinematic analysis
Dr Matthew Finkbeiner
MACCS, Macquarie University
mfinkbei@maccs.mq.edu.au
Time: Thurs 1.30pm
Venue: McDonald Wing
A central issue in the work on perception and action has to do with the “continuity hypothesis”: is the engagement of the motor system continuous with perceptual processing? Or does one precede the other? I will present a recent study in which participants were asked to point to a red square on the left upon seeing a word like BLOOD and a green square on the right upon seeing a word like SPINACH. Unbeknownst to the participants, the target words were preceded by the prime words “red” or “green”. We found that the curvature of participants’ pointing trajectories was much greater following incongruent primes (green – BLOOD) than it was following congruent or neutral primes, indicating that participants responded initially to the prime and then corrected their response mid-flight. We take these findings to suggest that the processing of masked orthographic information is contiguous with the latest stages of response activation and preparation.
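One simple way to quantify the curvature of a pointing trajectory is the largest perpendicular deviation of the path from the straight line joining its start and end points. The sketch below assumes that measure; the abstract does not say which curvature measure the study used. The function name `max_deviation` and the sample coordinates are hypothetical.

```python
import math


def max_deviation(points):
    """Largest perpendicular distance of any sample from the start-to-end line.

    `points` is a list of (x, y) samples along one pointing movement.
    """
    (x0, y0), (x1, y1) = points[0], points[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        raise ValueError("start and end points coincide")
    return max(
        abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) / length
        for x, y in points
    )


if __name__ == "__main__":
    # Hypothetical samples: a near-straight reach and one that first veers
    # toward the opposite side before correcting.
    direct = [(0.0, 0.0), (-1.0, 1.1), (-2.0, 2.0)]
    corrected = [(0.0, 0.0), (0.8, 0.9), (-2.0, 2.0)]
    print(max_deviation(direct), max_deviation(corrected))
```

Under this measure, a trajectory that starts toward the wrong target and is corrected mid-flight yields a larger value than a direct reach.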
The impact of vocabulary structure on spoken word recognition
Prof Anne Cutler
Max-Planck-Institute for Psycholinguistics
Nijmegen, Netherlands
Anne.Cutler@mpi.nl
Time: Fri 2.00pm
Venue: McDonald Wing
Language-specific differences in the size and distribution of the phonemic repertoire have implications for the task facing listeners in recognising spoken words. A language with more phonemes will allow shorter words and reduced embedding of short words within longer ones, decreasing the potential for spurious lexical competitors to be activated by speech signals. Comparative analyses of the vocabularies of English (many phonemes) and Spanish (fewer phonemes) confirm that this is the case. However, languages have ways of compensating for an increase in spurious embedding. For example, it is known from word-recognition experiments that there are cross-language differences in listeners' use of stress information in lexical recognition; Spanish listeners make more use of stress than English listeners do.
Further vocabulary comparisons suggest that this is because considering stress in spoken-word recognition allows rejection of more unwanted competition from embedded words in Spanish than in English. For Dutch, the word recognition results somewhat surprisingly resemble those from Spanish more than those from English; the solution to this puzzle is again found in the vocabulary statistics, which reveal that the reduction of embeddings resulting from consideration of stress in Dutch more closely resembles the reduction achieved in Spanish than in English. Thus the vocabulary structure of Spanish and Dutch induces listeners to use suprasegmental as well as segmental information in identifying words, while the vocabulary structure of English allows listeners to ignore suprasegmental information in lexical processing.
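The kind of vocabulary statistic referred to above can be sketched as a count of how many shorter words are embedded in longer ones, with and without stress taken into account. The example is a toy illustration only. It assumes a syllable-level representation in which uppercase marks a stressed syllable, and the four-word lexicon is invented rather than drawn from English, Spanish, or Dutch. The analyses described in the abstract use real vocabularies.

```python
def count_embeddings(lexicon, use_stress):
    """Count (short word, longer word) pairs where the short word is embedded.

    Each word is a tuple of syllables; uppercase marks stress (an assumed
    convention). With use_stress=False, stress is ignored when matching.
    """
    def normalise(word):
        return tuple(s if use_stress else s.lower() for s in word)

    words = [normalise(w) for w in lexicon]
    count = 0
    for short in words:
        n = len(short)
        for longer in words:
            if n < len(longer) and any(
                longer[i:i + n] == short
                for i in range(len(longer) - n + 1)
            ):
                count += 1
    return count


if __name__ == "__main__":
    # Invented toy lexicon: two monosyllables and two disyllables that
    # differ only in stress placement.
    toy_lexicon = [("TA",), ("KO",), ("TA", "ko"), ("ta", "KO")]
    print(count_embeddings(toy_lexicon, use_stress=False))
    print(count_embeddings(toy_lexicon, use_stress=True))
```

In this toy lexicon, matching on stress halves the number of embeddings, which is the kind of reduction the vocabulary comparisons measure across languages.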