Speech Technology

Speech technology is an area of research that covers both the analysis and the implementation of applications involving human speech. Mainstream work in speech technology can be classified into spoken language processing and signal processing. Spoken language processing is the discipline most concerned with the production, perception, description, learning and evolution of human speech. Signal processing provides the tools needed to obtain adequate analytical descriptions of speech and of the speech signal.

Research in speech technology covers diverse areas including:

  • Speech Science: Physiology and Production

  • Speech Analysis, Processing and Enhancement

  • Spoken Language Systems: Recognition and Synthesis

  • Speech Coding and Voice Quality Evaluation

  • Spoken Language Modeling and Understanding

Research in speech production investigates and models the physiology of the human speech apparatus. This includes studies of the vocal folds, x-ray microbeam observation of laryngeal control, modelling of the control mechanism of vocal cord vibration, and modelling of articulatory motion in the oral cavity.

Speech analysis covers methods for analysing speech based on acoustic and articulatory features, as well as the perceptual analysis of speaker identity.

Speech synthesis systems are usually grouped into two classes: concept-to-speech systems and text-to-speech (TTS) systems. Concept-to-speech systems synthesize speech starting from semantic or pragmatic concepts and have full knowledge of the purpose and meaning of the intended utterances, for example when responding to user queries in automatic dialog systems. TTS systems are used as aids for people with vision or speech impairment, in automatic information services such as the reading of weather forecasts or news, and in telephony systems.

Speech coding is important in telecommunication services such as audio conferencing and mobile communication. Speech coding is the compression of speech for transmission, using speech codecs that apply audio signal processing and speech processing techniques. Most speech coding techniques exploit knowledge of psychoacoustics to transmit only the data that is relevant to the human auditory system.
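
As a rough illustration of coding that follows the behaviour of human hearing, the sketch below applies mu-law companding, which spends more quantization levels on quiet samples than on loud ones. It is a minimal example and not a complete codec: the function names are chosen here for illustration, samples are assumed to be floats in the range -1 to 1, and the value 255 for mu is the one commonly used in 8-bit telephony. Practical codecs use a segmented approximation of this curve and many further techniques.

```python
import math

MU = 255  # assumed companding constant (common in 8-bit telephony)


def mu_law_encode(x, mu=MU):
    """Compress a sample in [-1, 1] and quantize it to an integer in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1) / 2 * mu))


def mu_law_decode(code, mu=MU):
    """Map an integer code in [0, mu] back to an approximate sample in [-1, 1]."""
    y = 2 * code / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


if __name__ == "__main__":
    for sample in (0.01, 0.1, 0.9):
        code = mu_law_encode(sample)
        print(sample, code, mu_law_decode(code))
```

Because the curve is logarithmic, the reconstruction error for a quiet sample such as 0.01 is much smaller than the error that uniform 8-bit quantization would give, at the cost of a larger absolute error on loud samples.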

Speech recognition technology has been deployed in diverse applications, including commercial products (cochlear implants, dictation systems, call centre operations) and the telecommunication industry. Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a sequence of words. The recognized words can be the final output or can serve as input to further linguistic processing. Current research focuses on improving the robustness of recognition systems in adverse environments. A closely related area of study is speaker recognition and authentication.

Speech enhancement is concerned with reducing or suppressing noise, either to improve the perceived quality of speech for a human listener or to improve the signal for use by other speech processing algorithms. Beam-forming techniques using microphone arrays have been used extensively for this purpose.
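
One simple form of beam-forming is the delay-and-sum approach: each microphone signal is shifted so that the target speech lines up across channels, and the aligned channels are averaged. The sketch below is a minimal illustration under strong assumptions: the arrival delays are known, they are whole numbers of samples, and the noise at each microphone is independent. The helper names and the simulated signals are hypothetical and exist only to exercise the idea.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: equal-length lists of samples, one list per microphone.
    delays:   delays[i] is how many samples later the target reaches mic i.
    """
    usable = len(channels[0]) - max(delays)
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(usable)
    ]


def simulate_array(clean, delays, noise_std, rng):
    """Hypothetical test data: each mic hears `clean` delayed, plus its own noise."""
    channels = []
    for d in delays:
        delayed = [0.0] * d + clean[:len(clean) - d]
        channels.append([s + rng.gauss(0.0, noise_std) for s in delayed])
    return channels


def mean_square_error(estimate, reference):
    """Average squared difference over the length of `estimate`."""
    return sum((x - y) ** 2 for x, y in zip(estimate, reference)) / len(estimate)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(2 * math.pi * t / 40) for t in range(2000)]
    delays = [0, 3, 6, 9]
    mics = simulate_array(clean, delays, 0.5, rng)
    enhanced = delay_and_sum(mics, delays)
    print("single microphone error:", mean_square_error(mics[0], clean))
    print("beam-formed error:      ", mean_square_error(enhanced, clean))
```

Averaging keeps the aligned speech at full strength while the independent noise partially cancels, so the error of the combined output should be lower than that of any single microphone. Real arrays must also estimate the delays and handle fractional-sample shifts, which this sketch leaves out.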

Language models help a speech recognizer estimate the likelihood of a word sequence, independent of the acoustics. This allows the system to choose the right hypothesis when two different sentences sound the same. Language modelling is useful for speech recognition, handwriting recognition and spelling correction. The basic principle of spoken language understanding (SLU) is to link linguistic expressions to concrete real-world entities. An SLU system is needed to interpret utterances in context and carry out appropriate actions. Dialog management is the central component that communicates with the application and with SLU modules such as discourse analysis, sentence interpretation, and response generation.
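
To make the language-modelling point concrete, the sketch below scores two similar-sounding word sequences with a bigram model, one of the simplest kinds of language model. Everything in it is an assumption made for illustration: the four-sentence training corpus is invented, the function names are hypothetical, and add-one smoothing is used only to avoid zero probabilities. Practical systems are trained on far larger corpora with more careful smoothing.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count word contexts and word pairs, with sentence start/end markers."""
    contexts, pairs = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words[1:])
        contexts.update(words[:-1])
        pairs.update(zip(words, words[1:]))
    return contexts, pairs, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    contexts, pairs, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((pairs[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )


# Invented toy corpus, for illustration only.
CORPUS = [
    "it is hard to recognize speech",
    "we recognize speech with a computer",
    "they walk on a nice beach",
    "the storm may wreck a boat",
]

if __name__ == "__main__":
    model = train_bigram(CORPUS)
    for hypothesis in ("it is hard to recognize speech",
                       "it is hard to wreck a nice beach"):
        print(hypothesis, log_prob(hypothesis, model))
```

With this corpus the first hypothesis receives the higher score, so a recognizer that combined this score with its acoustic evidence would prefer it even if the two sequences sounded alike.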

It can be expected that over the next decade, major efforts will be launched to produce competing products and improve the robustness and quality of speech technology software. There will be developments in combining disparate components of speech technology with other modes of human-computer interaction to form a unified, consistent computing environment. In addition, an emphasis will be placed on multilingual systems for the international market.

Recommended References

Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, “Spoken Language Processing: A Guide to Theory, Algorithm and System Development,” Prentice Hall, 2001.

Ben Gold and Nelson Morgan, “Speech and Audio Signal Processing: Processing and Perception of Speech and Music,” John Wiley and Sons, 2000.

Recommended Journals

Speech Communication, IEEE Transactions on Speech and Audio Processing

Summary Written By

Toh Aik Ming

Centre for Intelligent Information Processing Systems (CIIPS)

School of Electrical, Electronic and Computer Engineering

The University of Western Australia.