PhD dissertation. — Queensland University of Technology, 2010. — 237 p.
Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In noise-free environments, the word recognition performance of these systems has been shown to approach 100%; however, this performance degrades rapidly as the level of background noise increases.
Speech enhancement is a popular method for making ASR systems more robust. Single-channel spectral subtraction was originally designed to improve human speech intelligibility, and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics assess enhancement performance for intelligibility, not speech recognition, which makes them sub-optimal for ASR applications.
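As a rough illustration of the single-channel spectral subtraction described above, the following minimal sketch applies the textbook power-domain form to one frame. It is not the dissertation's algorithm: the function name `spectral_subtract`, the over-subtraction factor `alpha` and the spectral floor `floor` are assumptions introduced here, and the noise power spectrum is assumed to be already estimated (for example from non-speech frames).

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_power, alpha=1.0, floor=0.01):
    """Textbook power spectral subtraction on one frame (illustrative only).

    noisy_frame : real-valued time-domain samples
    noise_power : estimated noise power per rfft bin (assumed given)
    alpha       : hypothetical over-subtraction factor
    floor       : hypothetical spectral floor, as a fraction of noisy power
    """
    spectrum = np.fft.rfft(noisy_frame)
    noisy_power = np.abs(spectrum) ** 2

    # Subtract the noise estimate, then floor the result so that no bin
    # ends up with negative power.
    clean_power = np.maximum(noisy_power - alpha * noise_power,
                             floor * noisy_power)

    # The noisy phase is reused unchanged, as in conventional
    # magnitude-only spectral subtraction.
    enhanced = np.sqrt(clean_power) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n=len(noisy_frame))
```

With a zero noise estimate the frame is returned unchanged, and with a very large estimate every bin falls back to the floor, which is where signal-based tuning of `alpha` and `floor` would normally take place.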
This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally-efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) introducing phase spectrum information to enable spectral subtraction in the complex frequency domain.
Likelihood-maximisation uses gradient descent to optimise the parameters of the enhancement algorithm to best fit the acoustic speech model, given a word sequence known a priori. Whilst this technique is shown to improve ASR word accuracy, it is also found to be particularly sensitive to non-noise mismatches between the training and testing data.
Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of the clean speech magnitudes which are typically used in ASR feature extraction. Phase Estimation via Delay Projection is proposed, based on the stationarity of sinusoidal signals, and demonstrates the potential to improve ASR word accuracy over a wide range of SNRs.
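The role of phase can be seen from a single frequency bin: for additive noise, |Y|^2 = |X|^2 + |D|^2 + 2|X||D|cos(theta), where theta is the phase difference between speech X and noise D. The sketch below compares magnitude-only subtraction, which drops the cross term, with subtraction in the complex domain. It assumes the noise component is known exactly (an oracle), so it illustrates the motivation only and is not the Phase Estimation via Delay Projection method; the function names are hypothetical.

```python
import cmath
import math

def clean_magnitude_ignoring_phase(noisy, noise_mag):
    """Power-domain subtraction that assumes the cross term is zero."""
    return math.sqrt(max(abs(noisy) ** 2 - noise_mag ** 2, 0.0))

def clean_magnitude_with_phase(noisy, noise):
    """Complex-domain subtraction; exact when the noise bin is known."""
    return abs(noisy - noise)

# One frequency bin: unit-magnitude speech, and noise of magnitude 0.5
# whose phase leads the speech phase by 60 degrees.
speech = cmath.rect(1.0, 0.3)
noise = cmath.rect(0.5, 0.3 + math.pi / 3)
noisy = speech + noise

magnitude_only = clean_magnitude_ignoring_phase(noisy, abs(noise))
complex_domain = clean_magnitude_with_phase(noisy, noise)
```

Here the complex-domain estimate recovers the true speech magnitude of 1.0, whereas the magnitude-only estimate is sqrt(1.5), roughly 1.22, because the cross term 2|X||D|cos(theta) = 0.5 is left in the result.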
Throughout the dissertation, consideration is given to practical implementation in vehicular environments which resulted in two novel contributions – a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware.
The techniques proposed in this dissertation were evaluated using the Australian English In-Car Speech Corpus which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.
Automatic Speech Recognition
Speech Enhancement
ASR Evaluation Databases
Likelihood-Maximising Speech Enhancement for Robust ASR
LIMA Frameworks for In-Car Speech Recognition
The Use of Phase in Spectral Subtraction
FPGA Hardware Implementation of Spectral Subtraction
Conclusions and Future Research
Appendix A: Derivation of the Jacobian Matrix for LIMA-Based Mel-Filterbank Noise Subtraction
Appendix B: Supporting Results
Appendix C: In-Car Speech Data in Changing In-Car Noise Conditions