Biologically inspired methods in speech recognition and synthesis: closing the loop

PhD Thesis, 2016

Trevor Bekolay


Current state-of-the-art approaches to computational speech recognition and synthesis are based on statistical analyses of extremely large data sets. It is currently unknown how these methods relate to the methods that the human brain uses to perceive and produce speech. In this thesis, I present a conceptual model, Sermo, which describes some of the computations that the human brain uses to perceive and produce speech. I then implement three large-scale brain models that accomplish tasks theorized to be required by Sermo, drawing upon techniques in automatic speech recognition, articulatory speech synthesis, and computational neuroscience. The first model extracts features from an audio signal by performing a frequency decomposition with an auditory periphery model, then decorrelating the information in that power spectrum with methods commonly used in audio and image compression. I show that the features produced by this model implemented with biologically plausible spiking neurons can be used to classify phones in pre-segmented speech with significantly better accuracy than the features typically used in automatic speech recognition systems. Additionally, I show that this model can be used to compare auditory periphery models in terms of their ability to support phone classification of pre-segmented speech. The second model uses a symbol-like neural representation of a sequence of syllables to generate a trajectory of premotor commands that can be used to control an articulatory synthesizer. I show that the model can produce trajectories up to several seconds in length from a static syllable sequence representation that result in intelligible synthesized speech. The trajectories reflect the high temporal variability of human speech, and smoothly transition between successive syllables, even in rapid utterances. The third model classifies syllables from a trajectory of premotor commands. I show that the model is able to classify syllables online despite high temporal variability, and can produce the same syllable representations used by the second model. These two models can be connected in future work in order to implement a closed-loop sensorimotor speech system. Unlike current computational approaches, all three of these models are implemented with biologically plausible spiking neurons, which can be simulated with neuromorphic hardware, and can interface naturally with artificial cochleas. All models are shown to scale to the level of adult human vocabularies in terms of the neural resources required, though limitations on their performance as a result of scaling will be discussed.

Full text links


 External link


University of Waterloo
PhD thesis


Plain text