Learning the building blocks of speech - Features

One-line summary:
Use image recognition techniques to recognize speech, with the aim of doing so as well as deep learning does.

Research question:
Can we use Gabor filters to extract building blocks of speech?

Human speech is unique in that it uses a small set of basic speech sounds and recombines them into an unlimited number of utterances. Interestingly, the basic set of speech sounds and the rules for combining them are different for every language. Humans can learn this effortlessly, but no other animal appears to be able to do this (Yip, 2006).
The proposed project is part of a larger project to understand which cognitive mechanisms have evolved to deal with learning complex speech. Part of that larger project consists of building computer models that learn, in order to investigate from the bottom up which mechanisms are needed to solve the problem of learning speech.

Outline of the project:
In standard speech recognition, speech is first analysed in terms of short-term spectral features (usually based on Mel-Frequency Cepstral Coefficients, MFCCs), and sequences in these features are then recognized using statistical methods (usually based on Hidden Markov Models). As these methods reach their limits, it has become clear that proper recognition requires looking at longer stretches of speech. This is one of the things that deep learning models do indirectly. The proposed project consists of trying to use features that look at larger stretches of speech directly. Such features consider the spectral and temporal dimensions simultaneously, in effect converting the signal into a 2-D image and using techniques from image recognition (for instance Gabor filters; Ezzat et al., 2007) to find patterns. This project will explore such 2-D speech features.
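
To make the idea of a 2-D speech feature concrete, the following is a minimal sketch of a Gabor filter applied to a time-frequency image. It is an illustration under stated assumptions, not the method of Ezzat et al. (2007) or of this project: the function names (gabor_kernel, filter_response, mean_energy, toy_glide), the parameter values, and the synthetic "spectrogram" are all hypothetical, and only the real (cosine-phase) part of the filter is used. It relies on the Python standard library alone so that it runs anywhere; a real implementation would more likely use an array library.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma):
    """Real (cosine-phase) 2-D Gabor kernel as a list of rows.

    Rows index the frequency offset, columns the time offset.
    theta is the direction of the sinusoidal carrier in the
    time-frequency plane; sigma is the width of the Gaussian envelope.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a centre")
    half = size // 2
    kernel = []
    for df in range(-half, half + 1):
        row = []
        for dt in range(-half, half + 1):
            envelope = math.exp(-(dt * dt + df * df) / (2.0 * sigma ** 2))
            phase = 2.0 * math.pi * (
                dt * math.cos(theta) + df * math.sin(theta)
            ) / wavelength
            row.append(envelope * math.cos(phase))
        kernel.append(row)
    # Subtract the mean so that a flat region of the image gives zero response.
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def filter_response(image, kernel):
    """Correlate the kernel with the image over the fully overlapping region."""
    k = len(kernel)
    n_freq, n_time = len(image), len(image[0])
    out = []
    for f in range(n_freq - k + 1):
        out_row = []
        for t in range(n_time - k + 1):
            acc = 0.0
            for i in range(k):
                image_row = image[f + i]
                kernel_row = kernel[i]
                for j in range(k):
                    acc += kernel_row[j] * image_row[t + j]
            out_row.append(acc)
        out.append(out_row)
    return out


def mean_energy(response):
    """Mean squared filter output, a crude measure of how well a filter matches."""
    values = [v * v for row in response for v in row]
    return sum(values) / len(values)


def toy_glide(n_freq, n_time, wavelength, theta):
    """Synthetic stand-in for a spectrogram: parallel ridges at a fixed slope."""
    return [
        [
            math.cos(
                2.0 * math.pi
                * (t * math.cos(theta) + f * math.sin(theta)) / wavelength
            )
            for t in range(n_time)
        ]
        for f in range(n_freq)
    ]


# Hypothetical toy data: 40 frequency bins by 40 time frames.
image = toy_glide(40, 40, wavelength=8.0, theta=math.pi / 4)

# One filter aligned with the ridges, one perpendicular to them.
matched = gabor_kernel(21, wavelength=8.0, theta=math.pi / 4, sigma=4.0)
mismatched = gabor_kernel(21, wavelength=8.0, theta=3 * math.pi / 4, sigma=4.0)

e_matched = mean_energy(filter_response(image, matched))
e_mismatched = mean_energy(filter_response(image, mismatched))
print("matched filter energy:   ", e_matched)
print("mismatched filter energy:", e_mismatched)
```

The filter whose orientation and wavelength match the ridges responds much more strongly than the perpendicular one. This selectivity is what would let a bank of such filters, at several orientations and scales, pick out patterns that extend over both time and frequency.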

The student:
We are looking for a student who wants to do some serious programming, but who is also able to design and run an experiment with the models. As the basic questions have to do with human cognition and language, an interest in these topics is important as well.

Ezzat, T., Bouvrie, J. V., & Poggio, T. (2007). Spectro-temporal analysis of speech using 2-D Gabor filters. Presented at Interspeech (pp. 506–509).
Yip, M. J. (2006). The search for phonology in other species. Trends in Cognitive Sciences, 10(10), 442–446.

Please contact Bart de Boer for more information regarding this project.