LeVI (learning virtual infant) - Vocal tract model demo.

by Heikki Rasilo (Aalto University + Vrije Universiteit Brussel)

Older versions of Safari block the demo. It is currently tested mainly with Chrome; smartphones should work as well. Sometimes the vocal tract is not drawn. If that happens, refresh the page and try again.

Drag the dots around and listen to the resulting vocalization.

Update August 2024: A publication on the imitation feature has been accepted to VIHAR 2024. See the paper here. The actor and critic network architectures are here.
Update June 2024: Improved imitation performance. Bug fix in articulatory trajectory calculation.
Update May 2024: Possibility to imitate recorded input, using a TensorFlow.js model trained to invert audio to articulation. It is not very accurate for now, but long vowels and rhythm come out OK. Improvements to come.
Update June 2022: Change the pitch, and start random movements of the vocal tract parameters.
Update May 2022: Start continuous glottal excitation with the button and babble dynamic sounds. Thanks to Yannick Jadoul for making a clickless audio-loop work!

This vocal tract model was developed in MATLAB and converted to C++ using MATLAB Coder. The C++ code was then turned into WebAssembly to run in the browser. Documentation of how to do this conversion can be found here.

The purpose of this demo was to see how such code can easily be run interactively in a web browser.

Thanks to Yannick Jadoul, Bart de Boer, and Nick Verlinden for helping with some problems during the conversion.

Publications where this model has been used:

Feedback and imitation by a caregiver guides a virtual infant to learn native phonemes and the skill of speech inversion
Heikki Rasilo, Okko Räsänen, Unto K. Laine
Speech Communication, 55(9), 2013

In this study, two agents equipped with the vocal tract model interact. The "caregiver" model knows the Finnish phonetic system (as preprogrammed articulations in the model) and is thus able to evaluate productions by an infant model (hence the name LeVI, Learning Virtual Infant). The infant model starts from zero: it begins babbling and adjusts its vocalizations based on feedback and imitations by the caregiver model. Over the course of learning, the infant model learns to imitate and recognize the phonetic system of the caregiver.

An online model for vowel imitation learning
Heikki Rasilo, Okko Räsänen
Speech Communication 86, 2017

Here the infant model explores its articulatory space and produces static vowel sounds. Human participants are asked to classify the produced vowels into Finnish vowel categories. Later, participants are asked to respond vocally to vowels produced by the model, using Finnish words that include the perceived vowel but also additional phonemes. The publication describes an algorithm that allows the infant agent to learn to imitate the human participants' vowel sounds despite the ambiguous and noisy learning environment.

Discovering Articulatory Speech Targets from Synthesized Random Babble
Heikki Rasilo, Yannick Jadoul
Interspeech 2020, 3715-3719

This paper presents a method to explore the vast 9-dimensional articulatory space and find articulations that produce a wide variety of different sounds. For example, the vowel /i/ is found only in a relatively small part of the full articulatory space (try to find it with the model!) compared to vowel sounds such as /ae/. Random sampling of the 9 parameters therefore finds e.g. the vowel /i/ very rarely. The paper introduces an algorithm that speeds up exploration of the articulatory space and covers not only static vowels but also dynamic consonant sounds.
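
To get a feel for why random sampling rarely lands on a sound like /i/, here is a rough illustration in Python. It is not the algorithm from the paper. It assumes, purely for illustration, that a target sound occupies an axis-aligned box covering the same fraction of each of the 9 parameter ranges, and that each parameter is sampled uniformly from a range normalized to [0, 1). The function names and the example fractions are hypothetical.

```python
import random

N_PARAMS = 9  # number of articulatory parameters, as stated in the text


def hit_probability(fraction_per_param: float, n_params: int = N_PARAMS) -> float:
    """Probability that one uniform random sample lands inside a box that
    spans `fraction_per_param` of every parameter's range (an assumption
    made for illustration, not a property of the real vocal tract model)."""
    return fraction_per_param ** n_params


def estimate_hit_rate(fraction_per_param: float, n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability by actually drawing
    random 9-dimensional parameter vectors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # A sample counts as a hit only if every parameter falls inside the
        # target interval [0, fraction_per_param) of its normalized range.
        if all(rng.random() < fraction_per_param for _ in range(N_PARAMS)):
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    # Hypothetical region sizes: a "large" and a "small" target region.
    for fraction in (0.8, 0.3):
        exact = hit_probability(fraction)
        estimate = estimate_hit_rate(fraction, n_samples=100_000)
        print(f"fraction={fraction}: exact={exact:.6f}, estimated={estimate:.6f}")
```

Under these assumptions the hit probability shrinks as the ninth power of the per-parameter fraction, so even a moderately narrow target region is almost never reached by uniform sampling. This is the kind of inefficiency a guided exploration method aims to avoid.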

Phonemic learning based on articulatory-acoustic speech representations
Heikki Rasilo
Cognitive Science conference 2020

Human children learn to understand speech and to speak (physically produce articulations) simultaneously. Some studies and theories suggest that knowledge of speech production affects our perception of speech as well. In this study I train a speech recognizer on a Finnish speech database, and compare its performance to a system where both the articulation (represented by this model) and acoustic models for speech sounds are learned simultaneously. This study indicates that articulatory learning can shape representations of speech sounds so that they lead to better speech recognition performance as well.