Learning deep speech representations for phonetics research
Overview
The speech signal is a rich source of information that conveys not only linguistic but also extra- and paralinguistic information, such as the speaker's identity, gender, emotional state, age, or social status. However, these traits are hidden in complex, non-transparent variations of the speech signal and remain largely opaque to speech research. Given recent progress in speech synthesis and voice conversion driven by deep learning, we argue that synthesized speech can become a valuable tool for research in phonetics. The overarching goal of this project is thus to explore the potential of deep generative modeling of speech as a tool to support basic research in phonetics. To constrain the task, we will not consider the synthesis of stimuli from text, but concentrate on the dedicated manipulation of speech to generate new speech signals with desired properties. The goal is to develop generative models whose latent-variable representation of the speech signal
- is compact and informative about the observed speech signal,
- represents different sources of variation of the speech signal by different dimensions of the representation,
- allows a dedicated manipulation of a phonetic cue along phonetically plausible dimensions, and
- is amenable to human interpretation.
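The kind of latent-space manipulation described above can be sketched with a toy example. The snippet below is purely illustrative and makes no claim about the project's actual models: it uses an untrained linear encoder/decoder pair as a stand-in for a deep generative speech model, and a hypothetical latent axis (`cue_direction`) standing in for a dimension that a real model would have to learn to associate with a phonetic cue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "autoencoder": a hypothetical stand-in for a deep
# generative speech model. W_enc / W_dec are random, not trained.
n_features, n_latent = 16, 4
W_enc = rng.standard_normal((n_latent, n_features)) * 0.1
W_dec = np.linalg.pinv(W_enc)  # decoder approximately inverts the encoder

def encode(x):
    return W_enc @ x

def decode(z):
    return W_dec @ z

# Hypothetical latent direction assumed to encode one phonetic cue
# (e.g. vowel height); in a real model such a direction would have to
# be identified, e.g. by inspecting latent traversals.
cue_direction = np.eye(n_latent)[0]

x = rng.standard_normal(n_features)  # stand-in for one speech frame
z = encode(x)

# Shift the cue while leaving the other latent dimensions untouched,
# then decode back into the signal domain.
z_shifted = z + 1.5 * cue_direction
x_new = decode(z_shifted)

# Only the component along cue_direction has changed in latent space.
print(np.allclose(z_shifted - z, 1.5 * cue_direction))  # True
```

The point of the sketch is the workflow, not the model: encode a signal, move along one interpretable latent dimension, and decode, so that a single phonetic property is varied while everything else is held constant.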
Key Facts
- Grant Number: 446378607
- Project duration: 04/2021 - 12/2024
- Funded by: DFG