
Generating coherent spontaneous speech and gesture from text

Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko and Jonas Beskow

All from KTH Royal Institute of Technology


Embodied human communication encompasses both verbal information (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both types of data. On the speech side, text-to-speech systems can now generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system, trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input.


Video of the system presenting itself:


The paper is available here.


    author = {Alexanderson, Simon and Sz\'{e}kely, \'{E}va and Henter, Gustav Eje and Kucherenko, Taras and Beskow, Jonas},
    title = {Generating Coherent Spontaneous Speech and Gesture from Text},
    year = {2020},
    isbn = {9781450375863},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3383652.3423874},
    doi = {10.1145/3383652.3423874},
    booktitle = {Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents},
    articleno = {1},
    numpages = {3},
    location = {Virtual Event, Scotland, UK},
    series = {IVA '20}