Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko and Jonas Beskow
All from KTH Royal Institute of Technology
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset that is able to generate both speech and full-body gestures together from text input.
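Architecturally, the system chains the two components: the input text is first rendered as spontaneous-sounding speech, and the generated audio then drives the probabilistic gesture model, so speech and motion are time-aligned by construction. The Python sketch below illustrates this composition under stated assumptions; all class names, method signatures, and constants are hypothetical placeholders for illustration, not the paper's implementation.

# A minimal sketch of the text -> speech -> gesture pipeline, assuming a
# TTS model trained on spontaneous speech and a probabilistic speech-driven
# gesture model. Every class, method, and constant below is a hypothetical
# placeholder; it is not the paper's actual code or API.

import numpy as np

SAMPLE_RATE = 22050  # assumed audio sample rate (Hz)
MOTION_FPS = 20      # assumed motion frame rate (frames per second)

class SpontaneousTTS:
    """Stand-in for a text-to-speech model trained on unscripted speech."""

    def synthesise(self, text: str) -> np.ndarray:
        # A real model would return a generated waveform; this stub returns
        # silence, with a rough per-word duration heuristic for the length.
        duration_s = 0.4 * max(1, len(text.split()))
        return np.zeros(int(duration_s * SAMPLE_RATE), dtype=np.float32)

class SpeechDrivenGestureModel:
    """Stand-in for a probabilistic model sampling 3D pose from speech audio."""

    def synthesise(self, audio: np.ndarray) -> np.ndarray:
        # A real model would sample a pose sequence conditioned on acoustic
        # features extracted from `audio`; this stub returns a zero motion
        # clip whose duration matches the audio, keeping the streams aligned.
        n_frames = int(len(audio) / SAMPLE_RATE * MOTION_FPS)
        n_channels = 45  # e.g. 15 joints x 3 rotation channels (assumed)
        return np.zeros((n_frames, n_channels), dtype=np.float32)

def text_to_speech_and_gesture(text: str) -> tuple[np.ndarray, np.ndarray]:
    """Generate speech and a time-aligned full-body gesture sequence from text."""
    audio = SpontaneousTTS().synthesise(text)               # text -> speech
    motion = SpeechDrivenGestureModel().synthesise(audio)   # speech -> motion
    return audio, motion

if __name__ == "__main__":
    audio, motion = text_to_speech_and_gesture("Hello! I can speak and gesture.")
    print(audio.shape, motion.shape)  # e.g. (52920,) (48, 45)

The point of the sketch is the data flow: the gesture model conditions on the synthesised audio rather than on the input text, which is what lets the generated motion follow the timing of the generated speech.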
Video of the system presenting itself:
The paper is available at https://doi.org/10.1145/3383652.3423874.
@inproceedings{alexanderson2020generating,
  author    = {Alexanderson, Simon and Sz\'{e}kely, \'{E}va and Henter, Gustav Eje and Kucherenko, Taras and Beskow, Jonas},
  title     = {Generating Coherent Spontaneous Speech and Gesture from Text},
  year      = {2020},
  isbn      = {9781450375863},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3383652.3423874},
  doi       = {10.1145/3383652.3423874},
  booktitle = {Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents},
  articleno = {1},
  numpages  = {3},
  location  = {Virtual Event, Scotland, UK},
  series    = {IVA '20}
}