Abstract: With recent progress in deep learning, there has been increased interest in visually grounded dialogue, which requires an AI agent to hold a meaningful conversation with humans in natural language about content in other modalities, in our case pictures. In this talk, I will discuss techniques for improved context modelling in visually grounded dialogue. We show that different types of context encodings are relevant for different multimodal tasks and datasets. In particular, we focus on response generation for task-based multimodal search and response selection for image-grounded open-domain conversations. For these tasks, I present new models for context encoding, including knowledge grounding, history encoding and context-extended multimodal fusion. However, although these models achieve state-of-the-art results, we also demonstrate several shortcomings of widely used models, tasks and datasets.
Bio: Verena Rieser is a professor at Heriot-Watt University where she leads research on Natural Language Generation and Spoken Dialogue Systems. She is also a co-founder of the Conversational AI company Alana. Verena was recently awarded a Leverhulme Trust Senior Research Fellowship by the Royal Society and is PI of several funded research projects and industry awards.
Verena's team has twice entered the prestigious Amazon Alexa Challenge, finishing among the three finalists in two consecutive years. Verena's current interests include Ethics for Open-domain systems, Data-to-Text and Text-to-Text generation, and Multimodal Dialogue.
Abstract: Voice conversion is a technique for modifying speech waveforms to convert non-/paralinguistic information into any desired form while preserving linguistic content. It has improved dramatically thanks to significant progress in deep learning techniques and significant efforts to develop freely available resources, expanding the possibility of developing various applications beyond traditional speaker conversion. In this talk, I will review recent progress in voice conversion techniques, giving an overview of recent research activities including the Voice Conversion Challenges, and then discuss possible future directions of voice conversion research.
Bio: Tomoki Toda is a Professor at the Information Technology Center at Nagoya University, Japan. He received the B.E. degree from Nagoya University, Japan, in 1999, and the D.E. degree from the Nara Institute of Science and Technology, Japan, in 2003. His research interests include statistical approaches to speech, music, and environmental sound processing. He has served as an Associate Editor of the IEEE Signal Processing Letters since 2016. He was a member of the Speech and Language Technical Committee of the IEEE SPS from 2007 to 2009, and from 2014 to 2016. He received the IEEE SPS 2009 Young Author Best Paper Award and the 2013 EURASIP-ISCA Best Paper Award (Speech Communication Journal).
Copyright © 2019-2021. SLT2021 Organizing Committee. All rights reserved.