Keynote Speakers

Speech Technology Route from Alchemy to Science

Hynek Hermansky

Abstract: Successful technological applications often push the development of relevant scientific disciplines. Just as the invention of the steam engine drove advances in understanding thermodynamics and the invention of telephony yielded advances in understanding hearing, the current successful applications of machine recognition of speech should follow the trend. The large amounts of speech data that are becoming available represent a trove of knowledge to be exploited. After all, linguistic messages in human speech are formed in a way to be easily decoded by the human cognitive system, so the successful machines designed for extraction of these messages could reveal important knowledge about hearing for those who care to look for it. The talk looks briefly at the history of attempts to recognize speech by machines, points to some instances where speech recognition technology might have taken wrong turns, and points to some current trends in successful machine designs which suggest opportunities for further advancements in our understanding of information extraction from speech.

Bio: Hynek Hermansky (F'01, SM'92, M'83, SM'78) received the Dr. Eng. degree from the University of Tokyo and the Dipl. Ing. degree from Brno University of Technology, Czech Republic. He is the Julian S. Smith Professor of Electrical Engineering and the Director of the Center for Language and Speech Processing at the Johns Hopkins University in Baltimore, Maryland. He is also a Research Professor at the Brno University of Technology, Czech Republic. He is a Life Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and a Fellow of the International Speech Communication Association (ISCA). He was twice an elected Member of the Board of ISCA and was a Distinguished Lecturer for ISCA and for IEEE. He is the recipient of the 2013 ISCA Medal for Scientific Achievement and the 2020 IEEE James L. Flanagan Speech and Audio Processing Award.

A Holistic Representation Toward Integrative AI

Xuedong David Huang

Abstract: Microsoft is on a quest to advance AI beyond existing techniques by taking a more holistic, human-centric approach to learning and understanding. Our unique perspective is to explore the relationship among three attributes of human cognition: monolingual text (X), audio or visual sensory signals (Y), and multilingual (Z). At the intersection of all three is XYZ-code -- a joint representation to create more powerful AI that can speak, hear, see, and understand humans better. XYZ-code will enable the fulfillment of a long-term vision: cross-domain transfer learning, spanning modalities and languages. The goal is to have pretrained AI models that can jointly learn representations to support a broad range of downstream AI tasks, much in the way humans do today. With recent human performance on benchmarks in conversational speech recognition, machine translation, conversational question answering, machine reading comprehension, and image captioning, we have seen strong signals toward the more ambitious aspiration to produce a leap in AI capabilities, with multisensory and multilingual learning that is closer in line with how humans learn and understand. XYZ-code is a foundational component of this aspiration, if grounded with external knowledge sources in the downstream AI tasks.

Bio: Xuedong David Huang is a Microsoft Technical Fellow and Chief Technology Officer overseeing Azure AI Cognitive Services engineering and research, covering Microsoft's four AI pillars (Computer Vision, Speech, Natural Language, and Decision). With his strong passion for technology, innovation, and social responsibility, he helps to turn the dream of making machines see, hear, speak, and understand human beings into a reality.

He joined Microsoft to found the company's speech technology group in 1993. He helped to bring speech technology to the mass market by introducing Windows SAPI in 1995, Speech Server in 2004, and Azure Speech in 2015. He served as General Manager for MSR Incubation and as Chief Architect for Bing and Ads from 2004 to 2015. He helped Microsoft bring AI to a wide range of products and achieve multiple historic AI milestones along the way, including a human parity milestone in speech recognition in 2016, a human parity milestone in machine translation in 2018, and a human parity milestone in conversational QnA in 2019. He has held a variety of responsibilities to advance the AI stack, from deep learning infrastructure to enabling new AI experiences.

He has been recognized with the Allen Newell research excellence leadership medal (1992), the IEEE Best Paper Award (1993), the Asian American Engineer of the Year (2011), Wired Magazine's 25 Geniuses (2016), and AI World's Top 10 (2017). He co-authored two books: Hidden Markov Models for Speech Recognition, and Spoken Language Processing. He holds hundreds of patents that have impacted billions of customers via Microsoft and third-party products and services.

He was on the faculty of Carnegie Mellon University from 1989 to 1993. He received his PhD, MS, and BS from the University of Edinburgh, Tsinghua University, and Hunan University, respectively. He is a Fellow of the IEEE and the ACM.

A Broad Perspective on Self-Supervised Learning for Speech Recognition

Bhuvana Ramabhadran

Abstract: Supervised learning has been used extensively in speech and language processing over the years. However, as the demands for annotated data increase, the process becomes expensive, time-consuming, and prone to inconsistencies and bias in the annotations. To take advantage of the vast quantities of unlabeled data, semi-supervised and unsupervised learning have been used extensively in the literature. Self-supervised learning, first introduced in the field of computer vision, refers to frameworks that learn labels or targets from the unlabeled input signal. In other words, self-supervised learning makes use of proxy supervised learning tasks, such as contrastive learning, to identify specific parts of the signal that carry information, thereby helping the neural network model to learn robust representations. Recently, self-supervised approaches for speech and audio processing have begun to gain popularity. These approaches cover a broad spectrum of training strategies that utilize unpaired text, unpaired audio, and partial annotations. These training methods include unsupervised data generation, masking and reconstruction to learn invariant representations, task-agnostic representations, incorporation of multi-task learning, and inclusion of other forms of contextual information. Thus, self-supervised learning allows for the efficient use of unlabeled data (audio and text) for speech processing. The first part of this talk will provide an overview of the successful self-supervised approaches in speech recognition. The second part will focus on methods to learn consistent representations from data through augmentation and regularization, consistent predictions from synthesized speech, and learning from unlabeled text and audio.

Bio: Bhuvana Ramabhadran (IEEE Fellow 2017, ISCA Fellow 2017) currently leads a team of researchers at Google, focusing on semi-supervised learning for speech recognition and multilingual speech recognition. Previously, she was a Distinguished Research Staff Member and Manager in IBM Research AI, at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM's worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She has served as an elected member of the IEEE SPS Speech and Language Technical Committee (SLTC) for two terms since 2010 and as its elected Vice Chair and Chair (2014–2016). She has served as an Area Chair for ICASSP (2011–2018) and on the IEEE SPS conference board from 2017 to 2018, during which she also served as the conference board's liaison with the ICASSP organizing committees. She served on the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011–2015). She currently serves as the Chair of the IEEE Flanagan Award Committee and as the Regional Director-At-Large for Region 6 in the IEEE SPS Board of Governors, where she has been working on coordinating the work from all the associated IEEE chapters. She serves on the International Speech Communication Association (ISCA) board and has served as an area chair for the Interspeech conference since 2012. In addition to organizing several workshops at ICML, HLT-NAACL, and NeurIPS, she has also served as an adjunct professor at Columbia University, where she co-taught a graduate course on speech recognition. She has served as the (Co-)Principal Investigator on several projects funded by the National Science Foundation, the EU, and IARPA, spanning speech recognition, information retrieval from spoken archives, and keyword spotting in many languages. She has published over 150 papers and been granted over 40 U.S. patents.
Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on understanding neural networks and the use of speech synthesis to improve core speech recognition performance.