IEEE SLT 2021 Online Conference

Jan 21, Thu

You can join a session by clicking the session title. Clicking any paper title will show the abstract and the links to the paper and video.
Only users who registered for SLT 2021 can access the papers, videos, and Zoom sessions; to register, click here.
Session: Multimodal Processing Join Zoom

Chair: Sandro Cumani, Md Sahidullah

Session: Resources and Evaluation Join Zoom

Chair: Benoit Favre, Svetlana Stoyanchev

1189: END-TO-END LIP SYNCHRONISATION BASED ON PATTERN CLASSIFICATION

You Jin Kim, Hee Soo Heo, Soo-Whan Chung, Bong-Jin Lee

Paper and Video (new tab)

The goal of this work is to synchronise audio and video of a talking face using deep neural network models. Existing works have trained networks on proxy tasks such as cross-modal similarity learning, and then computed similarities between audio and video frames using a sliding window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the task. To this end, we propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features, then the inference of the offset can be considered to be a pattern recognition problem where the matrix is considered equivalent to an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.
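
As a rough illustration of the offset-as-pattern-classification idea described above, the sketch below computes a cosine-similarity matrix between per-frame audio and video embeddings and feeds it, as an image, to a small convolutional classifier over candidate offsets. It is a minimal sketch, not the authors' architecture; the front-end networks, dimensions and number of offset classes are placeholders.

# Minimal sketch (not the authors' exact architecture): offset prediction
# from an audio-video similarity matrix treated as an image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetClassifier(nn.Module):
    def __init__(self, num_offsets=31, feat_dim=512):
        super().__init__()
        # placeholders standing in for the per-modality feature extractors
        self.audio_net = nn.Linear(feat_dim, 256)
        self.video_net = nn.Linear(feat_dim, 256)
        self.cls = nn.Sequential(                    # "image" classifier over the matrix
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(16 * 8 * 8, num_offsets),
        )

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, Ta, D), video_feats: (B, Tv, D)
        a = F.normalize(self.audio_net(audio_feats), dim=-1)
        v = F.normalize(self.video_net(video_feats), dim=-1)
        sim = torch.bmm(a, v.transpose(1, 2))        # (B, Ta, Tv) similarity matrix
        return self.cls(sim.unsqueeze(1))            # logits over candidate offsets

logits = OffsetClassifier()(torch.randn(2, 50, 512), torch.randn(2, 50, 512))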

1045: PROTODA: EFFICIENT TRANSFER LEARNING FOR FEW-SHOT INTENT CLASSIFICATION

Manoj Kumar, Varun Kumar, Hadrien Glaude, Cyprien de Lichy, Aman Alok, Rahul Gupta

Paper and Video (new tab)

Practical sequence classification tasks in natural language processing often suffer from low training data availability for target classes. Recent works towards mitigating this problem have focused on transfer learning using embeddings pre-trained on often unrelated tasks, for instance, language modeling. We adopt an alternative approach by transfer learning on an ensemble of related tasks using prototypical networks under the meta-learning paradigm. Using intent classification as a case study, we demonstrate that increasing variability in training tasks can significantly improve classification performance. Further, we apply data augmentation in conjunction with meta-learning to reduce sampling bias. We make use of a conditional generator for data augmentation that is trained directly using the meta-learning objective and simultaneously with prototypical networks, hence ensuring that data augmentation is customized to the task. We explore augmentation in the sentence embedding space as well as the prototypical embedding space. Combining meta-learning with augmentation provides up to 6.49% and 8.53% relative F1-score improvements over the best performing systems in 5-shot and 10-shot learning, respectively.
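
For readers unfamiliar with prototypical networks, the meta-learning backbone of the above, here is a minimal sketch of a single episode: class prototypes are the means of support-set embeddings, and queries are classified by distance to the prototypes. The conditional augmentation generator from the paper is omitted, and all dimensions are illustrative.

# Minimal sketch of one prototypical-network episode; encoder outputs are
# simulated with random tensors, and the augmentation generator is omitted.
import torch
import torch.nn.functional as F

def prototypical_loss(support_emb, support_labels, query_emb, query_labels, n_classes):
    # support_emb: (Ns, D), query_emb: (Nq, D) sentence embeddings from any encoder
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                           # (C, D): one prototype per intent
    dists = torch.cdist(query_emb, prototypes)   # (Nq, C) Euclidean distances
    log_p = F.log_softmax(-dists, dim=1)         # closer prototype -> higher probability
    return F.nll_loss(log_p, query_labels)

# toy 5-way episode with 2 support and 3 query examples per class
support_labels = torch.arange(5).repeat(2)
query_labels = torch.arange(5).repeat(3)
loss = prototypical_loss(torch.randn(10, 64), support_labels,
                         torch.randn(15, 64), query_labels, n_classes=5)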

1017: A NEW DATASET FOR NATURAL LANGUAGE UNDERSTANDING OF EXERCISE LOGS IN A FOOD AND FITNESS SPOKEN DIALOGUE SYSTEM

Maya Epps, Juan Uribe, Mandy Korpusik

Paper and Video (new tab)

Health and fitness are becoming increasingly important in the United States, as illustrated by the 70% of adults in the U.S. who are classified as overweight or obese, as well as globally, where obesity has nearly tripled since 1975. Prior work used convolutional neural networks (CNNs) to understand a spoken sentence describing one's meal, in order to expedite the meal-logging process. However, the system lacked a complementary exercise-logging component. We have created a new dataset of 3,000 natural language exercise-logging sentences. Each token was tagged as Exercise, Feeling, or Other; exercise segments were mapped to the most relevant exercise, and feeling segments to a score of how the user felt on a scale from 1 to 10. We demonstrate the following: for intent detection (i.e., logging a meal or exercise), logistic regression achieves over 99% accuracy on a held-out test set; for semantic tagging, contextual embedding models achieve 93% F1 score, outperforming conditional random field (CRF) models; and recurrent neural networks (RNNs) trained on a multiclass classification task successfully map tagged exercise and feeling segments to database matches. By connecting how the user felt while exercising to the food they ate, in the future we may provide personalized and dynamic diet recommendations.
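
As a toy illustration of the intent-detection step (meal vs. exercise), the sketch below trains a logistic regression classifier over TF-IDF features; the example sentences are invented and are not drawn from the released dataset.

# Toy sketch of meal-vs-exercise intent detection with logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["I ran three miles this morning", "I had a salad for lunch",
             "Did forty minutes of yoga", "I ate two eggs and toast"]
intents = ["exercise", "meal", "exercise", "meal"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(sentences, intents)
print(clf.predict(["I went swimming after dinner"]))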

1204: END-TO-END SILENT SPEECH RECOGNITION WITH ACOUSTIC SENSING

Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao

Paper and Video (new tab)

Silent speech interfaces (SSIs) have been an exciting area of recent interest. In this paper, we present a non-invasive silent speech interface that uses inaudible acoustic signals to capture people's lip movements when they speak. We exploit the speaker and microphone of a smartphone to emit signals and listen to their reflections, respectively. The extracted phase features of these reflections are fed into deep learning networks to recognize speech. We also propose an end-to-end recognition framework that combines a CNN with an attention-based encoder-decoder network. Evaluation results on a limited vocabulary (54 sentences) yield word error rates of 8.4% in speaker-independent and environment-independent settings, and 8.1% for unseen sentence testing.

1085: VIRAAL: VIRTUAL ADVERSARIAL ACTIVE LEARNING FOR NLU

Gregory Senay, Badr Youbi Idrissi, Marine Haziza

Paper and Video (new tab)

This paper presents VirAAL, an Active Learning framework based on Adversarial Training. VirAAL aims to reduce the effort of annotation in Natural Language Understanding (NLU). VirAAL is based on Virtual Adversarial Training (VAT), a semi-supervised approach that regularizes the model through Local Distributional Smoothness: adversarial perturbations are added to the inputs, making the posterior distribution more consistent. Therefore, entropy-based Active Learning becomes robust by querying more informative samples without requiring additional components. The first set of experiments studies the impact of an adapted VAT for joint-NLU tasks within low labeled data regimes. The second set shows the effect of VirAAL in an Active Learning (AL) process. Results demonstrate that VAT is robust even in multi-task training, where the adversarial noise is computed from multiple loss functions. Substantial improvements are observed with entropy-based AL with VirAAL for querying data to annotate. VirAAL is an inexpensive method in terms of AL computation with a positive impact on data sampling. Furthermore, VirAAL reduces annotation in AL by up to 80% and shows improvements over existing data augmentation methods. The code is publicly available.
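
A minimal sketch of the entropy-based query step in such an AL loop is shown below; the VAT-regularized training of the model is assumed to have happened elsewhere, and the function and variable names are illustrative.

# Entropy-based query selection: pick the k most uncertain unlabeled samples.
import torch

def select_for_annotation(model, unlabeled_inputs, k):
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled_inputs), dim=-1)      # (N, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # (N,)
    return torch.topk(entropy, k).indices                           # indices to annotate

# usage (hypothetical names): idx = select_for_annotation(nlu_model, pool_tensor, k=64)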

1067: TOWARDS LARGE-SCALE DATA ANNOTATION OF AUDIO FROM WEARABLES: VALIDATING ZOONIVERSE ANNOTATIONS OF INFANT VOCALIZATION TYPES

Chiara Semenzin, Lisa Hamrick, Amanda Seidl, Bridgette Kelleher, Alejandrina Cristia

Paper and Video (new tab)

Recent developments allow the collection of audio data from lightweight wearable devices, potentially enabling us to study language use from everyday life samples. However, extracting useful information from these data is currently impossible with automated routines, and overly expensive with trained human annotators. We explore a strategy fit for the 21st century, relying on the collaboration of citizen scientists. A large dataset of infant speech was uploaded to a citizen science platform. The same data were annotated in the laboratory by highly trained annotators. We investigate whether crowd-sourced annotations are qualitatively and quantitatively comparable to those produced by expert annotators in a dataset of children at high and low risk for language disorders. Our results reveal that classification of individual vocalizations on Zooniverse was overall moderately accurate compared to the laboratory gold standard. The analysis of descriptors defined at the level of individual children found strong correlations between descriptors derived from Zooniverse versus laboratory annotations.

1374: Speaker-Independent Visual Speech Recognition with the Inception v3 Model

Timothy Israel Santos, Andrew Abel, Nick Wilson, Yan Xu

Paper and Video (new tab)

The natural process of understanding speech involves combining auditory and visual cues. CNN-based lip reading systems have become very popular in recent years. However, many of these systems consider lip reading to be a black-box problem, with limited detailed performance analysis. In this paper, we performed transfer learning by training the Inception v3 CNN model, which has pre-trained weights produced from ImageNet, with the GRID corpus, delivering good speech recognition results, with 0.61 precision, 0.53 recall, and 0.51 F1-score. The lip reading model was able to automatically learn pertinent features, demonstrated using visualisation, and achieve speaker-independent results comparable to human lip readers on the GRID corpus. We also identify limitations that match those of humans, therefore limiting potential deep learning performance in real-world situations.
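
A hedged sketch of this kind of transfer-learning setup, using torchvision's Inception v3 with ImageNet weights and a new classification head (here sized for the 51 GRID word classes); the frozen backbone and head size are assumptions, not necessarily the paper's exact configuration.

# Load ImageNet-pre-trained Inception v3 and swap in a new head for GRID words.
import torch.nn as nn
from torchvision import models

net = models.inception_v3(weights="IMAGENET1K_V1")   # ImageNet pre-trained weights
for p in net.parameters():
    p.requires_grad = False                           # freeze the pre-trained backbone
net.fc = nn.Linear(net.fc.in_features, 51)            # new head (51 GRID word classes, assumed)
net.AuxLogits.fc = nn.Linear(net.AuxLogits.fc.in_features, 51)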

1144: WARPED LANGUAGE MODELS FOR NOISE ROBUST LANGUAGE UNDERSTANDING

Mahdi Namazifar, Gokhan Tur, Dilek Hakkani-Tur

Paper and Video (new tab)

Masked Language Models (MLM) are self-supervised neural networks trained to fill in the blanks in a given sentence with masked tokens. Despite the tremendous success of MLMs for various text-based tasks, they are not robust for spoken language understanding, especially for spontaneous conversational speech recognition noise. In this work we introduce Warped Language Models (WLM), in which input sentences at training time go through the same modifications as in MLM, plus two additional modifications, namely inserting and dropping random tokens. These two modifications extend and contract the sentence in addition to the modifications in MLMs, hence the word “warped” in the name. The insertion and drop modifications of the input text during training of WLM resemble the types of noise due to Automatic Speech Recognition (ASR) errors, and as a result WLMs are likely to be more robust to ASR noise. Through computational results we show that natural language understanding systems built on top of WLMs perform better than those built on top of MLMs, especially in the presence of ASR errors.
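
An illustrative sketch of the warping operation on a token sequence is given below: tokens are masked as in an MLM, and additionally inserted or dropped at random. The probabilities and helper names are assumptions for illustration only.

# Toy "warping" of a token sequence: mask, randomly insert, and randomly drop.
import random

def warp(tokens, mask_token="[MASK]", vocab=None,
         p_mask=0.12, p_drop=0.03, p_insert=0.03):
    vocab = vocab or tokens
    warped = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue                                  # drop: contracts the sentence
        if r < p_drop + p_insert:
            warped.append(random.choice(vocab))       # insert: extends the sentence
        warped.append(mask_token if random.random() < p_mask else tok)
    return warped

print(warp("turn on the kitchen lights please".split()))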

1150: IDEA: AN ITALIAN DYSARTHRIC SPEECH DATABASE

Marco Marini, Mauro Viganò, Massimo Corbo, Marina Zettin, Gloria Simoncini, Bruno Fattori, Clelia D'Anna, Massimiliano Donati, Luca Fanucci

Paper and Video (new tab)

This paper describes IDEA, a database of Italian dysarthric speech produced by 45 speakers affected by 8 different pathologies. Neurologic diagnoses were collected from the subjects' medical records, while dysarthria assessment was conducted by a speech-language pathologist and a neurologist. The total number of recordings is 16,794. The speech material consists of 211 isolated common words recorded with a single condenser microphone. The words, which refer to an ambient assisted living scenario, were selected to cover all Italian phonemes as widely as possible. The recordings, supervised by a speech pathologist, were made with the RECORDIA software, which was developed specifically for this task. It allows multiple recording procedures depending on the patient's severity and includes an electronic record for storing patients' clinical data. All the recordings in IDEA are annotated with a TextGrid file, which defines the boundaries of the speech within the wave file along with other notes about the recording. This paper also includes preliminary experiments on the recorded data to train an automatic speech recognition system from a baseline Kaldi recipe. We trained HMM and DNN models, and the results show WERs of 11.75% and 14.99%, respectively.

1386: LISTEN, LOOK AND DELIBERATE: VISUAL CONTEXT-AWARE SPEECH RECOGNITION USING PRE-TRAINED TEXT-VIDEO REPRESENTATIONS

Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li

Paper and Video (new tab)

In this study, we try to address the problem of leveraging visual signals to improve Automatic Speech Recognition (ASR), also known as visual context-aware ASR (VC-ASR). We explore novel VC-ASR approaches to leverage video and text representations extracted by a self-supervised pre-trained text-video embedding model. Firstly, we propose a multi-stream attention architecture to leverage signals from both audio and video modalities. This architecture consists of separate encoders for the two modalities and a single decoder that attends over them. We show that this architecture is better than fusing modalities at the signal level. Additionally, we explore leveraging the visual information in a second-pass model, also referred to as a 'deliberation model'. The deliberation model accepts audio representations and text hypotheses from the first-pass ASR and combines them with a visual stream for improved visual context-aware recognition. The proposed deliberation scheme can work on top of any well-trained ASR and also enables us to leverage the pre-trained text model to ground the hypotheses with the visual features. Our experiments on the How2 dataset show that the multi-stream and deliberation architectures are very effective at the VC-ASR task. We evaluate the proposed models for two scenarios: a clean audio stream and distorted audio in which we mask out some specific words in the audio. The deliberation model outperforms the multi-stream model and achieves relative WER improvements of 6% and 8.7% for the clean and masked data, respectively, compared to an audio-only model. The deliberation model also improves recovery of the masked words by 59% relative.
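
A minimal sketch of the multi-stream attention idea: a single decoder layer cross-attends over separate audio and video encoder memories and sums the two context vectors. The layer sizes, the fusion-by-sum choice, and the missing layer normalization are simplifications, not the paper's exact design.

# One decoder layer attending over two encoder memories (audio and video).
import torch
import torch.nn as nn

class MultiStreamDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, tgt, audio_mem, video_mem):
        x = tgt + self.self_attn(tgt, tgt, tgt)[0]
        x = x + self.audio_attn(x, audio_mem, audio_mem)[0] \
              + self.video_attn(x, video_mem, video_mem)[0]   # fuse the two streams
        return x + self.ffn(x)

out = MultiStreamDecoderLayer()(torch.randn(2, 10, 256),
                                torch.randn(2, 80, 256), torch.randn(2, 40, 256))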

1395: RNN BASED INCREMENTAL ONLINE SPOKEN LANGUAGE UNDERSTANDING

Prashanth Gurunath Shivakumar, Naveen Kumar, Panayiotis Georgiou, Shrikanth Narayanan

Paper and Video (new tab)

Spoken Language Understanding (SLU) typically comprises an automatic speech recognition (ASR) module followed by a natural language understanding (NLU) module. The two modules process signals in a blocking, sequential fashion, i.e., the NLU often has to wait for the ASR to finish processing an utterance, potentially leading to high latencies that render the spoken interaction less natural. In this paper, we propose recurrent neural network (RNN) based incremental processing towards the SLU task of intent detection. The proposed methodology offers lower latencies than a typical SLU system, without any significant reduction in system accuracy. We introduce and analyze different recurrent neural network architectures for incremental and online processing of ASR transcripts and compare them to existing offline systems. A lexical End-of-Sentence (EOS) detector is proposed for segmenting the transcript stream into sentences for intent classification. Intent detection experiments are conducted on the benchmark ATIS, Snips, and Facebook multilingual task-oriented dialog datasets, modified to emulate a continuous incremental stream of words with no utterance demarcation. We also analyze the prospects of early intent detection, before EOS, with our proposed system.
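
A minimal sketch of the incremental setup: a GRU cell consumes one word at a time, and an intent (plus a lexical EOS decision) is read out after every word, so a prediction is available before the utterance ends. The sizes and the two-head design are illustrative assumptions, not the paper's architecture.

# Word-by-word intent prediction with a GRU cell and an EOS head.
import torch
import torch.nn as nn

class IncrementalIntentRNN(nn.Module):
    def __init__(self, vocab_size, n_intents, emb=100, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRUCell(emb, hid)
        self.intent_head = nn.Linear(hid, n_intents)
        self.eos_head = nn.Linear(hid, 2)            # lexical end-of-sentence yes/no

    def step(self, word_id, h):
        h = self.rnn(self.emb(word_id), h)
        return self.intent_head(h), self.eos_head(h), h

model = IncrementalIntentRNN(vocab_size=1000, n_intents=7)
h = torch.zeros(1, 128)
for w in torch.tensor([[12], [57], [3]]):            # incoming stream of word ids
    intent_logits, eos_logits, h = model.step(w, h)  # a prediction after every word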

1160: EFFICIENT CORPUS DESIGN FOR WAKE-WORD DETECTION

Delowar Hossain, Yoshinao Sato

Paper and Video (new tab)

Wake-word detection is an indispensable technology for preventing virtual voice agents from being unintentionally triggered. Although various neural networks have been proposed for wake-word detection, less attention has been paid to efficient corpus design, which we address in this study. For this purpose, we collected speech data via a crowdsourcing platform and evaluated the performance of several neural networks when different subsets of the corpus were used for training. The results reveal the following requirements for efficient corpus design to produce a lower misdetection rate: (1) short segments of continuous speech can be used as negative samples, but they are not as effective as random words; (2) utterances of "adversarial" words, i.e., words phonetically similar to a wake-word, contribute to improving performance significantly when they are used as negative samples; (3) it is preferable for individual speakers to provide both positive and negative samples; (4) increasing the number of speakers is better than increasing the number of repetitions of a wake-word by each speaker.

1281: ANALYSIS OF MULTIMODAL FEATURES FOR SPEAKING PROFICIENCY SCORING IN AN INTERVIEW DIALOGUE

Mao Saeki, Yoichi Matsuyama, Satoshi Kobashikawa, Tetsuji Ogawa, Tetsunori Kobayashi

Paper and Video (new tab)

This paper analyzes the effectiveness of different modalities in automated speaking proficiency scoring in an online dialogue task of non-native speakers. Conversational competence of a language learner can be assessed through the use of multimodal behaviors such as speech content, prosody, and visual cues. Although lexical and acoustic features have been widely studied, there has been no study on the usage of visual features, such as facial expressions and eye gaze. To build an automated speaking proficiency scoring system using multimodal features, we first constructed an online video interview dataset of 210 Japanese English-learners with annotations of their speaking proficiency. We then examined two approaches for incorporating visual features and compared the effectiveness of each modality. Results show the end-to-end approach with deep neural networks achieves a higher correlation with human scoring than one with handcrafted features. Modalities are effective in the order of lexical, acoustic, and visual features.

1224: A LIGHT TRANSFORMER FOR SPEECH-TO-INTENT APPLICATIONS

Pu Wang, Hugo Van hamme

Paper and Video (new tab)

Spoken language understanding (SLU) systems can make life more agreeable, safer (e.g. in a car) or can increase the independence of physically challenged users. However, due to the many sources of variation in speech, a well-trained system is hard to transfer to other conditions, such as a different language or speech-impaired users. A remedy is to design a user-taught SLU system that can learn fully from scratch from users' demonstrations, which in turn requires that the system's model quickly converges after only a few training samples. In this paper, we propose a light transformer structure using a simplified relative position encoding with the goal of reducing the model size and improving efficiency. The light transformer works as an alternative speech encoder for an existing user-taught multitask SLU system. Experimental results on three datasets with challenging speech conditions show that our approach outperforms the existing system and other state-of-the-art models with half of the original model size and training time.

1253: IEEE SLT 2021 Alpha-mini Speech Challenge: Open Datasets, Tracks, Rules and Baselines

Yihui Fu, Zhuoyuan Yao, Weipeng He, Jian Wu, Xiong Wang, Zhanheng Yang, Shimin Zhang, Lei Xie, Dongyan Huang, Hui Bu, Petr Motlicek, Jean-Marc Odobez

Paper and Video (new tab)

The IEEE Spoken Language Technology Workshop (SLT) 2021 Alpha-mini Speech Challenge (ASC) is intended to improve research on keyword spotting (KWS) and sound source localization (SSL) on humanoid robots. Many publications have reported significant improvements in deep-learning-based KWS and SSL on open-source datasets in recent years. For deep learning model training, it is necessary to expand the data coverage to improve the robustness of the model. Thus, simulating multi-channel noisy and reverberant data from single-channel speech, noise, echo and room impulse responses (RIRs) is widely adopted. However, this approach may generate a mismatch between simulated data and data recorded in real application scenarios, especially echo data. In this challenge, we open-source a sizable speech, keyword, echo and noise corpus to promote data-driven methods, particularly deep-learning approaches to KWS and SSL. We also choose Alpha-mini, a humanoid robot produced by UBTECH equipped with a built-in four-microphone array on its head, to record development and evaluation sets under the actual Alpha-mini robot application scenario, including noise as well as echo and mechanical noise generated by the robot itself, for model evaluation. Furthermore, we describe the rules, evaluation methods and baselines so that researchers can quickly assess their achievements and optimize their models.
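
For context, here is a sketch of the common simulation recipe the abstract refers to: convolve clean speech with a room impulse response and add noise at a target SNR. The signal lengths, the SNR, and the random placeholders standing in for real audio are assumptions.

# Simulate a reverberant, noisy utterance from clean speech, an RIR, and noise.
import numpy as np
from scipy.signal import fftconvolve

def simulate(speech, rir, noise, snr_db=10.0):
    rev = fftconvolve(speech, rir)[: len(speech)]     # reverberant speech
    noise = noise[: len(rev)]
    speech_pow = np.mean(rev ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return rev + scale * noise                        # mixture at the target SNR

mix = simulate(np.random.randn(16000), np.random.randn(2048) * 0.1,
               np.random.randn(16000))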

1340: DETECTING EXPRESSIONS WITH MULTIMODAL TRANSFORMERS

Srinivas Parthasarathy, Shiva Sundaram

Paper and Video (new tab)

Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person's audio-visual expression, which includes tone of voice and facial expression, serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of a user's expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to the current state of the art. Next, we propose a transformer architecture with encoder layers that better integrates audio-visual features for expression tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than the baseline architecture with recurrent layers, with absolute gains of approximately 2% for the arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities, with gains of up to 3.6%. Ablation studies show the significance of the visual modality for expression detection on the Aff-Wild2 database.
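
A minimal sketch of a multimodal transformer encoder for frame-level arousal/valence prediction along the lines described above; the feature dimensions and the concatenation-based fusion are assumptions rather than the exact Aff-Wild2 model.

# Concatenate per-frame audio and visual features, encode with a transformer,
# and regress arousal and valence per frame.
import torch
import torch.nn as nn

class AVExpressionModel(nn.Module):
    def __init__(self, a_dim=40, v_dim=512, d_model=256):
        super().__init__()
        self.proj = nn.Linear(a_dim + v_dim, d_model)     # early fusion by concatenation
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 2)                 # arousal, valence

    def forward(self, audio, video):
        x = self.proj(torch.cat([audio, video], dim=-1))  # (B, T, d_model)
        return self.head(self.encoder(x))                 # (B, T, 2)

out = AVExpressionModel()(torch.randn(2, 100, 40), torch.randn(2, 100, 512))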

1298: META LEARNING TO CLASSIFY INTENT AND SLOT LABELS WITH NOISY FEW SHOT EXAMPLES

Shang-Wen Li, Jason Krone, Shuyan Dong, Yi Zhang, Yaser Al-onaizan

Paper and Video (new tab)

Recently deep learning has dominated many machine learning areas, including spoken language understanding (SLU). However, deep learning models are notorious for being data-hungry, and the heavily optimized models are usually sensitive to the quality of the training examples provided and the consistency between training and inference conditions. To improve the performance of SLU models on tasks with noisy and low training resources, we propose a new SLU benchmarking task: few-shot robust SLU, where SLU comprises two core problems, intent classification (IC) and slot labeling (SL). We establish the task by defining few-shot splits on three public IC/SL datasets, ATIS, SNIPS, and TOP, and adding two types of natural noise (adaptation example missing/replacing and modality mismatch) to the splits. We further propose a novel noise-robust few-shot SLU model based on prototypical networks. We show the model consistently outperforms the conventional fine-tuning baseline and another popular meta-learning method, Model-Agnostic Meta-Learning (MAML), in terms of achieving better IC accuracy and SL F1, and yielding smaller performance variation when noise is present.

1273: TAL: A SYNCHRONISED MULTI-SPEAKER CORPUS OF ULTRASOUND TONGUE IMAGING, AUDIO, AND LIP VIDEOS

Manuel Sam Ribeiro, Jennifer Sanger, Jing-Xuan Zhang, Aciel Eshky, Alan Wrench, Korin Richmond, Steve Renals

Paper and Video (new tab)

We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a set of six recording sessions of one professional voice talent, a male native speaker of English; TaL80 is a set of recording sessions of 81 native speakers of English without voice talent experience. Overall, the corpus contains 24 hours of parallel ultrasound, video, and audio data, of which approximately 13.5 hours are speech. This paper describes the corpus and presents benchmark results for the tasks of speech recognition, speech synthesis (articulatory-to-acoustic mapping), and automatic synchronisation of ultrasound to audio. The TaL corpus is publicly available under the CC BY-NC 4.0 license.

Q&A
Use the room number in parentheses to join the individual Zoom breakout room for each paper.

1104: LARGE-CONTEXT CONVERSATIONAL REPRESENTATION LEARNING: SELF-SUPERVISED LEARNING FOR CONVERSATIONAL DOCUMENTS

Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

Paper and Video (new tab)

This paper presents a novel self-supervised learning method for handling conversational documents consisting of transcribed text of human-to-human conversations. One of the key technologies for understanding conversational documents is utterance-level sequential labeling, where labels are estimated from the documents in an utterance-by-utterance manner. The main issue with utterance-level sequential labeling is the difficulty of collecting labeled conversational documents, as manual annotations are very costly. To deal with this issue, we propose large-context conversational representation learning (LC-CRL), a self-supervised learning method specialized for conversational documents. A self-supervised learning task in LC-CRL involves the estimation of an utterance using all the surrounding utterances based on large-context language modeling. In this way, LC-CRL enables us to effectively utilize unlabeled conversational documents and thereby enhances the utterance-level sequential labeling. The results of experiments on scene segmentation tasks using contact center conversational datasets demonstrate the effectiveness of the proposed method.

1414: THE SLT 2021 CHILDREN SPEECH RECOGNITION CHALLENGE: OPEN DATASETS, RULES AND BASELINES

Fan Yu, Zhuoyuan Yao, Xiong Wang, Keyu An, Lei Xie, Zhijian Ou, Bo Liu, Xiulin Li, Guanqiong Miao

Paper and Video (new tab)

Automatic speech recognition (ASR) has been significantly advanced with the use of deep learning and big data. However, improving robustness, including achieving equally good performance on diverse speakers and accents, is still a challenging problem. In particular, the performance of children's speech recognition (CSR) still lags behind because 1) the speech and language characteristics of children's voices are substantially different from those of adults and 2) a sizable open dataset for children's speech is still not available to the research community. To address these problems, we launch the Children Speech Recognition Challenge (CSRC) as a flagship satellite event of the IEEE SLT 2021 workshop. The challenge will release about 400 hours of Mandarin speech data to registered teams, set up two challenge tracks, and provide a common testbed to benchmark CSR performance. In this paper, we introduce the datasets, rules, and evaluation method, as well as the baselines.

1358: USING PARALINGUISTIC INFORMATION TO DISAMBIGUATE USER INTENTIONS FOR DISTINGUISHING PHRASE STRUCTURE AND SARCASM IN SPOKEN DIALOG SYSTEMS

Zhengyu Zhou, In Gyu Choi, Yongliang He, Vikas Yadav, Chin-Hui Lee

Paper and Video (new tab)

This paper aims at utilizing paralinguistic information usually hidden in speech signals, such as pitch, short pauses and sarcasm, to disambiguate user intentions not easily distinguishable from the speech recognition and natural language understanding results provided by a state-of-the-art spoken dialog system (SDS). We propose two methods to address ambiguities in understanding named entities and sentence structures based on relevant speech cues and nuances. We also propose an approach to capturing sarcasm in speech and generating sarcasm-sensitive responses using an end-to-end neural network. An SDS prototype that directly feeds signal information into the understanding and response generation components has also been developed to support the three proposed applications. We have achieved encouraging experimental results in this initial study, demonstrating the potential of this new research direction.

Q&A
Use the room number in parentheses to join the individual Zoom breakout room for each paper.


Copyright © 2019-2021. SLT2021 Organizing Committee. All rights reserved.