Platinum Sponsors

Gold Sponsors

Silver Sponsors

IEEE SLT 2021 Online Conference

Jan 20, Wed

You can join a session by clicking the session title. Clicking any paper title will show the abstract, and the link to paper and video.
Only users who registered for SLT 2021 can access paper, video and Zoom - to register, click here.
Session: Natural Language Processing Join Zoom

Chair: Yun-Nung (Vivian) Chen, Alexandros Papangelis


Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu

Paper and Video (new tab)

We propose a block-online algorithm of guided source separation (GSS). GSS is a speech separation method that uses diarization information to update parameters of the generative model of observation signals. Previous studies have shown that GSS performs well in multi-talker scenarios. However, it requires a large amount of calculation time, which is an obstacle to the deployment of online applications. It is also a problem that the offline GSS is an utterance-wise algorithm so that it produces latency according to the length of the utterance. With the proposed algorithm, block-wise input samples and corresponding time annotations are concatenated with those in the preceding context and used to update the parameters. Using the context enables the algorithm to estimate time-frequency masks accurately only from one iteration of optimization for each block, and its latency does not depend on the utterance length but predetermined block length. It also reduces calculation cost by updating only the parameters of active speakers in each block and its context. Evaluation on the CHiME-6 corpus and a meeting corpus showed that the proposed algorithm achieved almost the same performance as the conventional offline GSS algorithm but with 32x faster calculation, which is sufficient for real-time applications.


Christiaan Jacobs, Yevgen Matusevych, Herman Kamper

Paper and Video (new tab)

Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-length speech segments. For zero-resource languages where labelled data is not available, one AWE approach is to use unsupervised autoencoder-based recurrent models. Another recent approach is to use multilingual transfer: a supervised AWE model is trained on several well-resourced languages and then applied to an unseen zero-resource language. We consider how a recent contrastive learning loss can be used in both the purely unsupervised and multilingual transfer settings. Firstly, we show that terms from an unsupervised term discovery system can be used for contrastive self-supervision, resulting in improvements over previous unsupervised monolingual AWE models. Secondly, we consider how multilingual AWE models can be adapted to a specific zero-resource language using discovered terms. We find that self-supervised contrastive adaptation outperforms adapted multilingual correspondence autoencoder and Siamese AWE models, giving the best overall results in a word discrimination task on six zero-resource languages.


Ting-Yun Chang, Yang Liu, Karthik Gopalakrishnan, Behnam Hedayatnia, Pei Zhou, Dilek Hakkani-Tur

Paper and Video (new tab)

Pretrained language models have demonstrated outstanding performance in many NLP tasks recently. However, their social intelligence, which requires commonsense reasoning about the current situation and mental states of others, is still developing. Towards improving language models' social intelligence, this paper focus on the Social IQA dataset, a task requiring social and emotional commonsense reasoning. Building on top of the pretrained RoBERTa and GPT2 models, we propose several architecture variations and extensions, as well as leveraging external commonsense corpora, to optimize the model for Social IQA. Our proposed system achieves competitive results as those top-ranking models on the leaderboard. This paper studies the strengths of pretrained language models, and provides viable ways to improve their performance for a particular task.

1200: Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong

Paper and Video (new tab)

The external language models (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR) which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain. With ILME, the internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM. The internal LM scores are approximated as the output of an E2E model when eliminating its acoustic components. ILME can alleviate the domain mismatch between training and testing, or improve the multi-domain E2E ASR. Experimented with 30K-hour trained RNN-T and AED models, ILME achieves up to 15.5% and 6.8% relative word error rate reductions from Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.


Lisa van Staden, Herman Kamper

Paper and Video (new tab)

Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWE) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in the form of automatically discovered word-like segments. Rather than learning embeddings at the \textit{segment level}, another line of zero-resource research has looked at representation learning at the short-time frame level. Recent approaches include self-supervised predictive coding and correspondence autoencoder~(CAE) models. In this paper we consider whether these frame-level features are beneficial when used as inputs for training to an unsupervised AWE model. We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCCs. These are used as inputs to a recurrent CAE-based AWE model. In a word discrimination task on English and Xitsonga data, all three representation learning approaches outperform MFCCs, with CPC consistently showing the biggest improvement. In cross-lingual experiments we find that CPC features trained on English can also be transferred to Xitsonga.


ZEXIN LU, Jing Li, Yingyi Zhang, Haisong Zhang

Paper and Video (new tab)

This paper presents a predictive study on the progress of conversations. Specifically, we estimate the residual life for conversations, which is defined as the count of new turns to occur in a conversation thread. While most previous work focus on coarse-grained estimation that classifies the number of coming turns into two categories, we study fine-grained categorization for varying lengths of residual life. To this end,we propose a hierarchical neural model that jointly explores indicative representations from the content in turns and the structure of conversations in an end-to-end manner. Extensive experiments on both human-human and human-machine conversations demonstrate the superiority of our proposed model and its potential helpfulness in chatbot response selection.

1201: Deep Shallow Fusion for RNN-T Personalization

Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen, Michael Seltzer

Paper and Video (new tab)

End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks. However, these models are more challenging to personalize compared to traditional hybrid systems due to the lack of external language models and difficulties in recognizing rare long-tail words, specifically entity names. In this work, we present novel techniques to improve RNN-T's ability to model rare WordPieces, infuse extra information into the encoder, enable the use of alternative graphemic pronunciations, and perform deep fusion with personalized language models for more robust biasing. We show that these combined techniques result in 15.4%-34.5% relative Word Error Rate improvement compared to a strong RNN-T baseline which uses shallow fusion and text-to-speech augmentation. Our work helps push the boundary of RNN-T personalization and close the gap with hybrid systems on use cases where biasing and entity recognition are crucial.


Yushi Hu, Shane Settle, Karen Livescu

Paper and Video (new tab)

Query-by-example (QbE) speech search is the task of matching spoken queries to utterances within a search collection. In low-or zero-resource settings, QbE search is often addressed with approaches based on dynamic time warping (DTW). Recent work has found that methods based on acoustic word embeddings (AWEs) can improve both performance and search speed. However, prior work on AWE-based QbE has primarily focused on English data and with single-word queries. In this work, we generalize AWE training to spans of words, producing acoustic span embeddings (ASE), and explore the application of ASE to QbE with arbitrary-length queries in multiple unseen languages. We consider the commonly used setting where we have access to labeled data in other languages (in our case, several low-resource languages) distinct from the unseen test languages. We evaluate our approach on the QUESST2015 QbE tasks, finding that multilingual ASE-based search is much faster than DTW-based search and outperforms the best previously published results on this task.


Hiroaki Takatsu, Mayu Okuda, Yoichi Matsuyama, Hiroshi Honda, Shinya Fujie, Tetsunori Kobayashi

Paper and Video (new tab)

In modern society, people's interests and preferences are diversifying. Along with this, the demand for personalized summarization technology is increasing. In this study, we propose a method for generating summaries tailored to each user's interests using profile features obtained from questionnaires administered to users of our spoken-dialogue news delivery system. We propose a method that collects and uses the obtained user profile features to generate a summary tailored to each user's interests, specifically, the sentence features obtained by BERT and user profile features obtained from the questionnaire result. In addition, we propose a method for extracting sentences by solving an integer linear programming problem that considers redundancy and context coherence, using the degree of interest in sentences estimated by the model. The results of our experiments confirmed that summaries generated based on the degree of interest in sentences estimated using user profile information can transmit information more efficiently than summaries based solely on the importance of sentences.


Dan Oneață, Alexandru Caranica, Adriana Stan, Horia Cucu

Paper and Video (new tab)

Quantifying the confidence (or conversely the uncertainty) of a prediction is a highly desirable trait of an automatic system, as it improves the robustness and usefulness in downstream tasks. In this paper we investigate confidence estimation for end-to-end automatic speech recognition (ASR). Previous work has addressed confidence measures for lattice-based ASR, while current machine learning research mostly focuses on confidence measures for unstructured deep learning. However, as the ASR systems are increasingly being built upon deep end-to-end methods, there is little work that tries to develop confidence measures in this context. We fill this gap by providing an extensive benchmark of popular confidence methods on four well-known speech datasets. There are two challenges we overcome in adapting existing methods: working on structured data (sequences) and obtaining confidences at a coarser level than the predictions (words instead of tokens). Our results suggest that a strong baseline can be obtained by scaling the logits by a learnt temperature, followed by estimating the confidence as the negative entropy of the predictive distribution and, finally, sum pooling to aggregate at word level.


Merve Unlu, Ebru Arisoy

Paper and Video (new tab)

This paper describes a spoken question answering system that utilizes the uncertainty in automatic speech recognition (ASR) to mitigate the effect of ASR errors on question answering. Spoken question answering is typically performed by transcribing spoken content with an ASR system and then applying text-based question answering methods to the ASR transcriptions. Question answering on spoken documents is more challenging than question answering on text documents since ASR transcriptions can be erroneous and this degrades the system performance. In this paper, we propose integrating confusion networks with word confidence scores into an end-to-end neural network-based question answering system that works on ASR transcriptions. Integration is performed by generating uncertainty-aware embedding representations from confusion networks. The proposed approach improves F1 score in a question answering task developed for spoken lectures by providing tighter integration of ASR and question answering.


Tomek Rutowski, Elizabeth Shriberg, Amir Harati, Yang Lu, Piotr Chlebek, Ricardo Oliveira

Paper and Video (new tab)

Deep learning models are rapidly gaining interest for real-world applications in behavioral health. An important gap in current literature is how well such models generalize over different populations. We study Natural Language Processing (NLP) based models to explore portability over two different corpora highly mismatched in age. The first and larger corpus contains younger speakers. It is used to train an NLP model to predict depression. When testing on unseen speakers from the same age distribution, this model performs at AUC=0.82. We then test this model on the second corpus, which comprises seniors from a retirement community. Despite the large demographic differences in the two corpora, we saw only modest degradation in performance for the senior-corpus data, achieving AUC=0.76. Interestingly, in the senior population, we find AUC=0.81 for the subset of patients whose health state is consistent over time. Implications for demographic portability of speech-based applications are discussed.


Shih-Hsuan Chiu, Berlin Chen

Paper and Video (new tab)

More recently, Bidirectional Encoder Representations from Transformers (BERT) was proposed and has achieved impressive success on many natural language processing (NLP) tasks such as question answering and language understanding, due mainly to its effective pre-training then fine-tuning paradigm as well as strong local contextual modeling ability. In view of the above, this paper presents a novel instantiation of the BERT-based contextualized language models (LMs) for use in reranking of N-best hypotheses produced by automatic speech recognition (ASR). To this end, we frame N-best hypothesis reranking with BERT as a prediction problem, which aims to predict the oracle hypothesis that has the lowest word error rate (WER) given the N-best hypotheses (denoted by PBERT). In particular, we also explore to capitalize on task-specific global topic information in an unsupervised manner to assist PBERT in N-best hypothesis reranking (denoted by TPBERT). Extensive experiments conducted on the AMI benchmark corpus demonstrate the effectiveness and feasibility of our methods in comparison to the conventional autoregressive models like the recurrent neural network (RNN) and a recently proposed method that employed BERT to compute pseudo-log-likelihood (PLL) scores for N-best hypothesis reranking.

1120: Tight Integrated End-to-End Training for Cascaded Speech Translation

Parnia Bahar, Tobias Bieschke, Ralf Schlueter, Hermann Ney

Paper and Video (new tab)

A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between ASR and MT models. Direct speech translation is an alternative method to avoid error propagation; however, its performance is often behind the cascade system. To use an intermediate representation and preserve the end-to-end trainability, previous studies have proposed using two-stage models by passing the hidden vectors of the recognizer into the decoder of the MT model and ignoring the MT encoder. This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model by optimizing all parameters of ASR and MT models jointly without ignoring any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as a soft decision instead of one-hot vectors and enables backpropagation. Therefore, it provides both transcriptions and translations and achieves strong consistency between them. Our experiments on four tasks with different data scenarios show that the model outperforms cascade models up to 1.8% in BLEU and 2.0% in TER and is superior compared to direct models.


Huan-Yu Chen, Yun-Shao Lin, Chi-Chun Lee

Paper and Video (new tab)

Research into understanding humor has been investigated over centuries. It has recently attracted various technical effort in computing humor automatically from data, especially for humor in speech. Comprehension on the same speech and the ability to realize a humor event vary depending on each individual audience's background and experience. Most previous works on automatic humor detection or impression recognition mainly model the produced textual content only without considering audience responses. We collect a corpus of TED Talks including audience comments for each of the presented TED speech. We propose a novel network architecture that considers the natural entanglement between speech transcripts and user's online feedbacks as an integrative graph structure, where the content speech and online feedbacks are nodes where the edges are connected though their common words. Our model achieves 61.2\% of accuracy in a three-class classification on humor impression recognition on TED talks; our experiments further demonstrate viewers comments are essential in improving the recognition tasks, and a joint content-comment modeling achieves the best recognition.


Bipasha Sen, Aditya Agarwal, Mirishkar Ganesh, Anil Vuppala

Paper and Video (new tab)

Multilingual automatic speech recognition (ASR) system is a single entity capable of transcribing multiple languages sharing a common phone space. Performance of such a system is highly dependent on the compatibility of the languages. State of the art speech recognition systems are built using sequential architectures based on recurrent neural networks (RNN) limiting the computational parallelization in training. This poses a significant challenge in terms of time taken to bootstrap and validate the compatibility of multiple languages for building a robust multilingual system. Complex architectural choices based on self-attention networks are made to improve the parallelization thereby reducing the training time. In this work, we propose Reed, a simple system based on 1D convolutions which uses very short context to improve the training time. To improve the performance of our system, we use raw time-domain speech signals directly as input. This enables the convolutional layers to learn feature representations rather than relying on handcrafted features such as MFCC. We report improvement on training and inference times by atleast a factor of 4x and 7.4x respectively with comparable WERs against standard RNN based baseline systems on SpeechOcean's multilingual low resource dataset.

1436: Transformer-based Direct Speech-to-speech Translation with Transcoder

Takatomo Kano, Sakriani Sakti, Satoshi Nakamura

Paper and Video (new tab)

Traditional speech translation systems use a cascade manner that concatenates speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another language in a step-by-step manner. Unfortunately, since those components are trained separately, MT often struggles to handle ASR errors, resulting in unnatural translation results. Recently, one work attempted to construct direct speech translation in a single model. The model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcription as auxiliary tasks. However, that work was only evaluated Spanish-English language pairs with similar syntax and word order. With syntactically distant language pairs, speech translation requires distant word order, and thus direct speech frame-to-frame alignments become difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process. However, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on a recurrent neural network (RNN) model. In this work, we propose a step-by-step scheme to a complete end-to-end speech-to-speech translation and propose a Transformer-based speech translation using Transcoder. We compare our proposed and multi-task model using syntactically similar and distant language pairs.

1441: Two-Way Neural Machine Translation: A Proof of Concept for Bidirectional Translation Modeling using a Two-Dimensional Grid

Parnia Bahar, Christopher Brix, Hermann Ney

Paper and Video (new tab)

Neural translation models have proven to be effective in capturing sufficient information from a source sentence and generating a high-quality target sentence. However, it is not easy to get the best effect for bidirectional translation, i.e., both source-to-target and target-to-source translation using a single model. If we exclude some pioneering attempts, such as multilingual systems, all other bidirectional translation approaches are required to train two individual models. This paper proposes to build a single end-to-end bidirectional translation model using a two-dimensional grid, where the left-to-right decoding generates source-to-target, and the bottom-to-up decoding creates target-to-source output. Instead of training two models independently, our approach encourages a single network to jointly learn to translate in both directions. Experiments on the WMT 2018 German-English and Turkish-English translation tasks show that the proposed model is capable of generating a good translation quality and has sufficient potential to direct the research.


Minguang Song, Yunxin Zhao, Shaojun Wang, Mei Han

Paper and Video (new tab)

Label smoothing has been shown as an effective regularization approach for deep neural networks. Recently, a context-sensitive label smoothing approach was proposed for training RNNLMs that improved word error rates on speech recognition tasks. Despite the performance gains, its plausible candidate words for label smoothing were confined to $n$-grams observed in training data. To investigate the potential of label smoothing in model training with insufficient data, in this current work, we propose to utilize the similarity between word embeddings to build a candidate word set for each target word, where by doing so, plausible words outside the n-grams in training data may be found and introduced into candidate word sets for label smoothing. Moreover, we propose to combine the smoothing labels from the n-gram based and the word similarity based methods to improve the generalization capability of RNNLMs. Our proposed approach to RNNLM training has been evaluated for n-best list rescoring on speech recognition tasks of WSJ and AMI, with improved experimental results on word error rates confirming its effectiveness.

Use the room number in the parenthesis to join the individual Zoom breakout room for each paper.

Use the room number in the parenthesis to join the individual Zoom breakout room for each paper.


Use the room number in the parenthesis to join the individual Zoom breakout room for each paper.


Copyright © 2019-2021. SLT2021 Organizing Committee. All rights reserved.