IEEE SLT 2021 Online Conference

Jan 21, Thu

Session: Emotion Recognition

Chair: Elie Khoury, Liping Chen

Session: Neural Vocoder and Others

Chair: Junichi Yamagishi, Li-Juan Liu


Kazuhiro Nakadai, Yosuke Fukumoto, Ryu Takeda

This paper investigates node-pruning-based compression for non-uniform deep learning models such as acoustic models in automatic speech recognition (ASR). Node pruning for small footprint ASR has been well studied, but most studies assumed a sigmoid as an activation function and uniform or simple fully-connected neural networks without bypass connections. We propose a node pruning method that can be applied to non-sigmoid functions such as ReLU and that can deal with network topology related issues such as bypass connections. To deal with non-sigmoid functions, we extend a node entropy technique to estimate node activities. To cope with non-uniform network topology, we propose three criteria: inter-layer pairing, no bypass connection pruning, and layer-based pruning rate configuration. The proposed method, a combination of these four techniques and criteria, was applied to compress a Kaldi acoustic model with ReLU as a non-linear function, time delay neural networks (TDNN) and bypass connections inspired by residual networks. Experimental results showed that, by taking network topology into consideration, the proposed method achieved a 31% speed increase while maintaining comparable ASR accuracy.


Bo-Hao Su, Chi-Chun Lee

Speech emotion recognition (SER) is important in enabling personalized services in our life. It has also become a prevalent topic of research with its potential in creating a better user experience across many modern speech technologies. However, the highly contextualized scenario and expensive emotion labeling required cause a severe mismatch between already limited-in-scale speech emotional corpora; this hinders the wide adoption of SER. In this work, instead of conventionally learning a common feature space between corpora, we take a novel approach in enhancing the variability of the source (labeled) corpus that is target (unlabeled) data-aware by generating synthetic source domain data using a conditional cycle emotion generative adversarial network (CCEmoGAN). We evaluate our framework in cross-corpus emotion recognition tasks and obtain three-class valence recognition accuracies of 47.56% and 50.11% and activation accuracies of 51.13% and 65.7% when transferring from the IEMOCAP to the CIT dataset and from the IEMOCAP to the MSP-IMPROV dataset, respectively. The benefit of increasing target domain-aware variability in the source domain to improve emotion discriminability in cross-corpus emotion recognition is further visualized in our augmented data space.


Eunwoo Song, Ryuichi Yamamoto, Min-Jae Hwang, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim

This paper proposes a spectral-domain perceptual weighting technique for Parallel WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN vocoder successfully generates waveform sequences using a fast non-autoregressive WaveNet model. By employing multiresolution short-time Fourier transform (MR-STFT) criteria with a generative adversarial network, the light-weight convolutional networks can be effectively trained without any distillation process. To further improve the vocoding performance, we propose the application of frequency-dependent weighting to the MR-STFT loss function. The proposed method penalizes perceptually-sensitive errors in the frequency domain; thus, the model is optimized toward reducing auditory noise in the synthesized speech. Subjective listening test results demonstrate that our proposed method achieves 4.21 and 4.26 TTS mean opinion scores for female and male Korean speakers, respectively.


Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun

For sequence transduction tasks like speech recognition, a strong structured prior model encodes rich information about the target space, implicitly ruling out invalid sequences by assigning them low probability. In this work, we propose local prior matching (LPM), a semi-supervised objective that distills knowledge from a strong prior (e.g. a language model) to provide learning signal to an end-to-end model trained on unlabeled speech. We demonstrate that LPM is simple to implement and superior to existing knowledge distillation techniques under comparable settings. Starting from a baseline trained on 100 hours of labeled speech, with an additional 360 hours of unlabeled data, LPM recovers 54%/82% and 73%/91% of the word error rate on clean and noisy test sets with/without language model rescoring relative to a fully supervised model on the same data.


Michael Neumann, Ngoc Thang Vu

In this paper we explore audiovisual emotion recognition under noisy acoustic conditions with a focus on speech features. We attempt to answer the following research questions: (i) How does speech emotion recognition perform on noisy data? and (ii) To what extent does a multimodal approach improve the accuracy and compensate for potential performance degradation at different noise levels? We present an analytical investigation on two emotion datasets with superimposed noise at different signal-to-noise ratios, comparing three types of acoustic features. Visual features are incorporated with a hybrid fusion approach: the first neural network layers are separate modality-specific ones, followed by at least one shared layer before the final prediction. The results show a significant performance decrease when a model trained on clean audio is applied to noisy data and that the addition of visual features alleviates this effect.


Yang Ai, Haoyu Li, Xin Wang, Junichi Yamagishi, Zhenhua Ling

This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) to convert noisy and reverberant acoustic features into a clean speech waveform. We implement it mainly by modifying the amplitude spectrum predictor (ASP) in the original HiNet vocoder. This modified denoising and dereverberation ASP (DNR-ASP) can predict clean log amplitude spectra (LAS) from input degraded acoustic features. To achieve this, the DNR-ASP first predicts the noisy and reverberant LAS, the noise LAS related to the noise information, and the room impulse response related to the reverberation information, and then performs initial denoising and dereverberation. The initially processed LAS are then enhanced by another neural network to give the final clean LAS. To further improve the quality of the generated clean LAS, we also introduce a bandwidth extension model and a frequency resolution extension model in the DNR-ASP. The experimental results indicate that the DNR-HiNet vocoder was able to generate a denoised and dereverberated waveform given noisy and reverberant acoustic features and outperformed the original HiNet vocoder and a few other neural vocoders. We also applied the DNR-HiNet vocoder to speech enhancement tasks, and its performance was competitive with several advanced speech enhancement methods.


Jaesung Huh, Minjae Lee, Heesoo Heo, Seongkyu Mun, Joon Son Chung

The goal of this work is to train effective representations for keyword spotting via metric learning. Most existing works address keyword spotting as a closed-set classification problem, where both target and non-target keywords are predefined. Therefore, prevailing classifier-based keyword spotting systems perform poorly on non-target sounds which are unseen during the training stage, causing high false alarm rates in real-world scenarios. In reality, keyword spotting is a detection problem where predefined target keywords are detected from a variety of unknown sounds. This shares many similarities to metric learning problems in that the unseen and unknown non-target sounds must be clearly differentiated from the target keywords. However, a key difference is that the target keywords are known and predefined. To this end, we propose a new method based on metric learning that maximises the distance between target and non-target keywords, but also learns per-class weights for target keywords as in classification objectives. Experiments on the Google Speech Commands dataset show that our method significantly reduces false alarms to unseen non-target keywords, while maintaining the overall classification accuracy.


Patrick Meyer, Ziyi Xu, Tim Fingscheidt

Deep learning has increased the interest in speech emotion recognition (SER) and has put forth diverse structures and methods to improve performance. In recent years it has turned out that applying SER on a (log-mel) spectrogram and thus, interpreting SER as an image recognition task is a promising method. Following the trend towards using a convolutional neural network (CNN) in combination with a bidirectional long short-term memory (BLSTM) layer, and some subsequent fully connected layers, in this work, we advance the performance of this topology by several contributions: We integrate a multi-kernel width CNN, propose a BLSTM output summarization function, apply an enhanced feature representation, and introduce an effective training method. In order to foster insight into our proposed methods, we separately evaluate the impact of each modification in an ablation study. Based on our modifications, we obtain top results for this type of topology on IEMOCAP with an unweighted average recall of 64.5 % on average.
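
For readers unfamiliar with the metric quoted above, unweighted average recall (UAR) is the mean of the per-class recalls, so each emotion class counts equally regardless of how many samples it has. The following is a minimal sketch of that definition only; the function name and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (not data from the paper).
y_true = ["neutral"] * 4 + ["angry"] * 2
y_pred = ["neutral", "neutral", "neutral", "angry", "angry", "neutral"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

On imbalanced data such as this toy example, UAR and plain accuracy differ because the minority class carries the same weight as the majority class.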

1262: MelGlow: Efficient Waveform Generative Network Based on Location-Variable Convolution

Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao

Recent neural vocoders usually use a WaveNet-like network to capture the long-term dependencies of the waveform, but a large number of parameters are required to obtain good modeling capabilities. In this paper, an efficient network, named location-variable convolution, is proposed to model the dependencies of the waveform. Unlike WaveNet, which uses unified convolution kernels to capture the dependencies of arbitrary waveforms, the location-variable convolution utilizes a kernel predictor to generate multiple sets of convolution kernels based on the mel-spectrum, where each set of convolution kernels is used to perform convolution operations on the associated intervals of the waveform. Combining WaveGlow with the location-variable convolution, an efficient vocoder, named MelGlow, is designed. Experiments on the LJSpeech dataset show that MelGlow achieves better performance than WaveGlow at small model sizes, which verifies the effectiveness and potential optimization space of the location-variable convolution.


Alexandru-Lucian Georgescu, Cristian Manolache, Dan Oneață, Horia Cucu, Corneliu Burileanu

Self-training is a simple and efficient way of leveraging unlabeled speech data: (i) start with a seed system trained on transcribed speech; (ii) pass the unlabeled data through this seed system to automatically generate transcriptions; (iii) enlarge the initial dataset with the self-labeled data and retrain the speech recognition system. However, in order not to pollute the augmented dataset with incorrect transcriptions, an important intermediary step is to select those parts of the self-labeled data that are accurate. Several approaches have been proposed in the community, but most of the works address only a single method. In contrast, in this paper we inspect three distinct classes of data-filtering for self-training, leveraging: (i) confidence scores, (ii) multiple ASR hypotheses and (iii) approximate transcriptions. We evaluate these approaches from two perspectives: quantity vs. quality of the selected data and improvement of the seed ASR by including this data. The proposed methodology achieves state-of-the-art results on Romanian speech, obtaining 25% relative improvement over prior work. Among the three methods, approximate transcriptions bring the highest performance gain, even if they yield the least quantity of data.
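
The self-training loop with confidence-based filtering described above can be sketched as follows. This is only an illustration of the general idea under stated assumptions: the `Hypothesis` class, the function names, the 0.9 threshold and the toy utterances are all hypothetical, the confidence is assumed to be a single score in [0, 1] per utterance, and the seed recognizer and retraining steps are left out.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One automatic transcription produced by the seed system (hypothetical type)."""
    utterance_id: str
    text: str
    confidence: float  # assumed to lie in [0, 1]


def select_confident(hypotheses, threshold):
    """Keep only self-labeled utterances whose confidence reaches the threshold."""
    return [h for h in hypotheses if h.confidence >= threshold]


def enlarge_training_set(labeled, hypotheses, threshold=0.9):
    """Step (iii): add the filtered self-labeled data to the transcribed data."""
    selected = select_confident(hypotheses, threshold)
    return labeled + [(h.utterance_id, h.text) for h in selected]


# Hypothetical toy data standing in for steps (i) and (ii).
labeled = [("utt-001", "buna ziua")]
hypotheses = [
    Hypothesis("utt-101", "multumesc frumos", 0.97),
    Hypothesis("utt-102", "la reved", 0.41),
]

augmented = enlarge_training_set(labeled, hypotheses)
print(augmented)
```

Raising the threshold trades quantity for quality of the selected data, which is the trade-off the abstract evaluates.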


Manon Macary, Marie Tahon, Yannick Estève, Anthony Rousseau

Pre-training for feature extraction is an increasingly studied approach to get better continuous representations of audio and text content. In the present work, we use wav2vec and camemBERT as self-supervised learned models to represent our data in order to perform continuous emotion recognition from speech (SER) on AlloSat, a large French emotional database describing the satisfaction dimension, and on the state of the art corpus SEWA focusing on valence, arousal and liking dimensions. To the authors’ knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is very relevant to deal with continuous SER task, usually characterized by a small amount of labeled training data. Evaluated by the well-known concordance correlation coefficient (CCC), our experiments show that we can reach a CCC value of 0.825 instead of 0.592 when using MFCC in conjunction with word2vec word embeddings on the AlloSat dataset.
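
As a reminder of how the evaluation metric mentioned above is defined, the concordance correlation coefficient between predictions x and gold annotations y is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below implements that standard formula with population statistics; the function name and the toy values are hypothetical and unrelated to the reported results.

```python
def concordance_correlation_coefficient(x, y):
    """CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical toy values: the predictions follow the gold trend but are shifted by 1.
gold = [0.0, 1.0, 2.0]
pred = [1.0, 2.0, 3.0]

ccc = concordance_correlation_coefficient(pred, gold)
print(f"CCC = {ccc:.4f}")
```

Unlike Pearson correlation, which would be 1 for this toy pair, CCC also penalizes the constant offset between predictions and annotations.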

1419: Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN has achieved high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model effectively reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation can achieve a real-time factor of 0.03 on CPU without hardware specific optimization.


Prakhar Swarup, Debmalya Chakrabarty, Ashtosh Sapru, Hitesh Tulsiani, Harish Arsikere, Sri Garimella

Semi-supervised learning (SSL) is an active area of research which aims to utilize unlabelled data to improve the accuracy of speech recognition systems. While previous studies have established the efficacy of various SSL methods on varying amounts of data, this paper presents the largest ASR SSL experiment conducted to date, where 75K hours of transcribed and 1.2 million hours of untranscribed data are used for model training. In addition, the paper introduces a couple of novel techniques to facilitate such a large scale experiment: 1) a simple scalable Teacher-Student based SSL method for the connectionist temporal classification (CTC) objective and 2) effective data selection mechanisms for leveraging massive amounts of unlabelled data to boost the performance of student models. Further, we apply SSL in all stages of the acoustic model training, including final stage sequence discriminative training. Our experiments indicate encouraging word error rate (WER) gains of up to 14% in such a large scale transcribed data regime due to the SSL training.


Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram

Emotion recognition is a challenging task due to limited availability of in-the-wild labeled datasets. Self-supervised learning has shown improvements on tasks with limited labeled datasets in domains like speech and natural language. Models such as BERT learn to incorporate context in word embeddings, which translates to improved performance in downstream tasks like question answering. In this work, we extend self-supervised training to multi-modal applications. We learn multi-modal representations using a transformer trained on the masked language modeling task with audio, visual and text features. This model is fine-tuned on the downstream task of emotion recognition. Our results on the CMU-MOSEI dataset show that this pre-training technique can improve the emotion recognition performance by up to 3% compared to the baseline.


Song Li, Beibei Ouyang, Lin Li, Qingyang Hong

With the development of deep learning, end-to-end neural text-to-speech systems have achieved significant improvements on high-quality speech synthesis. However, most of these systems are attention-based autoregressive models, resulting in slow synthesis speed and large model parameters. In this paper, we propose a new lightweight non-autoregressive multi-speaker speech synthesis system, named LightSpeech, which utilizes the lightweight feedforward neural networks to accelerate synthesis and reduce the amount of parameters. With the speaker embedding, LightSpeech achieves multi-speaker speech synthesis extremely quickly. Experiments on the LibriTTS dataset show that, compared with FastSpeech, our smallest LightSpeech model achieves a 9.27x Mel-spectrogram generation acceleration on CPU, and the model size and parameters are compressed by 37.06x and 37.36x, respectively.

1330: Towards unsupervised learning of speech features in the wild

Morgane Riviere, Emmanuel Dupoux

Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has been tested on clean, curated speech datasets. Can it also be used with unprepared audio data “in the wild”? Here, we explore three problems that may hinder unsupervised learning in the wild: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a clean speech-only dataset, these problems combined can have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech parts inside a file, and perplexity of a model trained with clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive segment of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.


Shi-wook Lee

Domain generalization is a major challenge for cross-corpus speech emotion recognition. The recognition performance built on "seen" source corpora is inevitably degraded when the systems are tested against "unseen" target corpora that have different speakers, channels, and languages. We present a novel framework based on a triplet network to learn more generalized features of emotional speech that are invariant across multiple corpora. To reduce the intrinsic discrepancies between source and target corpora, an explicit feature transformation based on the triplet network is implemented as a preprocessing step. Extensive comparison experiments are carried out on three emotional speech corpora: two English corpora and one Japanese corpus. Remarkable improvements of up to 35.61% are achieved for all cross-corpus speech emotion recognition, and we show that the proposed framework using the triplet network is effective for obtaining more generalized features across multiple emotional speech corpora.


Hongqiang Du, Xiaohai Tian, Lei Xie, Haizhou Li

We propose a novel training scheme to optimize a voice conversion network with a speaker identity loss function. The training scheme minimizes not only the frame-level spectral loss but also the speaker identity loss. We introduce a cycle consistency loss that constrains the converted speech to maintain the same speaker identity as the reference speech at the utterance level. While the proposed training scheme is applicable to any voice conversion network, we formulate the study under the average model voice conversion framework in this paper. Experiments conducted on the CMU-ARCTIC and CSTR-VCTK corpora confirm that the proposed method outperforms baseline methods in terms of speaker similarity.


Bowen Shi, Shane Settle, Karen Livescu

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word (“acoustic-to-word”) speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over prior A2W models.


Alice Baird, Shahin Amiriparian, Manuel Milling, Björn Schuller

Speaking in public can be a cause of fear for many people. Research suggests that there are physical markers such as an increased heart rate and vocal tremolo that indicate an individual's state of wellbeing during a public speech. In this study, we explore the advantages of speech-based features for continuous recognition of the emotional dimensions of arousal and valence during a public speaking scenario. Furthermore, we explore biological signal fusion, and perform cross-language (German and English) analysis by training language-independent models and testing them on speech from various native and non-native speaker groupings. For the emotion recognition task itself, we utilise a Long Short-Term Memory - Recurrent Neural Network (LSTM-RNN) architecture with a self-attention layer. When utilising audio-only features and testing with non-native German speakers speaking German, we achieve at best a concordance correlation coefficient (CCC) of 0.640 and 0.491 for arousal and valence, respectively, demonstrating a strong effect for this task from non-native speakers, as well as promise for the suitability of deep learning for continuous emotion recognition in the context of public speaking.


Tzu-hsien Huang, Jheng-hao Lin, Hung-yi Lee

Voice conversion technologies have been greatly improved in recent years with the help of deep learning, but their capability of producing natural sounding utterances in different conditions remains unclear. In this paper, we present a thorough study of the robustness of known VC models. We also modified these models, for example by replacing the speaker embeddings, to further improve their performance. We found that the sampling rate and audio duration greatly influence voice conversion. All the VC models suffer from unseen data, but AdaIN-VC is relatively more robust. Also, jointly trained speaker embeddings are more suitable for voice conversion than those trained on speaker identification.


Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig

End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers. Specifically, recurrent neural network transducer (RNN-T) has shown competitive ASR performance on various benchmarks. In this work, we examine ways in which RNN-T can achieve better ASR accuracy via performing auxiliary tasks. We propose (i) using the same auxiliary task as primary RNN-T ASR task, and (ii) performing context-dependent graphemic state prediction as in conventional hybrid modeling. In transcribing social media videos with varying training data size, we first evaluate the streaming ASR performance on three languages: Romanian, Turkish and German. We find that both proposed methods provide consistent improvements. Next, we observe that both auxiliary tasks demonstrate efficacy in learning deep transformer encoders for RNN-T criterion, thus achieving competitive results - 2.0%/4.2% WER on LibriSpeech test-clean/other - as compared to prior top performing models.


Heyang Xue, Shan Yang, Yi Lei, Lei Xie, Xiulin Li

Singing voice synthesis has received increasing attention with the rapid development of the speech synthesis area. In general, a studio-level singing corpus is necessary to produce a natural singing voice from lyrics and music-related transcription. However, such a corpus is difficult to collect since it is hard for many of us to sing like a professional singer. In this paper, we propose an approach, Learn2Sing, that only needs a singing teacher to generate the target speakers' singing voice without their singing voice data. In our approach, a teacher's singing corpus and speech from multiple target speakers are trained in a frame-level auto-regressive acoustic model where singing and speaking share the common speaker embedding and style tag embedding. Meanwhile, since there is no music-related transcription for the target speaker, we use log-scale fundamental frequency (LF0) as an auxiliary input feature of the acoustic model to build a unified input representation. In order to enable the target speaker to sing without singing reference audio in the inference stage, a duration model and an LF0 prediction model are also trained. In particular, we employ domain adversarial training (DAT) in the acoustic model, which aims to enhance the singing performance of target speakers by disentangling style from acoustic features of singing and speaking data. Our experiments indicate that the proposed approach is capable of synthesizing singing voice for a target speaker given only their speech samples.

1279: Unsupervised Acoustic-To-Articulatory Inversion Neural Network Learning Based on Deterministic Policy Gradient

Hayato Shibata, Mingxin Zhang, Takahiro Shinozaki

This paper presents an unsupervised learning method of deep neural networks that perform acoustic-to-articulatory inversion for arbitrary utterances. Conventional unsupervised acoustic-to-articulatory inversion methods are based on the analysis-by-synthesis approach and non-linear optimization algorithms. One limitation is that they require time-consuming iterative optimizations to obtain articulatory parameters for a given target speech segment. Neural networks, after learning their relationship, can obtain these articulatory parameters without an iterative optimization. However, conventional methods need supervised learning and paired acoustic and articulatory samples. We propose a hybrid auto-encoder based unsupervised learning framework for the acoustic-to-articulatory inversion neural networks that can capture context information. The essential point of the framework is making the training effective. We investigate several reinforcement learning algorithms and show the usefulness of the deterministic policy gradient. Experimental results demonstrate that the proposed method can infer articulatory parameters not only for training set segments but also for unseen utterances. Averaged reconstruction errors achieved for open test samples are similar to or even lower than the conventional method that directly optimizes the articulatory parameters in a closed condition.


Copyright © 2019-2021. SLT2021 Organizing Committee. All rights reserved.