IEEE SLT 2021 Online Conference

Jan 20, Wed

You can join a session by clicking the session title. Clicking any paper title will show the abstract, and the link to paper and video.
Only users who registered for SLT 2021 can access paper, video and Zoom - to register, click here.
Session: Speech Recognition: Model Architectures

Chair: Abdelrahman Mohamed, Anton Ragni

Session: Human Computer Interaction

Chair: Qingyang Hong, Sayaka Shiota



Huahuan Zheng, Keyu An, Zhijian Ou

Paper and Video (new tab)

Neural Architecture Search (NAS), the process of automating architecture engineering, is an appealing next step to advancing end-to-end ASR, replacing expert-designed networks with learned, task-specific architectures. In contrast to early computationally demanding NAS methods, recent gradient-based NAS methods, e.g. DARTS (Differentiable ARchiTecture Search), SNAS (Stochastic NAS) and ProxylessNAS, significantly improve NAS efficiency. In this paper, we make two contributions. First, we rigorously develop an efficient NAS method via Straight-Through (ST) gradients, called ST-NAS. Basically, ST-NAS uses the loss from SNAS but uses ST to back-propagate gradients through discrete variables to optimize the loss, which is not revealed in ProxylessNAS. Using ST gradients to support sub-graph sampling is a core element in achieving efficient NAS beyond DARTS and SNAS. Second, we successfully apply ST-NAS to end-to-end ASR. Experiments over the widely benchmarked 80-hour WSJ and 300-hour Switchboard datasets show that the ST-NAS induced architectures significantly outperform the human-designed architecture across the two datasets. Strengths of ST-NAS such as architecture transferability and low computation cost in memory and time are also reported.


Muralikrishna H, Shikha Gupta, Dileep Aroor Dinesh, Padmanabhan Rajan

Paper and Video (new tab)

State-of-the-art systems for spoken language identification (LID) use i-vectors or embeddings extracted using a deep neural network (DNN) to represent the utterance. These fixed-length representations are obtained without explicitly considering the relevance of individual frame-level feature vectors in deciding the class label. In this paper, we propose a new method to represent the utterance that considers the relevance of the individual frame-level features. The proposed representation can also preserve the locally available LID-specific information in the input features to some extent. To better utilize the local-level information in the new representation, we propose a novel segment-level matching kernel based support vector machine (SVM) classifier. The proposed representation of the utterance based on the relevance of frame-level features improves the robustness of the LID system to different background noise conditions in the speech. The experiments conducted on speech with different background conditions show that the proposed approach performs better than state-of-the-art approaches on noisy speech and performs similarly to the state-of-the-art systems in clean speech conditions.


Haoyu Li, Yang Ai, Junichi Yamagishi

Paper and Video (new tab)

High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an encoder-decoder neural network to automatically enhance low-quality recordings to professional high-quality recordings. To address channel variability, we first filter out the channel characteristics from the original input audio using the encoder network with adversarial training. Next, we disentangle the channel factor from a reference audio. Conditioned on this factor, an auto-regressive decoder is then used to predict the target-environment Mel spectrogram. Finally, we apply a neural vocoder to synthesize the speech waveform. Experimental results show that the proposed system can generate a professional high-quality speech waveform when setting high-quality audio as the reference. It also improves speech enhancement performance compared with several state-of-the-art baseline systems.


1156: Transformer Based Deliberation for Two-Pass Speech Recognition

Ke Hu, Ruoming Pang, Tara Sainath, Trevor Strohman

Paper and Video (new tab)

Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate. Previous work has established that deliberation networks can be effective second-pass models. These models accept two kinds of inputs at once: encoded audio frames and the hypothesis text from the first-pass model. In this work, we explore using transformer layers instead of long short-term memory (LSTM) layers for deliberation rescoring. In transformer layers, we generalize the "encoder-decoder" attention to attend to both encoded audio and first-pass text hypotheses. The output context vectors are then combined by a merger layer. Compared to LSTM-based deliberation, our best transformer deliberation achieves a 7% relative word error rate (WER) improvement along with a 38% reduction in computation. We also compare against non-deliberation transformer rescoring, and find a 9% relative improvement.

1264: VoxLingua107: a Dataset for Spoken Language Recognition

Jörgen Valk, Tanel Alumäe

Paper and Video (new tab)

This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is used to remove segments from the database that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The size of the resulting training set (VoxLingua107) is 6628 hours (62 hours per language on the average) and it is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that using the automatically retrieved training data gives competitive results to using hand-labeled proprietary datasets. The dataset is publicly available.
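As a quick sanity check on the figures quoted in the abstract above, the per-language average can be recomputed from the stated total. The variable names are illustrative only; the two numbers are taken from the abstract.

```python
# Figures quoted in the abstract: total training hours and language count.
total_hours = 6628
num_languages = 107

# Average hours of training audio per language.
avg_hours = total_hours / num_languages
print(f"{avg_hours:.1f} hours per language on average")
```

The result rounds to the 62 hours per language stated in the abstract.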


Ying Shi, Haolin Chen, Zhiyuan Tang, Lantian Li, Dong Wang, Jiqing Han

Paper and Video (new tab)

Recently, speech enhancement (SE) based on deep speech priors has attracted much attention, as exemplified by the VAE-NMF architecture. Compared to conventional approaches that represent clean speech by shallow models such as Gaussians with a low-rank covariance, the new approach employs deep generative models to represent the clean speech, which often provides a better prior. Despite the clear advantage in theory, we argue that deep priors must be used with much caution, as the likelihood produced by a deep generative model does not always coincide with the speech quality. We designed a comprehensive study on this issue and demonstrated that, based on deep speech priors, reasonable SE performance can be achieved, but the results might be suboptimal. A careful analysis showed that this problem is deeply rooted in the disharmony between the flexibility of deep generative models and the nature of maximum-likelihood (ML) training.



Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie

Paper and Video (new tab)

Transformer models have been introduced into end-to-end speech recognition with state-of-the-art performance on various tasks owing to their superiority in modeling long-term dependencies. However, such improvements are usually obtained through the use of very large neural networks. Transformer models mainly include two submodules: position-wise feedforward layers and self-attention (SAN) layers. In this paper, to reduce the model complexity while maintaining good performance, we propose a simplified self-attention (SSAN) layer which employs FSMN memory blocks instead of projection layers to form query and key vectors for transformer-based end-to-end speech recognition. We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1 task and on internal 1000-hour and 20,000-hour large-scale Mandarin tasks. Results show that our proposed SSAN-based transformer model can achieve over 20% reduction in model parameters and 6.7% relative CER reduction on the AISHELL-1 task. With a 20% parameter reduction, our model shows no loss of recognition performance on the 20,000-hour large-scale task.

1147: Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection

Xiaosu Tong, Che-Wei Huang, Sri Harish Mallidi, Shaun Joseph, Sonal Pareek, Chander Chandak, Ariya Rastrow, Roland Maas

Paper and Video (new tab)

In this paper, we propose a streaming model to distinguish voice queries intended for a smart-home device from background speech. The proposed model consists of multiple CNN layers with residual connections, followed by a stacked LSTM architecture. The streaming capability is achieved by using unidirectional LSTM layers and a causal mean aggregation layer to form the final utterance-level prediction up to the current frame. In order to avoid redundant computation during online streaming inference, we use a caching mechanism for every convolution operation. Experimental results on a device-directed vs. non-device-directed task show that the proposed model yields an equal error rate reduction of 41% compared to our previous best model on this task. Furthermore, we show that the proposed model is able to accurately predict earlier in time compared to the attention-based models.
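The idea of causal mean aggregation mentioned in the abstract above can be illustrated with a running average: the prediction at frame t uses only frames 0 through t, so it is available while audio is still streaming. This is a minimal sketch under that reading, not the authors' layer; the function name and the toy scores are hypothetical.

```python
from itertools import accumulate


def causal_mean(frame_scores):
    """Return, for every frame t, the mean of frame_scores[0..t].

    Only past and current frames are used, so each output can be emitted
    as soon as its frame arrives.
    """
    running_sums = accumulate(frame_scores)
    return [total / (t + 1) for t, total in enumerate(running_sums)]


# Toy per-frame scores (hypothetical values).
scores = [0.2, 0.4, 0.9, 0.5]
for t, value in enumerate(causal_mean(scores)):
    print(f"frame {t}: running mean = {value:.3f}")
```

In a streaming setting the same result can be kept with two scalars, a running sum and a frame count, so the cost per frame stays constant.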


Yanpei Shi, Thomas Hain

Paper and Video (new tab)

Embedding acoustic information into fixed-length representations is of interest for a whole range of applications in speech and audio technology. Two novel unsupervised approaches to generate acoustic embeddings by modelling of acoustic context are proposed. The first approach is a contextual joint factor synthesis encoder, where the encoder in an encoder/decoder framework is trained to extract joint factors from surrounding audio frames to best generate the target output. The second approach is a contextual joint factor analysis encoder, where the encoder is trained to analyse joint factors from the source signal that correlate best with the neighbouring audio. To evaluate the effectiveness of our approaches compared to prior work, two tasks, phone classification and speaker recognition, are conducted and tested on different TIMIT data sets. Experimental results show that one of the proposed approaches outperforms phone classification baselines, yielding a classification accuracy of 74.1%. When using additional out-of-domain data for training, an additional 3% improvement can be obtained for both phone classification and speaker recognition tasks.



Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao

Paper and Video (new tab)

In this paper, we propose an end-to-end speech recognition network based on Nvidia's previous QuartzNet model. To improve model performance, we design three components: (1) a Multi-Resolution Convolution Module, which replaces the original 1D time-channel separable convolution with multi-stream convolutions, where each stream has a unique dilated stride on convolutional operations; (2) a Channel-Wise Attention Module, which calculates the attention weight of each convolutional stream by spatial channel-wise pooling; and (3) a Multi-Layer Feature Fusion Module, which reweights each convolutional block by global multi-layer feature maps. Our experiments demonstrate that the Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 data set, which outperforms the original QuartzNet and is close to the state-of-the-art result.


Fang Kang, Feiran Yang, Jun Yang

Paper and Video (new tab)

In this paper, we present a real-time blind source separation (BSS) algorithm, which unifies independent vector analysis (IVA) as a spatial model and a deep neural network (DNN) as a source model. The auxiliary-function based IVA (AuxIVA) is utilized to update the demixing matrix, and the required time-varying variance of the speech source is estimated by a DNN. The DNN can provide a more accurate source model, which then helps to optimize the spatial model. In addition, because the DNN is used to estimate the source variance instead of the source power spectrogram, the size of the DNN can be reduced significantly. Experimental results show that the joint utilization of the model-based approach and the data-driven approach provides a more efficient solution than either approach alone in terms of convergence rate and source separation performance.


Yanpei Shi, Thomas Hain

Paper and Video (new tab)

Separating different speaker properties from a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space like speech source separation, a speaker embedding de-mixing approach is proposed. The proposed approach separates different speaker properties from a two-speaker signal in embedding space. The proposed approach contains two steps. In step one, the clean speaker embeddings are learned and collected by a residual TDNN based network. In step two, the two-speaker signal and the embedding of one of the speakers are both input to a speaker embedding de-mixing network. The de-mixing network is trained to generate the embedding of the other speaker by reconstruction loss. Speaker identification accuracy and the cosine similarity score between the clean embeddings and the de-mixed embeddings are used to evaluate the quality of the obtained embeddings. Experiments are done on two kinds of data: artificially augmented two-speaker data (TIMIT) and real-world recordings of two-speaker data (MC-WSJ). Six different speaker embedding de-mixing architectures are investigated. Compared with the performance on the clean speaker embeddings, the obtained results show that one of the proposed architectures obtained close performance, reaching 96.9% identification accuracy and 0.89 cosine similarity.
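The cosine similarity score used above to compare clean and de-mixed embeddings can be computed as follows. This is a generic sketch of the standard formula; the function name and the toy embedding vectors are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values): a clean one and a de-mixed estimate.
clean = [0.8, 0.1, -0.3]
demixed = [0.7, 0.2, -0.2]
print(f"cosine similarity = {cosine_similarity(clean, demixed):.3f}")
```

A score of 1 means the two embeddings point in the same direction regardless of their magnitudes, which is why the measure suits embedding comparison.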



Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Paper and Video (new tab)

Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context increases from the lower to upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a question: for speech recognition, is a global view of the entire sequence useful for the upper self-attention encoder layers in Transformers? To investigate this, we train models with lower self-attention/upper feed-forward layers encoders on Wall Street Journal and Switchboard. Compared to baseline Transformers, no performance drop but minor gains are observed. We further developed a novel metric of the diagonality of attention matrices and found the learned diagonality indeed increases from the lower to upper encoder self-attention layers. We conclude the global view is unnecessary in training upper encoder layers.


Amit Meghanani, Anoop C. S., Ramakrishnan A. G.

Paper and Video (new tab)

In this work, we explore the effectiveness of log-Mel spectrogram and MFCC features for Alzheimer's dementia (AD) recognition on the ADReSS challenge dataset. We use three different deep neural networks (DNN) for AD recognition and mini-mental state examination (MMSE) score prediction: (i) convolutional neural network followed by a long short-term memory network (CNN-LSTM), (ii) pre-trained ResNet18 network followed by LSTM (ResNet-LSTM), and (iii) pyramidal bidirectional LSTM followed by a CNN (pBLSTM-CNN). CNN-LSTM achieves an accuracy of 64.58% with MFCC features and ResNet-LSTM achieves an accuracy of 62.5% using log-Mel spectrograms. pBLSTM-CNN and ResNet-LSTM models achieve root mean square errors (RMSE) of 5.9 and 5.98 in the MMSE score prediction, using the log-Mel spectrograms. Our results beat the baseline accuracy (62.5%) and RMSE (6.14) reported for acoustic features on the ADReSS challenge dataset. The results suggest that log-Mel spectrograms and MFCCs are effective features for the AD recognition problem when used with DNN models.

1198: Neural Mask based Multi-channel Convolutional Beamforming for Joint Dereverberation, Echo Cancellation and Denoising

Jianming Liu, Meng Yu, Yong Xu, Chao Weng, Shi-Xiong Zhang, Lianwu Chen, Dong Yu

Paper and Video (new tab)

This paper proposes a new joint optimization framework for simultaneous dereverberation, acoustic echo cancellation, and denoising, which is motivated by the recently proposed convolutional beamformer for simultaneous denoising and dereverberation. Using the echo aware mask based beamforming framework, the proposed algorithm could effectively deal with double-talk case and local inference, etc. The evaluations based on ERLE for echo only, and PESQ for double-talk demonstrate that the proposed algorithm could significantly improve the performance.



Thomas Pellegrini, Romain Zimmer, Timothée Masquelier

Paper and Video (new tab)

Deep Neural Networks (DNNs) are the current state-of-the-art models in many speech related tasks. There is a growing interest, though, in more biologically realistic, hardware friendly and energy efficient models, named Spiking Neural Networks (SNNs). Recently, it has been shown that SNNs can be trained efficiently, in a supervised manner, using backpropagation with a surrogate gradient trick. In this work, we report speech command (SC) recognition experiments using supervised SNNs. We explored the Leaky-Integrate-Fire (LIF) neuron model for this task, and show that a model comprised of stacked dilated convolution spiking layers can reach an error rate very close to standard DNNs on the Google SC v1 dataset: 5.5%, while keeping a very sparse spiking activity, below 5%, thanks to a new regularization term. We also show that modeling the leakage of the neuron membrane potential is useful, since the LIF model outperformed its non-leaky counterpart significantly.
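To make the leaky versus non-leaky distinction in the abstract above concrete, here is a textbook-style discrete-time integrate-and-fire neuron. It is only a sketch under stated assumptions: the decay factor `beta`, the threshold, the hard reset to zero, and the toy input are all hypothetical choices, and the paper's actual neuron model, convolutional layers, and surrogate-gradient training are not reproduced.

```python
def simulate_neuron(inputs, beta=0.5, threshold=1.0):
    """Simulate one integrate-and-fire neuron over a sequence of input currents.

    beta < 1.0 makes the membrane potential leak between steps (LIF);
    beta == 1.0 gives the non-leaky integrate-and-fire variant.
    Returns a list of 0/1 spikes, one per time step.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        # Decay the previous potential, then integrate the new input.
        potential = beta * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after a spike (an assumption)
        else:
            spikes.append(0)
    return spikes


inputs = [0.5, 0.5, 0.5, 0.5]
leaky = simulate_neuron(inputs, beta=0.5)
non_leaky = simulate_neuron(inputs, beta=1.0)
print("leaky spikes:    ", leaky)
print("non-leaky spikes:", non_leaky)
print("non-leaky spike rate:", sum(non_leaky) / len(non_leaky))
```

With the same input, the leaky neuron forgets part of its accumulated potential at every step and therefore fires less often, which is one simple way to see how leakage changes spiking activity.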

1174: Film Quality Prediction using Acoustic, Prosodic and Lexical Cues

Su Ji Park, Alan Rozet

Paper and Video (new tab)

In this paper, we propose using acoustic, prosodic, and lexical features to identify movie quality as a decision support tool for film producers. Using a dataset of movie trailer audio clips paired with audience ratings for the corresponding film, we trained machine learning models to predict a film's rating. We further analyze the impact of prosodic features with neural network feature importance approaches and find differing influence across genres. We finally compare acoustic, prosodic, and lexical feature variance against film rating, and find some evidence for an inverse association.


Aditya Jayasimha, Periyasamy Paramasivam

Paper and Video (new tab)

Start Point Detection (SPD) and End Point Detection (EPD) in Automatic Speech Recognition (ASR) systems are the tasks of detecting the time at which the user starts speaking and stops speaking respectively. They are crucial problems in ASR as inaccurate detection of SPD and/or EPD leads to poor ASR performance and bad user experience. The main challenge involved in SPD and EPD is accurate detection in noisy environments, especially when speech noise is significant in the background. The current approaches tend to fail to distinguish between the speech of the real user and speech in the background. In this work, we aim to improve SPD and EPD in a multi-speaker environment. We propose a novel approach that personalizes SPD and EPD to a desired user and helps improve ASR quality and latency. We combine user-specific information (i-vectors) with traditional speech features (log-mel) and build a Convolutional, Long Short-Term Memory, Deep Neural Network (CLDNN) model to achieve personalized SPD and EPD. The proposed approach achieves a relative improvement of 46.53% and 11.31% in SPD accuracy, and 27.87% and 5.31% in EPD accuracy at SNR 0 and 5 dB respectively over the standard non-personalized models.



Yuxiang Kong, Jian Wu, Quandong Wang, Peng Gao, Weiji Zhuang, Yujun Wang, Lei Xie

Paper and Video (new tab)

The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvement over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet) - a powerful complex-valued Unet-structured speech enhancement model - as the front-end of the multi-channel acoustic model, and integrate them in a multi-task learning (MTL) framework, along with a cascaded framework for comparison. Meanwhile, we investigate the proposed methods with several training strategies to improve the recognition accuracy on 1000 hours of real-world smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method brings about a 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also achieves better performance than the recently proposed neural beamforming method.

Use the room number in the parenthesis to join the individual Zoom breakout room for each paper.


Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

Paper and Video (new tab)

Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently, audio-visual target speaker extraction has been proposed, which extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers more stable performance than single-modality methods on simulated data, neither its adaptation to realistic situations nor its evaluation on real recorded mixtures has been fully explored. One of the major issues in handling realistic situations is how to make the system robust to clue corruption, because in real recordings both clues may not be equally reliable, e.g. visual clues may be affected by occlusions. In this work, we propose a novel attention mechanism for multi-modal fusion, together with training methods that enable it to effectively capture the reliability of the clues and weight the more reliable ones. Our proposals improve signal to distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data. Moreover, we also record an audio-visual dataset of simultaneous speech with realistic visual clue corruption and show that audio-visual target speaker extraction with our proposals works successfully on real data.
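The general idea of weighting clues by reliability can be sketched as a softmax-weighted sum of the audio and visual clue vectors. This is an illustration only and not the attention mechanism proposed in the paper: in a trained system the reliability scores would be produced by learned layers, whereas here they are passed in by hand, and all names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_clues(audio_clue, visual_clue, audio_score, visual_score):
    """Blend two equal-length clue vectors using softmax-normalized weights.

    A higher score means the clue is treated as more reliable.
    Returns the fused vector and the (audio, visual) weights.
    """
    w_audio, w_visual = softmax([audio_score, visual_score])
    fused = [w_audio * a + w_visual * v for a, v in zip(audio_clue, visual_clue)]
    return fused, (w_audio, w_visual)


audio = [1.0, 0.0]
visual = [0.0, 1.0]

# Both clues equally reliable: the result is a plain average.
print(fuse_clues(audio, visual, 0.0, 0.0))

# Visual clue judged unreliable (e.g. occluded): audio dominates.
print(fuse_clues(audio, visual, 0.0, -10.0))
```

Because the weights always sum to one, a corrupted clue can be suppressed smoothly instead of being dropped by a hard switch.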


1426: Dynamically Weighted Ensemble Models for Automatic Speech Recognition

Kiran Praveen, Abhishek Pandey, Deepak Kumar, Shakti Prasad Rath, Sandip Bapat

Paper and Video (new tab)

In machine learning, training multiple models for the same task and using the outputs from all the models helps reduce the variance of the combined result. Using an ensemble of models in classification tasks such as Automatic Speech Recognition (ASR) improves the accuracy across different target domains such as multiple accents, environmental conditions, and other scenarios. It is possible to select model weights for the ensemble in numerous ways. A classifier trained to identify the target domain, a simple averaging function, or an exhaustive grid search are the common approaches to obtain suitable weights. All these methods suffer either from choosing sub-optimal weights or from being computationally expensive. We propose a novel and practical method for dynamic weight selection in an ensemble, which can approximate a grid search in a time-efficient manner. We show that a combination of weights always performs better than assigning uniform weights for all models. Our algorithm can utilize a validation set if available or find weights dynamically from the input utterance itself. Experiments conducted for various ASR tasks show that the proposed method outperforms the uniformly weighted ensemble in terms of Word Error Rate (WER).
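For context, the two baselines named in the abstract above, uniform averaging and an exhaustive grid search over ensemble weights, can be sketched for a two-model ensemble as follows. This shows only those baselines on a toy classification problem, not the dynamic weight selection method the paper proposes; the data, function names, and grid resolution are hypothetical.

```python
def combine(per_model_posteriors, weights):
    """Weighted average of the class posteriors that each model gives one example."""
    num_classes = len(per_model_posteriors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, per_model_posteriors))
        for c in range(num_classes)
    ]


def accuracy(examples, labels, weights):
    """Fraction of examples whose highest combined posterior matches the label."""
    correct = 0
    for per_model, label in zip(examples, labels):
        combined = combine(per_model, weights)
        if combined.index(max(combined)) == label:
            correct += 1
    return correct / len(labels)


def grid_search_two_models(examples, labels, steps=10):
    """Try weights (w, 1 - w) for w = 0, 1/steps, ..., 1 and keep the best."""
    best_weights, best_acc = None, -1.0
    for i in range(steps + 1):
        w = i / steps
        acc = accuracy(examples, labels, (w, 1.0 - w))
        if acc > best_acc:
            best_weights, best_acc = (w, 1.0 - w), acc
    return best_weights, best_acc


# Toy validation set: each example holds [model_1_posteriors, model_2_posteriors].
examples = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.3, 0.7], [0.2, 0.8]],
]
labels = [0, 0, 1]

uniform_acc = accuracy(examples, labels, (0.5, 0.5))
best_weights, best_acc = grid_search_two_models(examples, labels)
print("uniform accuracy:", uniform_acc)
print("grid-search weights:", best_weights, "accuracy:", best_acc)
```

The grid grows exponentially with the number of models, which is the computational cost the abstract refers to when motivating a faster approximation.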


Chenda Li, Jing Shi, Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Naoyuki Kamo, Moto Hira, Tomoki Hayashi, Christoph Boeddeker, Zhuo Chen, Shinji Watanabe

Paper and Video (new tab)

We present ESPnet-SE, which is designed for the quick development of speech enhancement and speech separation systems in a single framework, along with the optional downstream speech recognition module. ESPnet-SE is a new project which integrates rich automatic speech recognition related models, resources and systems to support and validate the proposed front-end implementation (i.e. speech enhancement and separation). It is capable of processing both single-channel and multi-channel data, with various functionalities including dereverberation, denoising and source separation. We provide all-in-one recipes including data pre-processing, feature extraction, training and evaluation pipelines for a wide range of benchmark datasets. This paper describes the design of the toolkit, several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open source toolkits, and experimental results with major benchmark datasets.




Catalin Zorila, Mohan Li, Rama Doddipatla

Paper and Video (new tab)

This paper presents an investigation into the effectiveness of spatial features for improving time-domain speaker extraction systems. A two-dimensional Convolutional Neural Network (CNN) based encoder is proposed to capture the spatial information within the multichannel input, which is then combined with the spectral features of a single-channel extraction network. Two variants of target speaker extraction methods were tested, one which employs a pre-trained i-vector system to compute a speaker embedding (System A), and one which employs a jointly trained neural network to extract the embeddings directly from time-domain enrolment signals (System B). The evaluation was performed on the spatialized WSJ0-2mix dataset using the Signal-to-Distortion Ratio (SDR) metric and ASR accuracy. In the anechoic condition, more than 10 dB and 7 dB absolute SDR gains were achieved when the 2-D CNN spatial encoder features were included with Systems A and B, respectively. The performance gains in reverberation were lower; however, we have demonstrated that retraining the systems by applying dereverberation preprocessing can significantly boost both the target speaker extraction and ASR performance.
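For readers unfamiliar with the SDR metric quoted above, its simplest form is the ratio, in decibels, of reference signal energy to the energy of the estimation error. The sketch below uses that basic definition as an assumption; evaluation toolkits often use variants that allow a scaling or filtering of the reference, and the abstract does not say which variant was used. The function name and toy signals are hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Basic signal-to-distortion ratio in dB.

    SDR = 10 * log10( sum(ref^2) / sum((ref - est)^2) )
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy signals (hypothetical): the estimate is the reference scaled by 0.9.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(f"SDR = {sdr_db(reference, estimate):.2f} dB")
```

Because the scale is logarithmic, a 10 dB gain corresponds to a tenfold reduction in error energy relative to the signal.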


Copyright © 2019-2021. SLT2021 Organizing Committee. All rights reserved.