15 December 2021
Recently, end-to-end architectures have become influential in automatic speech recognition (ASR), as they directly map acoustic inputs to character or word sequences and simplify the training procedure. Many end-to-end architectures have been proposed, for instance Connectionist Temporal Classification (CTC), Sequence Transduction with Recurrent Neural Networks (RNN-T), and attention-based encoder-decoders. These models have achieved significant success on a variety of benchmarks and have even reached human-level performance on some tasks.
However, despite these advanced deep neural network architectures, the performance of ASR systems degrades significantly in adverse environments because of environmental noise and ambient reverberation. This thesis contributes to the robustness of ASR systems by leveraging additional visual sequences, face information, and domain knowledge. We achieve significant improvements on speech reconstruction, speech separation, end-to-end modeling, and out-of-vocabulary (OOV) word recognition tasks.
To participate in the PhD defense, please contact the Informatics Study Office via email.
The Zoom link will then be emailed to you shortly before the meeting.