Speech is a central medium for artificial intelligence. It is the most natural form of human communication, carrying information about lexical content, meaning, speaker identity, emotion, and other properties. At the UT Dallas Speech & Machine Learning Lab, we build novel AI-based algorithms for speech applications.
Artificial Intelligence and Machine Learning
Machine learning is a branch of artificial intelligence (AI) that uses data and algorithms to imitate the way humans learn, gradually improving in accuracy. Humans express meaning and feeling through speech, but machines still cannot do so in a human-like manner, which motivates us to bridge the gap between speech and machine learning. The Speech & Machine Learning Lab at UT Dallas conducts interdisciplinary research spanning computational linguistics, speech processing, and deep learning methodology, and builds cutting-edge neural models for speech processing.
Expressive Speech Synthesis
Speech synthesis is a core problem in artificial intelligence. The Speech & Machine Learning Lab studies novel algorithms for text-to-speech (TTS) synthesis, focusing on expressive rendering, prosodic quality, emotion, accent, and multilingual synthesis. We have contributed to and participated in a number of international technology evaluations: our team achieved top scores in the ZeroSpeech Challenge "TTS without T" at INTERSPEECH 2019 and delivered tutorials on speech synthesis and voice conversion at APSIPA ASC 2020. We publish in leading journals and conferences, including IEEE Transactions on Affective Computing, Neural Networks, IEEE/ACM Transactions on Audio, Speech and Language Processing, IEEE Signal Processing Letters, Speech Communication, ASRU, INTERSPEECH, and ICASSP.
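As a rough illustration of the two-stage pipeline behind most modern TTS systems (text to acoustic features, then acoustic features to waveform), the toy sketch below replaces both neural stages with trivial stand-ins. The embedding table, fixed frame count per character, and sinusoidal "vocoder" are all illustrative assumptions, not our lab's models.

```python
import numpy as np

# Toy two-stage TTS pipeline: text -> acoustic features -> waveform.
# Every component here is a stand-in for a neural network used in practice.

rng = np.random.default_rng(0)

VOCAB = "abcdefghijklmnopqrstuvwxyz "
N_MELS = 8            # toy acoustic feature dimension
FRAMES_PER_CHAR = 4   # a real model predicts per-phone durations instead
HOP = 200             # samples generated per feature frame
SR = 16000            # sample rate in Hz

# "Acoustic model": a fixed random embedding per character, repeated
# to mimic a duration model's upsampling of features over time.
embed = rng.normal(size=(len(VOCAB), N_MELS))

def text_to_features(text):
    """Map text to a (num_frames, N_MELS) feature matrix."""
    ids = [VOCAB.index(c) for c in text.lower() if c in VOCAB]
    return np.repeat(embed[ids], FRAMES_PER_CHAR, axis=0)

def features_to_waveform(frames):
    """'Vocoder': render each frame as HOP samples of summed sinusoids,
    with amplitudes taken from the frame's feature vector."""
    t = np.arange(HOP) / SR
    freqs = 200.0 * (1 + np.arange(N_MELS))  # toy frequency bins
    chunks = [(f[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(0)
              for f in frames]
    return np.concatenate(chunks)

wave = features_to_waveform(text_to_features("hello world"))
print(wave.shape)  # one HOP-sample chunk per feature frame
```

In a real system, the acoustic model and vocoder are trained networks, and expressiveness comes from conditioning the acoustic model on prosody, emotion, or speaker representations rather than from a fixed lookup table.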
Expressive Voice Conversion
Voice conversion (VC) studies how to convert one speaker's voice to sound like that of another without changing the linguistic content. It belongs to the broader field of speech synthesis, which converts text to speech or changes properties of speech such as voice identity, emotion, and accent. Voice conversion involves multiple speech processing techniques, including speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. Stewart, a pioneer in speech synthesis, observed in 1922 that "the difficult problem involved in the artificial production of speech-sounds is not the making of a device which shall produce speech, but in the manipulation of the apparatus." Because voice conversion focuses on the manipulation of voice identity in speech, it remains one of the challenging research problems in speech processing.
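Spectral conversion, one of the techniques listed above, can be illustrated in miniature: given time-aligned source and target frame pairs, learn a frame-to-frame mapping and apply it to new source frames. The sketch below uses a least-squares linear map on synthetic data as a stand-in for the GMM- or neural-network-based mappings used in practice; the dimensions and data are illustrative assumptions.

```python
import numpy as np

# Minimal frame-wise spectral conversion on parallel data: learn a linear
# map W from source spectra to target spectra by least squares. Real
# systems use GMMs or neural networks, but the core idea, a learned
# frame-to-frame mapping, is the same. All data here is synthetic.

rng = np.random.default_rng(0)
DIM = 24   # e.g. order of mel-cepstral coefficients per frame
N = 500    # number of aligned frame pairs (aligned via DTW in practice)

X = rng.normal(size=(N, DIM))                       # source speaker frames
W_true = rng.normal(size=(DIM, DIM))                # hidden "true" mapping
Y = X @ W_true + 0.01 * rng.normal(size=(N, DIM))   # target speaker frames

# Training: solve min_W ||X W - Y||^2 in closed form.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Conversion: apply the learned map to unseen source frames.
X_new = rng.normal(size=(10, DIM))
Y_hat = X_new @ W

# Error against the hidden mapping should be near the noise floor.
err = float(np.mean((X_new @ W_true - Y_hat) ** 2))
print(err)
```

The converted frames would then be passed to a vocoder, together with converted prosody, to reconstruct the target-sounding waveform.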
Emotion Understanding and Generation
Human speech is highly emotional in nature. At the SML Lab, we aim for a deep understanding of the aspects of speech communication related to emotion. Learning emotional prosody implicitly is challenging due to the subjective nature of emotions and the hierarchical structure of speech; we aim to address this challenge with the generalization power of deep learning. We work on deep learning frameworks, in both voice conversion and text-to-speech, that are capable of learning emotional prosody and its associated intricacies, and we develop controllable, robust, and adaptable emotional speech synthesis models for both single-speaker and multi-speaker scenarios.
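Emotional prosody modeling builds on low-level prosodic contours such as fundamental frequency (F0) and energy. As a concrete, hedged illustration, the sketch below extracts both contours from a synthetic tone using autocorrelation-based pitch estimation; this hand-crafted front end is a stand-in for the learned representations used in deep models, and all frame sizes and ranges are illustrative choices.

```python
import numpy as np

# Extract simple prosodic cues (F0 and energy contours) frame by frame.
SR = 16000   # sample rate in Hz
FRAME = 400  # 25 ms analysis window
HOP = 160    # 10 ms hop between frames

def frame_f0(frame, sr=SR, fmin=60, fmax=400):
    """Estimate F0 of one frame by picking the autocorrelation peak
    within the plausible pitch range [fmin, fmax]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def prosody(signal):
    """Return per-frame F0 (Hz) and RMS energy contours."""
    f0s, energies = [], []
    for start in range(0, len(signal) - FRAME + 1, HOP):
        frame = signal[start:start + FRAME]
        f0s.append(frame_f0(frame))
        energies.append(float(np.sqrt(np.mean(frame ** 2))))
    return np.array(f0s), np.array(energies)

# A steady 200 Hz tone as a stand-in for voiced speech.
t = np.arange(SR) / SR
tone = np.sin(2 * np.pi * 200.0 * t)
f0s, energies = prosody(tone)
print(round(float(f0s.mean()), 1))  # close to 200.0
```

On real speech, the interesting signal for emotion lies in how these contours vary over time (pitch range, energy dynamics, timing), which is what our deep models are designed to capture and control.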