Deep bi-directional LSTM network with CNN features for human emotion recognition in audio-video signals
Online publication date: Thu, 24-Feb-2022
by Lovejit Singh
International Journal of Swarm Intelligence (IJSI), Vol. 7, No. 1, 2022
Abstract: Human emotion detection in audio-video signals is a challenging task. This paper proposes a human emotion detection method based on a deep bi-directional long short-term memory (Bi-LSTM) network with convolutional neural network (CNN) features. First, it uses a pre-trained Inception-ResNet V2 model, via transfer learning, to extract CNN features from the audio and video modalities. The sequential information in the frame-wise CNN features is then learned by two separate Bi-LSTM models, one for the audio channel and one for the video channel. A weighted product rule-based decision-level fusion method computes the final confidence scores from the output probabilities of the two independent Bi-LSTM models. The proposed approach is validated, tested, and compared with existing deep learning-based audio-video emotion detection methods on the challenging Ryerson audio-visual database of emotional speech and song (RAVDESS). The experimental results show that the proposed approach outperforms the existing methods, attaining 81.03% validation and 83.98% testing emotion detection accuracy on the RAVDESS dataset.
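The abstract outlines a three-stage pipeline: CNN feature extraction, per-modality Bi-LSTM classification, and product-rule fusion. Below is a minimal sketch of that pipeline in Keras/TensorFlow. The hidden size (128), sequence length, fusion weight, and the use of spectrogram images for the audio branch are illustrative assumptions, not values taken from the paper; only the Inception-ResNet V2 backbone, the two Bi-LSTM branches, and the weighted product rule come from the abstract.

```python
# Sketch of the described pipeline: Inception-ResNet V2 features ->
# two Bi-LSTM branches -> weighted product-rule fusion. Sizes and the
# fusion weight are assumptions for illustration.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 8   # RAVDESS defines eight emotion categories
SEQ_LEN = 30       # assumed number of frames sampled per clip
FEAT_DIM = 1536    # Inception-ResNet V2 global-average-pooled feature size

# Pre-trained CNN feature extractor (transfer learning). In practice the
# features would be extracted offline for each frame (video frames, and
# e.g. spectrogram images for audio) and stacked into sequences.
cnn = tf.keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet", pooling="avg")

def make_bilstm_branch() -> models.Model:
    """One Bi-LSTM classifier over a sequence of frame-wise CNN features."""
    inp = layers.Input(shape=(SEQ_LEN, FEAT_DIM))
    x = layers.Bidirectional(layers.LSTM(128))(inp)   # assumed hidden size
    out = layers.Dense(NUM_EMOTIONS, activation="softmax")(x)
    return models.Model(inp, out)

audio_branch = make_bilstm_branch()  # consumes CNN features of audio frames
video_branch = make_bilstm_branch()  # consumes CNN features of video frames

def fuse_product_rule(p_audio, p_video, w=0.5):
    """Weighted product rule over the two branches' class probabilities."""
    scores = (p_audio ** w) * (p_video ** (1.0 - w))
    return scores / scores.sum(axis=-1, keepdims=True)  # renormalise

# Usage example with dummy feature sequences for a single clip:
p_a = audio_branch.predict(np.random.rand(1, SEQ_LEN, FEAT_DIM), verbose=0)
p_v = video_branch.predict(np.random.rand(1, SEQ_LEN, FEAT_DIM), verbose=0)
final = fuse_product_rule(p_a, p_v, w=0.6)  # w=0.6 is an arbitrary choice
print("predicted emotion class:", int(final.argmax()))
```

Decision-level fusion of this kind keeps the two modalities independent until the final scores, so each branch can be trained and tuned separately; the weight w lets one modality dominate when it is the more reliable cue.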