ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis Online publication date: Mon, 19-Jun-2023
by Tushar H. Ghorpade; Subhash K. Shinde
International Journal of Intelligent Systems Technologies and Applications (IJISTA), Vol. 21, No. 2, 2023
Abstract: The current growth in e-content is driven by information exchanged through social media, e-news, and similar channels. Several researchers have proposed encoder-decoder models with impressive accuracy. This paper exploits feature extraction from images and text for the encoder model, using a word embedding method with proposed convolutional layers. State-of-the-art image-to-text and text-to-speech (ITTS) systems learn their models separately: one describes the content of an image, and the other follows with speech generation. We adopted the Tacotron model for natural-sounding speech and trained on the most popular datasets. The system is evaluated consistently using metrics such as bilingual evaluation understudy (BLEU), METEOR, and mean opinion score (MOS). The proposed method significantly enhances performance and achieves competitive results against standard image captioning and speech generation models. The results show an improvement of almost 4% to 5% in BLEU score for the image captioning model, and a MOS of approximately 3.73 for the speech model.
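The abstract reports caption quality in BLEU. As a minimal, dependency-free sketch of how sentence-level BLEU is typically computed (modified n-gram precision with a brevity penalty; the single-reference setup, uniform weights, and smoothing constant here are illustrative assumptions, not details taken from the paper):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty.
    Single reference; tiny epsilon smoothing avoids log(0)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short candidates
    c_len, r_len = len(candidate), len(reference)
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / max(c_len, 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, a candidate identical to its reference scores 1.0, while a short two-word candidate against a six-word reference is penalised both by missing n-grams and by the brevity penalty. Production evaluations would normally use corpus-level BLEU over many caption/reference pairs rather than this per-sentence form.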