Title: Handling imbalanced resources and loanwords in Vietnamese-Bahnaric neural machine translation

Authors: Long-Ngo-Hoang Bui; Huu-Thien-Phu Nguyen; Minh-Khoi Le; Cong-Thien Pham; Thanh-Tho Quan

Addresses (all authors): Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam; Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam

Abstract: Machine translation is a crucial application. Recent deep learning (DL) architectures have enabled neural machine translation (NMT) to achieve significant milestones, bridging the gap between human and machine translation. However, NMT still faces challenges with the extremely low-resource languages of ethnic groups, e.g., Bahnaric in Vietnam. The challenges stem from the imbalance of language resources relative to the target language, which also causes loanwords to occur frequently in the target language. In this paper, we propose a novel solution for handling the resource-scarcity problem in NMT, inspired by work on incorporating contextual embeddings from pre-trained language models in BERT-fused NMT. We combine both solutions into one model that effectively handles imbalanced-resource and loanword scenarios. Experimental results on the Vietnamese-Bahnaric pair show the model's effectiveness: it outperforms the state-of-the-art BERT-fused NMT by more than five BLEU points.

Keywords: machine translation; imbalanced resources; contextual embedding; loanwords.

DOI: 10.1504/IJIIDS.2024.141776

International Journal of Intelligent Information and Database Systems, 2024 Vol.16 No.4, pp.451 - 472

Received: 06 Oct 2023
Accepted: 25 Apr 2024

Published online: 01 Oct 2024
