Title: Constructing a Chinese-Vietnamese bilingual corpus from subtitle websites

Authors: Phuc-Nghi Nguyen; Phuoc Tran

Addresses: Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam; Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam

Abstract: In this work, we introduce a method of constructing a Chinese-Vietnamese bilingual corpus from subtitle resources. The corpus construction process involves careful curation and preprocessing of the chosen subtitle data to ensure its suitability for training and evaluating machine translation models. We apply rigorous quality control measures to enhance the reliability and relevance of the collected corpus by systematically eliminating entries that do not meet a predetermined level of correctness. We use two robust neural machine translation models to experiment on the collected corpus. The experimental results show that the highest BLEU score achieved on the collected corpus is 22.0, much higher than that obtained on the OpenSubtitles 2016 corpus, one of the most popular subtitle corpora today. By curating a specialised corpus, we aim to contribute a valuable resource to the field of machine translation, fostering advances in the understanding and improvement of translation quality between Chinese and Vietnamese.

Keywords: Chinese-Vietnamese bilingual corpus; machine translation; Netflix.

DOI: 10.1504/IJIIDS.2024.141748

International Journal of Intelligent Information and Database Systems, 2024 Vol.16 No.4, pp.385 - 408

Received: 13 Jun 2023
Accepted: 29 Jan 2024

Published online: 01 Oct 2024
