Title: Constructing a Chinese-Vietnamese bilingual corpus from subtitle websites
Authors: Phuc-Nghi Nguyen; Phuoc Tran
Addresses: Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam; Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Abstract: In this work, we introduce a method for constructing a Chinese-Vietnamese bilingual corpus from subtitle resources. The corpus construction process involved careful curation and preprocessing of the chosen subtitle data to ensure its suitability for training and evaluating machine translation models. We applied rigorous quality control measures to enhance the reliability and relevance of the collected corpus by systematically eliminating entries that did not meet a predetermined level of correctness. We ran experiments on the collected corpus using two robust neural machine translation models. The experimental results show that the highest BLEU score achieved on the collected corpus is 22.0, much higher than that achieved on the OpenSubtitles 2016 corpus, one of the most popular subtitle corpora today. By curating a specialised corpus, we aim to contribute a valuable resource to the field of machine translation, fostering advances in the understanding and improvement of translation quality between Chinese and Vietnamese.
Keywords: Chinese-Vietnamese bilingual corpus; machine translation; Netflix.
DOI: 10.1504/IJIIDS.2024.141748
International Journal of Intelligent Information and Database Systems, 2024 Vol.16 No.4, pp.385 - 408
Received: 13 Jun 2023
Accepted: 29 Jan 2024
Published online: 01 Oct 2024