Title: Chinese and Vietnamese cross-lingual topic discovery based on word similarity of comparable corpus
Authors: Zhengtao Yu; Linjie Xia; Peili Tang; Xiaocong Wang; Shengxiang Gao
Addresses (all five authors): Department of Information Engineering and Automation, Kunming University of Science and Technology, No. 727, Jingming South Road, Kunming, Yunnan Province, China
Abstract: To address the scarcity of Chinese-Vietnamese comparable corpora and the limited scale of bilingual dictionaries, we propose a cross-language topic discovery method based on word similarity between Chinese and Vietnamese. First, we train word vectors representing the bilingual texts on a Chinese-Vietnamese comparable corpus and calculate the similarity between Chinese query words and Vietnamese words. Next, we select the Vietnamese words that are similar to the Chinese words as expansion words. A Chinese-Vietnamese translation model is then constructed from these word similarities and used to look up Vietnamese words and return the related Vietnamese documents. Finally, the AP algorithm is used to obtain the Vietnamese documents related to the Chinese text. Experimental results show that the proposed method achieves good accuracy and recall.
Keywords: Chinese and Vietnamese; cross-language topic discovery; comparable corpus; word similarity; cross-language query translation model.
DOI: 10.1504/IJICT.2024.139861
International Journal of Information and Communication Technology, 2024 Vol.25 No.1, pp.35-47
Received: 29 Jan 2021
Accepted: 27 Apr 2022
Published online: 08 Jul 2024
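
Illustrative note: the similarity and word-expansion step described in the abstract can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. It assumes cosine similarity as the similarity measure (the abstract does not name one), and the function name `top_k_similar`, the toy three-dimensional vectors, and the example Vietnamese words are all hypothetical. The later steps (translation model, document retrieval, AP algorithm) are not covered.

```python
import numpy as np

def top_k_similar(query_vec, candidate_vecs, candidate_words, k=2):
    """Rank candidate words by cosine similarity to the query vector.

    Cosine similarity is an assumption here; the abstract only says that
    similarity between Chinese and Vietnamese words is calculated.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                      # one cosine score per candidate word
    order = np.argsort(-sims)[:k]     # indices of the k highest scores
    return [(candidate_words[i], float(sims[i])) for i in order]

# Toy vectors invented for illustration; real vectors would come from
# word embeddings trained on the Chinese-Vietnamese comparable corpus.
zh_query = np.array([1.0, 0.0, 0.0])  # stands in for a Chinese query word
vi_words = ["kinh tế", "thể thao", "tài chính"]
vi_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.0, 0.7],
])

# Vietnamese words selected as expansion words for the Chinese query.
expansions = top_k_similar(zh_query, vi_vecs, vi_words, k=2)
```

In this toy setup the two candidates whose vectors point closest to the query direction are returned, along with their scores, which could then serve as weights when building a translation table.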