Title: Elementary discourse unit segmentation for Vietnamese texts
Authors: Chinh Trong Nguyen; Dang Tuan Nguyen
Addresses: University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam; Saigon University, Ho Chi Minh City, Vietnam
Abstract: Elementary discourse unit (EDU) segmentation is an important problem in the discourse analysis of text. In Vietnam, no tool or model has yet been officially published to solve this problem. We therefore propose a solution. Our approach applies a sequential labelling method to identify the beginning of each EDU in a sentence. For sequential labelling, we use a deep neural network architecture that combines a BERT model, which generates word feature vectors as a transfer learning approach, with a feed-forward neural network that identifies the tag of every word. To build the model, we automatically constructed an EDU segmentation dataset from the Vietnamese constituent treebank NIIVTB and used this dataset to fine-tune the pretrained PhoBERT model. The results show that our EDU segmentation model achieves a span-based F1 score of 0.8, which is sufficient for use in practical tasks.
Keywords: EDU segmentation; sequential labelling; BERT; transfer learning.
DOI: 10.1504/IJIIDS.2022.124090
International Journal of Intelligent Information and Database Systems, 2022 Vol.15 No.3, pp.249 - 266
Received: 10 Feb 2021
Accepted: 17 May 2021
Published online: 12 Jul 2022
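
Illustrative sketch: the abstract describes tagging each word to mark where an EDU begins and then scoring the result with a span-based F1. The Python snippet below is a minimal sketch of that evaluation idea, not the authors' implementation. The tag names "B" (word begins an EDU) and "I" (word continues the current EDU), the helper names tags_to_spans and span_f1, and the exact-match criterion for counting a span as correct are all assumptions made for illustration; the paper may define these details differently.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) word indices, end exclusive


def tags_to_spans(tags: List[str]) -> Set[Span]:
    """Convert per-word tags into EDU spans.

    Assumed tag scheme (hypothetical): "B" begins an EDU, anything else
    continues the current one. The first word always opens an EDU.
    """
    spans: Set[Span] = set()
    start = 0
    for i, tag in enumerate(tags):
        # A "B" after the first word closes the previous EDU.
        if tag == "B" and i > start:
            spans.add((start, i))
            start = i
    if tags:
        spans.add((start, len(tags)))  # close the final EDU
    return spans


def span_f1(gold_tags: List[str], pred_tags: List[str]) -> float:
    """Span-based F1: a predicted EDU counts only if both boundaries match."""
    gold = tags_to_spans(gold_tags)
    pred = tags_to_spans(pred_tags)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy sentence of seven words: three gold EDUs, two predicted EDUs.
gold = ["B", "I", "I", "B", "I", "B", "I"]
pred = ["B", "I", "I", "B", "I", "I", "I"]
score = span_f1(gold, pred)
```

In this toy case only the first EDU matches exactly, so precision is 1/2 and recall is 1/3. Missing one boundary merges two gold EDUs into a single wrong span, which is why a span-based score is stricter than per-word tag accuracy.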