A Hybrid Approach to Word Segmentation of Vietnamese Texts





1 KIWI - Knowledge Information and Web Intelligence, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
2 MIM - Faculté de Mathématiques, Mécanique et Informatique
3 MSI - Modélisation et Simulation Informatique de systèmes complexes

Abstract: We present a hybrid approach to automatically tokenizing Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy, augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then used to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
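The pipeline described in the abstract can be illustrated with a small Python sketch. This is not the vnTokenizer implementation; it rests on several assumptions made only for illustration: the lexicon is a plain set of strings rather than a minimal finite-state automaton, the candidate segmentations are taken to be the paths with the fewest words through the linear graph, and the bigram model uses add-one smoothing as a stand-in for whatever smoothing the authors chose. All function names and the toy counts are hypothetical.

```python
import math
from collections import Counter

def build_graph(syllables, lexicon, max_len=4):
    """Linear graph over syllable positions: an edge (i, j) means that
    syllables[i:j] is a lexicon word, or a single syllable (always allowed)."""
    n = len(syllables)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            if j == i + 1 or " ".join(syllables[i:j]) in lexicon:
                edges[i].append(j)
    return edges

def shortest_segmentations(syllables, lexicon):
    """Enumerate every path from node 0 to node n that uses the fewest words."""
    n = len(syllables)
    edges = build_graph(syllables, lexicon)
    # dist[i] = minimum number of words needed to cover syllables[i:]
    dist = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        dist[i] = 1 + min(dist[j] for j in edges[i])

    def expand(i):
        if i == n:
            yield []
            return
        for j in edges[i]:
            if dist[j] == dist[i] - 1:  # stay on a shortest path
                for rest in expand(j):
                    yield [" ".join(syllables[i:j])] + rest

    return list(expand(0))

def bigram_log_prob(words, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability (assumed smoothing scheme)."""
    score, prev = 0.0, "<s>"
    for w in words:
        score += math.log((bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size))
        prev = w
    return score

def segment(text, lexicon, unigrams, bigrams, vocab_size):
    """Pick the candidate segmentation with the highest bigram score."""
    candidates = shortest_segmentations(text.split(), lexicon)
    return max(candidates,
               key=lambda c: bigram_log_prob(c, unigrams, bigrams, vocab_size))

# Toy data, invented for illustration only.
lexicon = {"học sinh", "sinh học"}
unigrams = Counter({"<s>": 3, "học sinh": 3, "học": 2, "sinh học": 2})
bigrams = Counter({("<s>", "học sinh"): 3,
                   ("học sinh", "học"): 2,
                   ("học", "sinh học"): 2})

print(segment("học sinh học sinh học", lexicon, unigrams, bigrams, vocab_size=3))
```

On the phrase "học sinh học sinh học", the graph admits three segmentations of three words each, so the shortest-path step alone cannot decide between them; with these toy counts the bigram scores do. A real system would replace the set lookup with an automaton traversal and estimate the counts from a segmented corpus.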





Authors: Hong Phuong Le, Thi Minh Huyen Nguyen, Azim Roussanaly, Tuong Vinh Ho

Source: https://hal.archives-ouvertes.fr/






