Which granularity to bootstrap a multilingual method of document alignment: character N-grams or word N-grams

1 Equipe Hultech - Laboratoire GREYC - UMR6072 GREYC - Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen

Abstract : This article tackles multilingual automatic alignment. Alignment refers to the process by which segments that are translations of one another are automatically matched. Instead of comparing only pairs of languages at sentence level, as is usually done to conform to the human translation process, the computer is used here for its capacity to infer semantic alignment from a collection of texts that are translations of the same content. The corpus contains press releases from Europa, the European Community website, available in up to 23 languages. The alignment process takes advantage of frequency similarity between the different linguistic versions of a document by computing matching features for each repeated string in all versions. This is done to find reliable anchors in the process of linking versions. The question raised is which granularity best brings out semantic equivalences when comparing two linguistic versions: character N-grams or word N-grams. Alignment systems are traditionally based on word N-gram splitting. Observing the morphological variety of languages, even inside a single linguistic family, quickly shows that word granularity is inadequate for a widely multilingual system, i.e. a language-independent system able to handle inflectional languages as well as positional languages. Instead, when starting from a multilingual collection to focus on pairs of texts, we argue that character N-gram alignment is more efficient than word N-gram alignment.
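The granularity question can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (char_ngrams, word_ngrams, repeated), the whitespace tokenization, the choice of N = 3 and the toy Latin sample are all assumptions made for illustration. It only shows why repeated strings are harder to find at word level in an inflected text than at character level.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams over the raw text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count word n-grams, using whitespace as the only word boundary."""
    words = text.split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def repeated(counts, min_count=2):
    """Keep only the units that occur at least min_count times."""
    return {unit: c for unit, c in counts.items() if c >= min_count}


# Toy inflected text (Latin declension of "rosa"), not taken from the corpus.
sample = "rosa rosae rosam rosarum"

repeated_words = repeated(word_ngrams(sample, 1))
repeated_chars = repeated(char_ngrams(sample, 3))

print(repeated_words)  # no word form occurs twice in this sample
print(repeated_chars)  # the stem "ros" is among the repeated trigrams
```

In this toy case every inflected form is a distinct word, so word-level counting yields no repeated unit to serve as an anchor, while the character trigrams still capture the shared stem. Comparing such frequency profiles across the linguistic versions of a document would be a separate step that this sketch does not cover.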

Keywords : multidocuments, alignment, matching, character N-grams based method, Corpus linguistic, Natural Language Processing, NLP, multilinguism

Authors: Charlotte Lecluze - Loïs Rigouste - Emmanuel Giguet - Nadine Lucas

Source: https://hal.archives-ouvertes.fr/

