Conference papers

Optimizing Word Alignments with Better Subword Tokenization

Anh Khoa Ngo Ho 1, François Yvon 1
1 TLP - Traitement du Langage Parlé
LISN - Laboratoire Interdisciplinaire des Sciences du Numérique, STL - Sciences et Technologies des Langues
Abstract : Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for example, to train statistical machine translation systems, to learn bilingual dictionaries, or to perform quality estimation. Subword tokenization has become a standard preprocessing step for a large number of applications, notably for state-of-the-art open vocabulary machine translation systems. In this paper, we thoroughly study how this preprocessing step interacts with the word alignment task and propose several tokenization strategies to obtain well-segmented parallel corpora. Using these new techniques, we were able to improve baseline word-based alignment models for six language pairs.

https://hal.archives-ouvertes.fr/hal-03322842
Contributor : Anh Khoa NGO HO
Submitted on : Thursday, August 19, 2021 - 7:09:54 PM
Last modification on : Sunday, June 26, 2022 - 3:12:01 AM
Long-term archiving on : Saturday, November 20, 2021 - 7:19:54 PM

Identifiers

  • HAL Id : hal-03322842, version 1

Citation

Anh Khoa Ngo Ho, François Yvon. Optimizing Word Alignments with Better Subword Tokenization. The 18th biennial conference of the International Association of Machine Translation, Aug 2021, Miami (virtual), United States. ⟨hal-03322842⟩
