Unsupervised Word Decomposition with the Promodes AlgorithmSebastian Spiegler, Bruno Golenia, Peter Flach, Unsupervised Word Decomposition with the Promodes Algorithm. Chapter in Multilingual Information Access Evaluation, Lecture Notes in Computer Science. February 2010. PDF, 277 Kbytes.
We present PROMODES, an algorithm for unsupervised word decomposition, which is based on a probabilistic generative model. The model considers segment boundaries as hidden variables and includes probabilities for letter transitions within segments. For the Morpho Challenge 2009, we demonstrate three versions of PROMODES. The first one uses a simple segmentation algorithm on a subset of the data and applies maximum likelihood estimates for model parameters when decomposing words of the original language data. The second version estimates its parameters through expectation maximization (EM). A third method is a committee of unsupervised learners where learners correspond to different EM initializations. The solution is found by majority vote which decides whether to segment at a word position or not. In this paper, we describe the probabilistic model, parameter estimation and how the most likely decomposition of an input word is found. We have tested PROMODES on non-vowelized and vowelized Arabic as well as on English, Finnish, German and Turkish. All three methods achieved competitive results.