
UNGRADE: UNsupervised GRAph DEcomposition

Bruno Golenia, Sebastian Spiegler, Peter Flach. UNGRADE: UNsupervised GRAph DEcomposition. Working Notes for the CLEF 2009 Workshop, Corfu, Greece, September 2009.

Abstract

This article presents UNGRADE (UNsupervised GRAph DEcomposition), an unsupervised algorithm for word decomposition that can segment a word list in any language. UNGRADE assumes that each word has the structure prefixes, stem, suffixes, with no limit on the number of prefixes or suffixes. The algorithm is language independent and works in three steps. First, a pseudo-stem is found for each word using a window based on Minimum Description Length. Second, prefix sequences and suffix sequences are found independently using a graph algorithm called graph-based unsupervised sequence segmentation. Finally, the morphemes from the previous steps are joined to produce a segmented word list. Because we focus purely on the segmentation of words, we use a trivial labeling method: each morpheme is labeled with its own segment. UNGRADE is applied to five languages (English, German, Finnish, Turkish and Arabic), and results are reported against their gold standards.
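The final joining step and the trivial labeling can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `join_segments` and `format_analysis`, the tab-separated output layout, and the example segmentation of "unhelpfulness" are all assumptions made for illustration. The pseudo-stem and the prefix and suffix sequences are taken as given inputs, since the abstract does not specify how the first two steps compute them.

```python
from typing import List, Tuple


def join_segments(prefixes: List[str], stem: str,
                  suffixes: List[str]) -> List[Tuple[str, str]]:
    """Join prefixes, stem and suffixes into one analysis.

    Hypothetical helper: each morpheme is paired with a label that is
    simply the morpheme's own segment (the trivial labeling).
    """
    morphemes = [*prefixes, stem, *suffixes]
    # Skip empty strings so a word without affixes yields only its stem.
    return [(m, m) for m in morphemes if m]


def format_analysis(word: str, analysis: List[Tuple[str, str]]) -> str:
    """Render one line of a segmented word list (assumed layout:
    word, a tab, then the space-separated morpheme labels)."""
    return word + "\t" + " ".join(label for _, label in analysis)


# Assumed example input: the outputs of steps one and two are made up here.
analysis = join_segments(["un"], "help", ["ful", "ness"])
line = format_analysis("unhelpfulness", analysis)
print(line)  # prints "unhelpfulness", a tab, then "un help ful ness"
```

Because the model places no limit on the number of prefixes or suffixes, both are passed as lists of arbitrary length, and either list may be empty.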
