A team of computer scientists at the University of Bristol has won an international competition on unravelling the structure of words by means of machine learning methods. The team, consisting of PhD students Sebastian Spiegler and Bruno Golenia and their advisor Prof Peter Flach, submitted several entries. One of these, developed by Sebastian Spiegler, won the competition in five out of six categories: Arabic (with and without vowels), Finnish, German and Turkish. It was beaten only in English, whose words are generally much shorter.
The Morpho Challenge is an annual scientific contest held as part of the PASCAL Network of Excellence funded by the European Union. The goal of the challenge is the advancement of machine learning algorithms that discover meaningful vocabulary units in words, called morphemes, to describe the underlying phenomena of word construction in natural languages. These units are used in many different domains of speech and language technology, including text-to-speech systems, automatic speech recognition, machine translation, information retrieval and spell-checkers.
The research into machine learning methods for morphology is carried out as part of the EPSRC-funded project "Learning the morphology of complex synthetic languages", which focuses on indigenous and under-resourced languages for which little labelled data and few financial resources are available. In collaboration with the Meraka Institute in Pretoria and the University of the Witwatersrand in Johannesburg, the results of the project will be used to improve text-to-speech systems for indigenous South African languages such as Zulu and Xhosa. Automatic machine learning methods offer great promise for such languages, which attract too little commercial interest to justify developing handcrafted morphological dictionaries.
Specific methods applied by the team include sophisticated statistical models, as well as computationally intensive algorithms that necessitate the use of BlueCrystal, the University's state-of-the-art high-performance computing cluster. The work forms part of the Exabyte Informatics research theme, which deals with the challenges and opportunities of data-intensive computing.