Machine Learning to improve morphological analysis
26 September 2006
EPSRC has awarded Professor Peter Flach and Dr Ksenia Shalonova GBP 350,000 for a project studying the use of advanced machine learning techniques, such as inductive logic programming, to learn the morphology of complex synthetic languages including Russian, Turkish and isiZulu (a Bantu language spoken in South Africa). The project is due to start on 1 October 2006 and will run for 3.5 years.
This project aims to apply advanced machine learning techniques in order to learn the morphology -- the way words are formed from constituents -- of synthetic (i.e., morphologically complex) languages. This will allow improved text-to-speech systems for complex languages such as isiZulu.
Morphological analysis is the decomposition of words into their constituents (morphemes), with grammatical features assigned to each constituent. To take a simple example in English, the word unhappier is decomposed as un (adjectival negative prefix) + happy (adjectival stem) + er (comparative suffix). The analysis has to take into account both the allowed sequence of word constituents and the changes in the orthographic shape of these constituents when they are concatenated (here, the y of happy becomes i before the suffix). Most morphological phenomena in the majority of European languages can be expressed by finite-state techniques such as regular expressions. This project, however, is concerned with the structurally more complex synthetic languages (mostly non-European). These languages exhibit complex recursive morphological structures that require more powerful mechanisms than finite-state automata.
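The unhappier example can be made concrete with a small sketch. It is only an illustration of the finite, list-based style of analysis that works for simple English cases, not the method the project will use. The lexicon entries, the function names attach and analyse, and the single y-to-i spelling rule are all assumptions introduced here.

```python
# Toy lexicon: hypothetical entries chosen only for this illustration.
PREFIXES = {"un": "adjectival negative prefix"}
STEMS = {"happy": "adjectival stem", "kind": "adjectival stem"}
SUFFIXES = {"er": "comparative suffix", "est": "superlative suffix"}


def attach(stem, suffix):
    """Concatenate stem and suffix, applying one assumed spelling rule:
    a final 'y' preceded by a consonant becomes 'i' before a suffix."""
    if suffix and len(stem) > 1 and stem.endswith("y") and stem[-2] not in "aeiou":
        stem = stem[:-1] + "i"
    return stem + suffix


def analyse(word):
    """Return every analysis of word as a list of (morpheme, feature) pairs.

    Permissible sequences are fixed as: optional prefix, stem, optional suffix.
    Each candidate sequence is spelled out and compared with the input word.
    """
    analyses = []
    for prefix in ["", *PREFIXES]:
        for stem in STEMS:
            for suffix in ["", *SUFFIXES]:
                if prefix + attach(stem, suffix) != word:
                    continue
                parts = []
                if prefix:
                    parts.append((prefix, PREFIXES[prefix]))
                parts.append((stem, STEMS[stem]))
                if suffix:
                    parts.append((suffix, SUFFIXES[suffix]))
                analyses.append(parts)
    return analyses


if __name__ == "__main__":
    for morpheme, feature in analyse("unhappier")[0]:
        print(f"{morpheme} ({feature})")
```

Both kinds of rule are written by hand here: the permissible sequence of constituents and the change in orthographic shape. Enumerating a fixed prefix-stem-suffix template in this way is adequate only for the simple English case. It does not extend to the recursive structures of the languages the project targets.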
The main research goal of the project is to decompose a word into its constituents automatically, by learning both the rules that represent permissible sequences of word constituents and the rules that change the orthographic shape of those constituents. This involves tackling a set of open problems in morphological learning that currently prevent the whole set of morphological rules from being learned. We have chosen Inductive Logic Programming (ILP) for training because its logical foundations allow complex formalisms to be represented and then extended with stochastic features. ILP methods can also induce rules directly from unbounded data items such as strings, which makes annotation and training more naturally related to the underlying linguistics.
This project will be of great benefit to the production of text-to-speech systems for developing countries in Africa and Asia; practical delivery will be enabled by the partnerships and contacts forged by the Local Language Speech Technology Initiative. The automated morphological analysis tools developed in this project will facilitate the creation of intelligible text-to-speech systems, which require morphological analysis for:
(1) Automatic tone assignment, which is essential for most African languages.
(2) Proper prosody, which includes stress assignment required for Russian, and phrase prediction required for most world languages including European ones.
(3) Proper letter-to-sound rules, required for the Indian languages Hindi and Telugu, for Turkish and for many other languages.
The research will provide the technology for implementing indigenous and minority language voice services offered by mobile network providers, such as information on healthcare, jobs, agriculture and the environment.