
Learning the Morphology of Complex Synthetic Languages

The objective of the project is to apply advanced machine learning techniques to learn the rules for morphological analysis of synthetic (morphologically complex) languages, providing both technology and tools that can be applied in different domains of Speech/Language Technology.

Morphological Analysis is the decomposition of words into their constituents (morphemes), with grammatical features assigned to each constituent. To take a simple example in English, the word unhappier is decomposed into the following components: un (adjectival negative prefix) + happy (adjectival stem) + er (comparative suffix). The analysis takes into account both the allowed sequence of word constituents and the changes in the orthographic shape of these constituents when they are concatenated with each other.

Most morphological phenomena in the majority of European languages can be expressed by regular expressions (finite-state techniques), like the example in the previous paragraph. This project, however, is largely concerned with structurally more complex (mostly non-European) languages, whose recursive structures require more powerful mechanisms than finite-state automata (e.g., regular expressions enhanced with longer-term variables).
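The unhappier decomposition above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's method: the three-entry lexicon, the function names surface_form and analyse, and the single spelling rule (stem-final "y" is written "i" before a suffix) are all hypothetical and hand-written here, whereas the project aims to learn such rules from data.

```python
# Toy lexicon. All entries are illustrative assumptions, not project data.
PREFIXES = {"un": "adjectival negative prefix"}
STEMS = {"happy": "adjectival stem"}
SUFFIXES = {"er": "comparative suffix"}


def surface_form(stem, suffix):
    """Apply one hand-written orthographic rule: a stem-final 'y'
    is written 'i' when a suffix follows (happy + er -> happier)."""
    if suffix and stem.endswith("y"):
        return stem[:-1] + "i"
    return stem


def analyse(word):
    """Return every (optional prefix) + stem + (optional suffix) analysis
    of `word` that the toy lexicon and the spelling rule allow."""
    analyses = []
    for prefix in [""] + list(PREFIXES):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + list(SUFFIXES):
            if suffix and not rest.endswith(suffix):
                continue
            # The part of the word left over for the stem.
            middle = rest[: len(rest) - len(suffix)]
            for stem in STEMS:
                if surface_form(stem, suffix) != middle:
                    continue
                parts = []
                if prefix:
                    parts.append((prefix, PREFIXES[prefix]))
                parts.append((stem, STEMS[stem]))
                if suffix:
                    parts.append((suffix, SUFFIXES[suffix]))
                analyses.append(parts)
    return analyses


for morpheme, feature in analyse("unhappier")[0]:
    print(morpheme, "=", feature)
```

The sketch covers both ingredients named above: the permitted order of constituents (prefix, stem, suffix) and the change in spelling at the morpheme boundary. It rejects a form such as "happyer" because the spelling rule is treated as obligatory before a suffix.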

A practical issue is the length of time required to create morphological rules by hand. The much more efficient alternative we propose is to apply machine learning techniques and train on real data. This approach makes the project extremely challenging from both the research and the application point of view.

A research goal of the project is to automatically decompose a word into its constituents by learning the rules that represent permissible sequences of word constituents and the rules that change the orthographic shape of the constituents. A large set of open problems in morphological learning currently prevents the whole set of morphological rules from being learned (e.g., the learning of complex recursive morphological structures). We have chosen Inductive Logic Programming (ILP) for training because it can easily represent complex formalisms, which in turn are easily extended with stochastic features. ILP methods can also induce rules directly from unbounded data sets such as strings, which relates annotation and training more naturally to the underlying linguistics.

The proposed research will be of great benefit for producing Text-to-Speech systems in developing African and Asian countries (practical delivery will be enabled by the partnerships and contacts forged by the Local Language Speech Technology Initiative, see www.llsti.org). The automated morphological analysis tools developed in this project will facilitate the creation of intelligible Text-to-Speech (speech synthesis) systems, which require morphological analysis (see above) for:

  1. Automatic tone assignment, which is essential for most African languages.
  2. Proper prosody, which includes stress assignment required for Russian, and phrase prediction required for most world languages including European ones.
  3. Proper letter-to-sound rules required for the Indian languages Hindi and Telugu, the Turkish language and many others.

The research will provide the technology for implementing indigenous and minority language voice services offered by mobile network providers (such as information on healthcare, jobs, agriculture, the environment, etc.). Another important application of this technology will be screen readers for blind people in many Asian and African countries.

Besides developing countries, UK-based companies that provide speech/language services for the languages of African and Asian countries, and for the foreign and minority languages spoken in the UK, will benefit from the proposed technology.

The technologies and tools obtained in this work can also be used in other Language Processing fields, such as Machine Translation, spelling checkers, automated hyphenation and information retrieval.