This thesis demonstrates that machine learning can be applied in different ways to automate the analysis of morphologically complex agglutinating languages.
Firstly, the target language Zulu, an under-resourced indigenous language of South Africa, is characterised and the Ukwabelana Corpus is presented. This morphological corpus of Zulu was compiled semi-automatically in close cooperation with a linguistic expert and is the first publicly available corpus of its kind. It is described statistically and has been used for testing algorithms.
Secondly, the thesis introduces the novel evaluation metric EMMA and confirms that unsupervised morphological analysis is best assessed using a hard assignment between predicted and ground truth morpheme-label pairs.
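As an illustration of what a hard assignment means here, the following minimal sketch maps each predicted label to at most one ground-truth label so that the total overlap is maximal. It is not the EMMA implementation: the function name `best_hard_assignment`, the overlap matrix and the brute-force search are assumptions made for illustration, and the search is only practical for very small label sets.

```python
from itertools import permutations

def best_hard_assignment(overlap):
    """Find a one-to-one mapping from predicted to gold labels.

    overlap[i][j] is a (hypothetical) count of how often predicted
    label i and gold label j occur in the same words.
    Assumes len(overlap) <= len(overlap[0]); otherwise no mapping
    is found and ({}, -1) is returned.
    """
    n_pred = len(overlap)
    n_gold = len(overlap[0])
    best_map, best_score = {}, -1
    # Try every injective mapping of predicted labels onto gold labels.
    for perm in permutations(range(n_gold), n_pred):
        score = sum(overlap[i][perm[i]] for i in range(n_pred))
        if score > best_score:
            best_map, best_score = dict(enumerate(perm)), score
    return best_map, best_score

# Toy counts, invented for this example.
mapping, score = best_hard_assignment([[5, 4], [4, 0]])
print(mapping, score)
```

In the toy matrix a greedy choice would pair predicted label 0 with gold label 0 (count 5) and leave predicted label 1 with a count of 0, whereas the exhaustive one-to-one search finds the better overall pairing.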
Thirdly, the probabilistic generative model Promodes is presented for the task of word decomposition. Parameters are estimated by maximum likelihood estimation in the supervised setting and by expectation maximisation in the unsupervised setting. It is shown that calibrating the decision threshold and combining different models through a committee or an ensemble can improve results.
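The ideas of supervised maximum likelihood estimation, a decision threshold and a committee can be sketched with a deliberately simplified boundary model. This is not the Promodes model itself: conditioning on a single letter pair, the function names, the majority-vote rule and the toy English training words are all assumptions made for illustration.

```python
from collections import Counter

def train_boundary_model(segmented_words, sep="-"):
    """Relative-frequency (maximum likelihood) estimate of
    P(boundary | letter pair) from words such as "walk-ed"."""
    boundary, total = Counter(), Counter()
    for word in segmented_words:
        morphs = word.split(sep)
        letters = "".join(morphs)
        # Positions in the unsegmented word where a boundary falls.
        cuts, pos = set(), 0
        for morph in morphs[:-1]:
            pos += len(morph)
            cuts.add(pos)
        for i in range(1, len(letters)):
            pair = letters[i - 1:i + 1]
            total[pair] += 1
            if i in cuts:
                boundary[pair] += 1
    return {pair: boundary[pair] / total[pair] for pair in total}

def committee_segment(word, models, threshold=0.5, sep="-"):
    """Insert a boundary where a strict majority of models assigns a
    boundary probability above the threshold. Unseen pairs count as 0."""
    out = word[:1]
    for i in range(1, len(word)):
        pair = word[i - 1:i + 1]
        votes = sum(m.get(pair, 0.0) > threshold for m in models)
        if 2 * votes > len(models):
            out += sep
        out += word[i]
    return out

def segment(word, model, threshold=0.5, sep="-"):
    """Single-model segmentation: a committee of one."""
    return committee_segment(word, [model], threshold, sep)

# Toy training data, invented for this example.
model = train_boundary_model(["walk-ed", "talk-ed", "walk-ing", "keen"])
print(segment("milked", model, threshold=0.5))
print(segment("milked", model, threshold=0.7))
```

In this toy setting the pair "ke" carries a boundary in two of its three training occurrences, so whether "milked" is split depends on where the threshold is set, which is the kind of effect that threshold calibration addresses.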
Finally, two approaches for labelling morphemes, either in a post-processing step or simultaneously with word decomposition, are described. The former revisits the assignment between predicted and ground-truth morphemes in EMMA; the latter, called DEAP, performs deductive-abductive parsing. DEAP first induces a context-free grammar from training examples, hypothesises possible parses and then either selects the top-ranking ones or performs part-of-speech (POS) disambiguation by resorting to the syntactic context and to a morphological POS tagger.
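The first two steps named above, inducing a grammar from labelled examples and hypothesising parses for a new word, can be sketched as follows. This is not DEAP: the flat rule format (a word rewrites to a label sequence, a label rewrites to a morpheme), ranking by rule frequency, the function names and the toy English data are assumptions made for illustration, and neither the abductive step nor POS disambiguation is covered.

```python
from collections import Counter

def induce_grammar(examples):
    """Collect rules from analyses given as lists of (morpheme, label).

    word_rules counts rules of the form Word -> label sequence;
    lexicon holds rules of the form label -> morpheme, indexed by morpheme.
    """
    word_rules, lexicon = Counter(), {}
    for analysis in examples:
        word_rules[tuple(label for _, label in analysis)] += 1
        for morph, label in analysis:
            lexicon.setdefault(morph, set()).add(label)
    return word_rules, lexicon

def parses(word, word_rules, lexicon):
    """Enumerate all analyses licensed by the induced rules,
    ordered by how often their label sequence was seen in training."""
    results = []

    def extend(rest, analysis):
        if not rest:
            labels = tuple(label for _, label in analysis)
            if labels in word_rules:
                results.append((word_rules[labels], analysis))
            return
        # Try every known morpheme that is a prefix of the remainder.
        for end in range(1, len(rest) + 1):
            for label in sorted(lexicon.get(rest[:end], ())):
                extend(rest[end:], analysis + [(rest[:end], label)])

    extend(word, [])
    results.sort(key=lambda item: -item[0])
    return [analysis for _, analysis in results]

# Toy training analyses, invented for this example.
rules, lex = induce_grammar([
    [("walk", "STEM"), ("ed", "SUFFIX")],
    [("talk", "STEM"), ("ing", "SUFFIX")],
    [("walk", "STEM")],
])
print(parses("talked", rules, lex))
```

Although "talked" never occurs in the toy training data, it receives a parse because both of its morphemes and the label sequence STEM SUFFIX were observed separately.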
In the experimental part of the thesis, all algorithms are evaluated on the target language Zulu as well as on other languages including English, Finnish, German and Turkish, and compared against reference methods.