All posts by loic

Computational challenges raised by the morphology of South African languages

South African languages present a variety of linguistic features. Let us have a quick (computational) look at the inside of words: morphology.

A small multiparallel corpus [1] of around 2000 sentences in all 11 official South African languages gives us some quantitative insight into the difficulties raised by each language. Every English sentence has been translated into the other 10 languages. This lets us compare the number of words (or rather tokens, since we also count punctuation) and the number of types, that is, the number of distinct tokens. A language with a rich morphology (for instance, with conjugation markers on verbs or case markers on nouns) should display more types. An agglutinative language in particular (where long words are formed by combining lexical items and adding grammatical morphemes) will also display fewer tokens.

Language #tokens #types
English 50k 8k
Afrikaans 55k 6k
Zulu 41k 13k
Northern Sotho 63k 5k
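
Counts like those in the table above could be obtained along these lines. This is only a minimal sketch: the tokenisation rule (a regular expression that splits off punctuation) and the case folding are assumptions, not necessarily what was used to produce the figures above.

```python
import re
from collections import Counter

# Assumed tokenisation: a word (possibly with internal hyphens or
# apostrophes), or a single punctuation mark counted as its own token.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize(text):
    """Split a sentence into lower-cased word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

def corpus_stats(sentences):
    """Return (number of tokens, number of types, type/token ratio)."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    n_tokens = sum(counts.values())
    n_types = len(counts)
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ratio

if __name__ == "__main__":
    print(corpus_stats(["The boy sees the dog.", "The dog sees the boy."]))
```

With the rounded figures from the table, the type/token ratio is roughly 0.32 for Zulu (13k/41k) against 0.08 for Northern Sotho (5k/63k).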

There is no need, I guess, to expand on English. Afrikaans, a language derived from Dutch, is the other Germanic language among the 11 official languages. Compared with Dutch, it displays a more regular morphology [2]. Part-of-speech (POS) tagging (labelling each word as a noun, verb, adjective, preposition…) works relatively well for Afrikaans.

As far as Bantu languages are concerned, I will stick for now to Nguni (represented here by Zulu) and Sotho languages (represented here by Northern Sotho).

If only because of their written form (conjunctive for the Nguni languages, disjunctive for the Sotho languages), each of these two groups poses distinctive challenges for computational processing.

Part-of-speech tagging

First, automatic tagging tasks such as POS tagging are mostly performed using machine learning techniques. These learn statistical models from data, either annotated (a number of sentences have been manually labelled) or not (raw text). Learning is made more difficult by a larger vocabulary (a greater number of distinct tokens), since more of them will be seen only once, or too few times, in the data. New data (unseen text) will also be more likely to contain unseen words. This is a general problem called “data sparsity”.
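
Two simple measurements make this sparsity concrete: the share of types seen only once in the training data, and the share of tokens in new text that were never seen in training. The sketch below assumes the text is already tokenised; the function names are illustrative only.

```python
from collections import Counter

def hapax_ratio(tokens):
    """Share of types that occur exactly once in the data."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

def oov_rate(train_tokens, test_tokens):
    """Share of test tokens whose type never appears in the training data."""
    if not test_tokens:
        return 0.0
    vocabulary = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocabulary)
    return unseen / len(test_tokens)

if __name__ == "__main__":
    train = "the boy sees the dog".split()
    test = "the dog sees the girl".split()
    print(hapax_ratio(train), oov_rate(train, test))
```

The higher both figures are for a language, the less a statistical tagger can rely on having seen a word before.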

Thus, among the four languages presented here, Zulu offers the greatest challenge: rare, long words, a situation similar to Finnish. As the table above shows, the Zulu version has fewer tokens but more types.

Creating a rule-based analyser (manually writing rules to annotate the data) might still be problematic. First, such an approach requires a lexicon to be able to parse the data. Second, there will always remain a level of ambiguity with which a rule-based approach might struggle, for lack of a quantified, data-supported method to resolve it.

Northern Sotho is expected to be easier for that reason: every morpheme is written as a separate word, instead of being glued to the lexical unit it applies to, as in Zulu. The consequence is a greater number of tokens, with fewer types: data sparsity will not hurt as much.

Word alignment

Second, such linguistic features have consequences for the task of aligning words across languages (in bilingual corpora). This task is performed automatically using statistical models, such as the well-known IBM models. These models assume small local variations around a one-to-one alignment, with words coming in the same order in the two languages.
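
To give an idea of how such models work, here is a minimal sketch of IBM Model 1, the simplest of the family. It learns word translation probabilities with expectation-maximisation and ignores word order altogether; the later models add position and distortion terms on top of it. The NULL word of the full model is left out for brevity, and the three sentence pairs are toy data made up for illustration.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(f | e) with EM (no NULL word).

    pairs: list of (english_tokens, foreign_tokens) sentence pairs.
    """
    # Uniform initialisation over all co-occurring word pairs.
    foreign_vocab = {f for _, fs in pairs for f in fs}
    t = {(f, e): 1.0 / len(foreign_vocab)
         for es, fs in pairs for f in fs for e in es}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned
        for es, fs in pairs:
            for f in fs:
                # E-step: share one unit of count among the candidate e words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalise the expected counts into probabilities.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

def align(es, fs, t):
    """For each foreign position j, pick the English position i with highest t."""
    return [(j, max(range(len(es)), key=lambda i: t.get((f, es[i]), 0.0)))
            for j, f in enumerate(fs)]

if __name__ == "__main__":
    # Toy English-Afrikaans pairs (illustrative only).
    pairs = [(["the", "duty"], ["die", "plig"]),
             (["the", "child"], ["die", "kind"]),
             (["a", "child"], ["'n", "kind"])]
    t = train_ibm1(pairs)
    print(align(["the", "duty"], ["die", "plig"], t))
```

Because the model only relies on co-occurrence counts, a word that is one token in one language and several tokens in the other has no clean one-to-one counterpart to be aligned with.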

Let us envisage the case of English-to-XX alignment.

Long-distance movement between languages, as in the case of Afrikaans double negation, hinders word alignment:

Die plig om ‘n familielid te onderhou is nie beperk tot die onderhoud van ‘n kind nie .

(the duty to support a family member is not limited to supporting a child .)

But the major issue lies in agglutination, either by composition of new lexical units (Afrikaans), or by combination of multiple grammatical morphemes around the lexical stem (Zulu and, to a much lesser extent, Northern Sotho).

The following histogram illustrates the discrepancies between the four languages. It shows the ratio of token types for each range of character length.
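
The distribution behind such a histogram could be computed as in the sketch below. The bin width of five characters is an assumption, and the input is taken to be an already tokenised text.

```python
from collections import Counter

def length_distribution(tokens, bin_width=5):
    """Share of word types falling in each range of character length.

    Returns a dict such as {"1-5": 0.4, "6-10": 0.6}, ordered by length.
    """
    types = {tok for tok in tokens if tok}  # distinct, non-empty tokens
    bins = Counter((len(word) - 1) // bin_width for word in types)
    return {
        f"{b * bin_width + 1}-{(b + 1) * bin_width}": n / len(types)
        for b, n in sorted(bins.items())
    }

if __name__ == "__main__":
    print(length_distribution("umfana ubona inja umfana".split()))
```

Running it once per language on the parallel corpus gives directly comparable profiles, since every version translates the same sentences.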


Zulu probably displays the greatest discrepancy with English. One way to improve alignment with English is to pre-process the corpus with a morphological segmentation tool, such as Morfessor [3], and thus reduce the gap between English and the other language.

Northern Sotho, because of its disjunctive writing, is expected to be more easily aligned using the algorithms mentioned earlier.

Finally, here are examples of long words (longest words in our small multiparallel corpus):

Afrikaans: ontwikkelingsfinansieringsinstansies

[ontwikkeling s finansiering s instansie s]

(institutions of financial development)

Northern Sotho: seswantšhokakaretšo

(the whole picture)

Zulu: kwayisishiyagalombili

(eight) (sic!)



[1] Eiselen, E. & Puttkammer, M. Developing text resources for ten South African languages. Proc. LREC, 2014.
[2] Comparison of Afrikaans and Dutch (Wikipedia).
[3] Morfessor 2.0: Toolkit for statistical morphological segmentation.

Technology for multilingual terminologies

As I mentioned in an earlier post, South African academia seems to be buzzing about multilingualism. In that respect, technology surely has its part to play.

Indeed, providing multilingual access to education may involve publishing learning material in the languages concerned. That may be a case for machine translation or computer-aided translation.

A lighter version of this would be to provide students with linguistic help tools, for instance a tool that provides a translation of a subject-specific phrase or “term” into the student’s language, ideally with a definition.

A first step in that direction would be to identify, for each subject, its specific lexicon. This can be learnt from a corpus, more precisely from existing learning material. Such material is likely to be available in the dominant language (in our case English and, to a lesser extent, Afrikaans), from which we may want to translate for the purpose of knowledge dissemination. We would like to learn which terms are specific to a subject, and which are more specific than others, by assigning each term a subject-specific score. Once term phrases are extracted, a basic implementation of that idea would use term frequency–inverse document frequency (see end of post). By combining the number of times a term is seen in a document (or, alternatively, in a range of documents within a given topic) with the inverse of the number of documents (or topics) containing that term, it gives an indication of the term’s relevance to the topic.
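
As a minimal sketch of that scoring, assuming terms have already been extracted and grouped by topic: the version below uses the raw count as term frequency and the logarithm of the inverse topic frequency, which is one common variant among several and not necessarily the one used in the experiments reported here. The toy data is made up for illustration.

```python
import math
from collections import Counter

def tf_idf(topic_terms):
    """Score each term for each topic.

    topic_terms: dict mapping a topic name to the list of term
    occurrences extracted from that topic's documents.
    Returns a dict: topic -> {term: score}.
    """
    n_topics = len(topic_terms)
    # Number of topics in which each term appears at least once.
    topic_freq = Counter()
    for terms in topic_terms.values():
        topic_freq.update(set(terms))
    scores = {}
    for topic, terms in topic_terms.items():
        term_freq = Counter(terms)
        scores[topic] = {
            term: count * math.log(n_topics / topic_freq[term])
            for term, count in term_freq.items()
        }
    return scores

if __name__ == "__main__":
    toy = {
        "physics": ["magnetic field", "magnetic field", "practical session"],
        "biology": ["practical session", "cell membrane"],
    }
    for term, score in sorted(tf_idf(toy)["physics"].items(),
                              key=lambda item: -item[1]):
        print(f"{score:.2f}  {term}")
```

A term that occurs in every topic gets a score of zero, however frequent it is, which is how generic phrases get pushed down the ranking.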

Snapshot from a Physics tutorial, in English.

Here is what I got for the subject of physics in one of my experiments, extracting noun phrases from a corpus of learning material.

Relevance rank Term
2 magnetic field
3 Physics Department
7 home experiment
9 electric field
10 potential difference
15 science library
21 Physics PO Box
22 simple pendulum
25 practical session

We might want to exclude some of those terms (like the trigonometric functions or the units of measurement). The term “practical session” might better be ranked lower, since it could also belong to a few other subjects, such as biology or geology, but why not also arts or engineering.

Now can we perform such an extraction of (monolingual) terms in another language, for instance Northern Sotho?

Let’s have a look at a first attempt at extracting terms from a single document, in the domain of labour law. Please note that this time, terms are ranked by their frequency in the document, not by their relevance to the domain.

Frequency rank Term
2 bala ga tlaleletšo
4 diteng tša kgaolo
7 dikgokagano tša go hloka
8 lefaseng la batho
10 tlhalošo ya maleba
13 tšwetšopele ya dingangišano
16 dikabo ka diselaete
17 dikabo tša go fiwa
18 melaetša ya go ngwalwa
19 mareo a motheo
20 lefelong la mošomo
21 tšhate ya para
23 dikgokagano tša go leka
24 dithuši tša pono ya mahlo

Both examples were obtained using a terminology extraction pipeline we are busy building. Such a pipeline relies on components that are more or less standard and readily available (actually plentiful, for English), or rarer, still to be crafted and, generally speaking, less accurate for less-resourced languages. These include tools to automatically identify the language of a text, split the text into sentences and words, tag those words with a part of speech (is it a noun, a verb, an adjective…?), and apply a grammar to extract phrases rather than single words. Then comes the statistical machinery that turns word counts into meaningful scores (such as the relevance score mentioned previously).
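
The phrase extraction and counting steps could look like the sketch below. It is not the actual pipeline: it assumes the text has already been split and tagged, it uses a hypothetical tagset (DET, ADJ, NOUN, VERB…) and a deliberately simple English pattern (adjectives and nouns ending in a noun). The grammar would need to be adapted to each language.

```python
from collections import Counter

def extract_noun_phrases(tagged_sentence, min_len=2):
    """Return maximal runs of ADJ/NOUN tokens that end in a NOUN.

    tagged_sentence: list of (word, tag) pairs.
    """
    phrases, run = [], []
    # A sentinel with a non-matching tag flushes the last run.
    for word, tag in list(tagged_sentence) + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # The run has ended: drop trailing adjectives so it ends on a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= min_len:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases

def rank_terms(tagged_sentences):
    """Rank candidate terms by their frequency in the document."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(extract_noun_phrases(sentence))
    return counts.most_common()

if __name__ == "__main__":
    # Hand-tagged toy sentences (illustrative only).
    doc = [
        [("the", "DET"), ("magnetic", "ADJ"), ("field", "NOUN"),
         ("is", "VERB"), ("strong", "ADJ"), (".", "PUNCT")],
        [("a", "DET"), ("simple", "ADJ"), ("pendulum", "NOUN"),
         ("swings", "VERB"), ("in", "ADP"), ("the", "DET"),
         ("magnetic", "ADJ"), ("field", "NOUN"), (".", "PUNCT")],
    ]
    print(rank_terms(doc))
```

The frequency ranking produced here corresponds to the second table; the relevance ranking of the first table needs the additional scoring step across topics.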

To go further, one could turn to translated material as a source for term extraction, this time extracting or ranking bilingual or multilingual terms. Such a resource would be valuable for translation (even though it is extracted from… existing translations!) or for building software tools for linguistic support.


I am learning Zulu – Ngiyasifunda isiZulu!

I have always been keen on learning the local language wherever I found myself. At least, I would try. Now in South Africa, one is faced with no fewer than 11 official languages. Which one should I learn? Maybe I can skip English all right. That leaves 10…

That is one issue. Another is how those languages are spread over the country. If only I could choose the one that is spoken locally, I’d be content with that. While this could make sense in provinces like KwaZulu-Natal (predominantly Zulu) or the Western Cape (Afrikaans and some Xhosa), in Gauteng, the most populous and urbanized province, where I stay, it is another matter. Here, the whole of South Africa meets.


(picture: Wikipedia)

Can you spot Gauteng? Perhaps the smallest province, but with drops of all colours in it.

In townships, people seem to mix those languages. I know of one township, Soshanguve (north of Pretoria), whose name explicitly derives from a gathering of all indigenous populations (the result of a political will to control where people lived at the time): Sotho – Shangani – Nguni – Venda.

  • The Sotho languages group together Southern and Northern Sotho with Tswana.
  • Nguni languages include Ndebele, Swati, Zulu and Xhosa.
  • Shangani is another name for Tsonga.
  • Venda is another, less widely spoken language in the north-eastern part of the country.

The mish-mash of languages spoken in Gauteng, between Johannesburg and Pretoria, is often associated with Tsotsitaal, the slang or street lingo. From what I understand, tsotsi is a Zulu word for a ‘thug’, and taal is Afrikaans for ‘language’.

Now you would think it would make perfect sense for me to pick tsotsitaal over all the others. I will hopefully catch some of it anyway, but I wanted to try and learn a language I could use for NLP. Back in 2000, I had picked up a few words of Afrikaans and Pedi, which is the closest you can get to a ‘local language’ in Pretoria. This time, motivated both by the convenience of learning the most widely understood autochthonous language and by some existing NLP work, I am focusing on Zulu instead. I found out that whenever Black students from different language groups met, Zulu was the one they would switch to. Alright, I would still love to get some Pedi or even Venda (apparently speaking Venda qualifies you as a language genius here, while many claim to speak at least six of the official languages or… all of them), and I would love to praat Afrikaans, but hey, let me start somewhere.

Initially, I bought a book but did not find it very useful. I lent it to a Zulu friend of mine for inspection, and I have not seen it since. Actually, it does not matter much. I’m getting an almost daily dose of Zulu through Memrise (the free version, see below). Or I watch Dingani’s videos directly on YouTube.

I still can’t do much more than greet and thank people, bid farewell in a polite manner or say something like “the boy sees the dog” (umfana ubona inja). But I’m keeping at it, and hopefully I’ll soon be able to grasp the Zulu lyrics in that Johnny Clegg song I heard in childhood. Or in maskandi, the blues-like Zulu music.

Kancane kancane… (little by little)


Here’s a tsotsitaal dictionary

Memrise your Zulu

Zulu with Dingani



A national conference on multilingualism in higher education

For two days last August, UNISA hosted the National Conference on Multilingualism in Higher Education. Academics and language practitioners from all around South Africa came together to present their initiatives and discuss the issues at stake.

For instance, North West University has started to become a trilingual university (English, Afrikaans, Setswana). It already provides interpreting services for those languages.

At the University of KwaZulu-Natal, learning Zulu is now compulsory for all students, and a team around Dr Langa Khumalo is working actively on producing multilingual terminologies. His own talk introduced a tongue-twisting term often heard during those two days: intellectualisation. In short, it refers to the process of empowering a language of communication to become a language of learning and teaching.

Prof. Mbulungeni from the University of Cape Town (UCT) gave a keynote address. UCT also seems to have a terminology project with an online platform (but I could not retrieve the link).

As for the University of South Africa (UNISA), it has already initiated the translation of selected learning modules into all official languages. Prof. Koliswa Moropa, Khetiwe Marais and Feziwe Shoba presented this ongoing effort and the challenges it raises.

Prof. Laurette Pretorius presented our own effort at the Academy of African Languages and Science to build language resources from institutional content in Higher Education. My own bit there mainly deals with terminology extraction for the South African languages, on which I have the pleasure of working with Friedel Wolff. He himself gave a talk on the more or less multilingual nature of institutional websites.

Overall, a pretty good conference, very stimulating for those of us working in the realm of mother-tongue education, be it on an institutional or technological level.


Conference poster and full programme

The Academy of African Languages and Science at UNISA

UKZN  Language planning and development office

Interpreting services at NWU

South African rainbow tongue

The South African constitution of 1996 recognizes no fewer than 11 official languages.

They comprise Southern Bantu languages, plus Afrikaans and English.

I guess I can skip introducing English. As for Afrikaans, it is a Germanic language that evolved from the Dutch spoken by European settlers, originally in the Cape Peninsula.

Among the Southern Bantu languages, we find the Nguni languages: Ndebele, Swati, Zulu and Xhosa. They are easily recognizable by their click sounds, which they borrowed from the Khoi-San languages.

The Sotho languages form another sub-family, with Sotho, Northern Sotho (a.k.a. Pedi) and Tswana.

Finally, there remain two outliers, located in the north-eastern part of the country: Tsonga (also spoken in Mozambique) and Venda.

This linguistic context is one of the reasons that brought me here, to South Africa and to the University of South Africa (UNISA). UNISA is a massive (400,000 students!) comprehensive distance-learning university, the largest in South Africa and on the whole continent.

As such, it has a specific mission in serving the whole population, regardless of the language they speak. I will come back to that in another post!



UNISA or the University of South Africa.

Listen to the clicks.