Monthly Archives: December 2015

technology for multilingual terminologies

As I mentioned in an earlier post, South African academia seems to be buzzing about multilingualism. In that respect, technology surely has its part to play.

Indeed, providing multilingual access to education may involve publishing learning material in each of the languages involved. That may be a case for machine translation or computer-aided translation.

A lighter version of this would be to provide students with linguistic help tools: for instance, a tool that provides a translation of a subject-specific phrase, or “term”, in the student’s language, ideally with a definition.

A first step in that direction would be to identify, for each subject, its specific lexicon. This can be learnt from a corpus, more precisely from existing learning material. Such material is likely to be available in the dominant language (in our case English and, to a lesser extent, Afrikaans), from which we may want to translate for the purpose of knowledge dissemination. We would like to learn which terms are specific to a subject, and which are more specific than others, by assigning each term a subject-specific score. Once term phrases are extracted, a basic implementation of that idea would use term frequency–inverse document frequency (see end of post). By weighting the number of times a term is seen in a document (or, alternatively, in a range of documents within a given topic) by the inverse of the number of documents (or topics) containing that term, usually on a logarithmic scale, it gives an indication of the term’s relevance to the topic.
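The scoring just described can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline's actual code: the function name `tf_idf`, the toy corpus and the choice of a natural logarithm are all assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(topic_terms):
    """Score terms per topic. topic_terms maps a topic name to the
    list of term phrases extracted from that topic's material."""
    n_topics = len(topic_terms)
    # Term frequency: how often each term occurs within each topic.
    tf = {topic: Counter(terms) for topic, terms in topic_terms.items()}
    # Document frequency: in how many topics each term occurs at least once.
    df = Counter(term for counts in tf.values() for term in counts)
    return {
        topic: {
            term: count * math.log(n_topics / df[term])
            for term, count in counts.items()
        }
        for topic, counts in tf.items()
    }

# Toy data, invented for illustration only.
corpus = {
    "physics": ["magnetic field", "magnetic field", "practical session", "resistor"],
    "biology": ["cell", "practical session", "cell membrane"],
    "geology": ["rock", "practical session", "magnetic field"],
}
scores = tf_idf(corpus)
ranked = sorted(scores["physics"], key=scores["physics"].get, reverse=True)
print(ranked)
```

In this toy corpus, a term that appears in every topic gets a score of zero, whatever its frequency, which is the behaviour we want for terms that are not specific to any one subject.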

[Image: crocodile-clip] Snapshot from a Physics tutorial, in English.

Here is what I got for the subject of physics in one of my experiments, extracting noun phrases from a corpus of learning material.

Relevance Rank   Term
1                Physics
2                magnetic field
3                Physics Department
4                sin
5                kg
6                cm
7                home experiment
8                equation
9                electric field
10               potential difference
11               radius
12               magnitude
13               eq
14               resistor
15               science library
16               angle
17               fl
18               wire
19               ammeter
20               coil
21               Physics PO Box
22               simple pendulum
23               experiment
24               Hz
25               practical session
25practical session

We might want to exclude some of those terms (such as the trigonometric functions or the units of measurement). The term “practical session” might better be ranked lower, since it could also belong to a few other subjects, such as biology or geology, and why not arts or engineering too.
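Excluding such terms could be as simple as a stoplist applied after ranking. The sketch below is hypothetical: the sets `UNITS` and `FUNCTIONS` and the function `filter_terms` are names made up for the example, and a real stoplist would be much longer.

```python
# Hypothetical stoplists; real ones would be far more complete.
UNITS = {"kg", "cm", "hz"}
FUNCTIONS = {"sin", "cos", "tan"}

def filter_terms(ranked_terms, excluded=UNITS | FUNCTIONS):
    """Drop terms found in the stoplist, keeping the original order."""
    return [term for term in ranked_terms if term.lower() not in excluded]

# A few terms from the ranking above, in rank order.
terms = ["Physics", "magnetic field", "sin", "kg", "cm", "equation", "Hz"]
kept = filter_terms(terms)
print(kept)
```

A term like “practical session” is a different case: a stoplist would not help, but a score that takes other subjects into account should push it down.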

Now can we perform such an extraction of (monolingual) terms in another language, for instance Northern Sotho?

Let’s have a look at a first attempt at extracting terms from a single document, in the domain of labour law. Please note that this time, terms are ranked by their frequency in the document, not by their relevance to the domain.

Frequency Rank   Term
1                šomana
2                bala ga tlaleletšo
3                ao
4                diteng tša kgaolo
5                mmotšule
6                boitekolo
7                dikgokagano tša go hloka
8                lefaseng la batho
9                sephorofešene
10               tlhalošo ya maleba
11               tšhomišong
12               kgaolong
13               tšwetšopele ya dingangišano
14               onlaene
15               karolwana
16               dikabo ka diselaete
17               dikabo tša go fiwa
18               melaetša ya go ngwalwa
19               mareo a motheo
20               lefelong la mošomo
21               tšhate ya para
22               dikholego
23               dikgokagano tša go leka
24               dithuši tša pono ya mahlo
25               hlogo
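Ranking by plain frequency within one document needs nothing more than a counter. Here is a minimal sketch, assuming the terms have already been extracted; the function name `rank_by_frequency` and the placeholder terms are invented for the example and are not taken from the document above.

```python
from collections import Counter

def rank_by_frequency(extracted_terms, top_n=25):
    """Return (rank, term, count) tuples for the most frequent terms."""
    counts = Counter(extracted_terms)
    return [
        (rank, term, count)
        for rank, (term, count) in enumerate(counts.most_common(top_n), start=1)
    ]

# Placeholder terms, invented for illustration.
extracted = ["term_a", "term_b", "term_a", "term_c", "term_a", "term_b"]
ranking = rank_by_frequency(extracted, top_n=3)
for rank, term, count in ranking:
    print(rank, term, count)
```

Unlike a relevance score, this ranking uses no information from other documents, so very common words can easily climb to the top.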

Both of those examples were obtained using a terminology extraction pipeline we are busy building. Such a pipeline relies on components that are more or less standard and readily available (actually plentiful, for English), or rarer, still to be crafted and, generally speaking, less efficient for less-resourced languages. These include tools to automatically identify the language of a text, to split the text into sentences and words, to tag those words with a part of speech (is it a noun, a verb, an adjective?), and to apply a grammar that extracts phrases rather than only single words. Then comes the statistical machinery that turns word counts into meaningful scores (such as the relevance score mentioned earlier).
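To make those stages concrete, here is a toy version of such a pipeline in plain Python. It is a sketch under strong assumptions, not our actual pipeline: language identification is left out, the sentence splitter and tokenizer are naive regular expressions, the part-of-speech tagger is a hypothetical hand-written lexicon (`TOY_LEXICON`), and the grammar only recognises adjectives followed by nouns.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a trained part-of-speech tagger.
TOY_LEXICON = {
    "magnetic": "ADJ", "electric": "ADJ", "simple": "ADJ",
    "field": "NOUN", "pendulum": "NOUN", "coil": "NOUN", "wire": "NOUN",
    "the": "DET", "a": "DET", "in": "ADP", "of": "ADP",
    "moves": "VERB", "creates": "VERB", "is": "VERB",
}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Keep runs of word characters, lowercased; punctuation is dropped.
    return re.findall(r"\w+", sentence.lower())

def tag(tokens):
    # Words missing from the lexicon get the tag "X".
    return [(tok, TOY_LEXICON.get(tok, "X")) for tok in tokens]

def noun_phrases(tagged):
    # Grammar: zero or more adjectives followed by one or more nouns.
    phrases, current, has_noun = [], [], False
    for tok, pos in tagged + [("", "END")]:
        if pos == "NOUN":
            current.append(tok)
            has_noun = True
        elif pos == "ADJ" and not has_noun:
            current.append(tok)
        else:
            if has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
            if pos == "ADJ":
                # An adjective right after a noun starts a new candidate.
                current.append(tok)
    return phrases

def extract_terms(text):
    # Count every noun phrase found in every sentence.
    counts = Counter()
    for sentence in split_sentences(text):
        counts.update(noun_phrases(tag(tokenize(sentence))))
    return counts

text = "The coil creates a magnetic field. The wire moves in the magnetic field."
term_counts = extract_terms(text)
print(term_counts.most_common())
```

For a less-resourced language, the hard part is precisely what this sketch fakes: a reliable tagger and a grammar suited to the language.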

To go further, one could turn to translated material as a source for term extraction, this time extracting and ranking bilingual or multilingual terms. Such a resource would be precious for translation (even though it is itself extracted from… existing translations!) and for crafting software tools for linguistic support.


FURTHER

TF-IDF


I am learning Zulu – Ngiyasifunda isiZulu!

I have always been keen on learning the local language wherever I found myself. At least, I would try. Now in South Africa, one is faced with no fewer than 11 official languages. Which one should I learn? Maybe I can skip English all right. That leaves 10…

That is one issue. Another one is how those languages are spread over the country. If only I could choose the one that is spoken locally, I’d be content with that. While this could make sense in provinces like KwaZulu-Natal (predominantly Zulu) or the Western Cape (Afrikaans and some Xhosa), in Gauteng, the most populous and urbanized province, where I stay, it is another matter. Here, the whole of South Africa meets.

[Image: South Africa 2011 dominant language map (picture: Wikipedia)]

Can you spot Gauteng? Perhaps the smallest province, but with drops of all colours in it.

In townships, people seem to mix those languages. I know of one township, Soshanguve (north of Pretoria), whose name explicitly derives from a gathering of all indigenous populations (a political will to rule over people’s habitat at the time): Sotho – Shangani – Nguni – Venda.

  • The Sotho languages group together Southern and Northern Sotho with Tswana.
  • Nguni languages include Ndebele, Swati, Zulu and Xhosa.
  • Shangani is another name for Tsonga.
  • Venda is another, less widely spoken language, found in the north-eastern part of the country.

The mish-mash of languages spoken in Gauteng, between Johannesburg and Pretoria, is often associated with Tsotsitaal, the slang or street lingo. From what I understand, tsotsi is a Zulu word for a ‘thug’, and taal is Afrikaans for ‘language’.

Now you would think it would make perfect sense for me to pick tsotsitaal over all the others. I will hopefully catch some of it anyway, but I wanted to try and learn a language I could use for NLP. Back in 2000, I had caught a few words of Afrikaans and Pedi, which is the closest you can get to a ‘local language’ in Pretoria. This time, motivated both by the ease of learning the most widely understood autochthonous language and by some already existing NLP work, I am focusing on Zulu instead. I found out that whenever Black students of different language groups met, Zulu was the one they would switch to. Alright, I would still love to get some Pedi or even Venda (apparently speaking Venda qualifies you as a language genius here, while many claim to speak at least six of the official languages or… all of them), and I would love to praat Afrikaans, but hey, let me start somewhere.

Initially, I bought a book but did not find it very useful. I lent it to a Zulu friend of mine for inspection, and I have not seen it since. Actually, it does not matter much. I’m getting an almost daily dose of Zulu through Memrise (the free version, see below). Or I watch Dingani’s videos directly on YouTube.

I still can’t do much more than greet and thank people, bid farewell in a polite manner, or say something like “the boy sees the dog” (umfana ubona inja). But I’m keeping it up, and hopefully I’ll soon be able to grasp the Zulu lyrics of that Johnny Clegg song I heard in childhood. Or those of maskandi, the blues-like Zulu music.

Kancane kancane… (little by little)


FURTHER

Here’s a tsotsitaal dictionary

http://www.bilingo.co.za/tsotsitaal-dictionary

Memrise your Zulu

http://www.memrise.com/course/82/zulu-with-dingani/

Zulu with Dingani

https://www.youtube.com/user/ZuluWithDingani

asimbonanga!

http://www.metrolyrics.com/asimbonanga-lyrics-johnny-clegg-savuka.html