Technology for multilingual terminologies

As I mentioned in an earlier post, South African academia seems to be buzzing about multilingualism. Technology surely has its part to play in that.

Indeed, providing multilingual access to education may involve publishing learning material in each of the languages concerned. That may be a case for machine translation or computer-aided translation.

A lighter version of this would be to provide students with linguistic help tools: for instance, a tool that provides a translation of a subject-specific phrase, or “term”, into the student’s language, ideally with a definition.

A first step in that direction would be, for each subject, to identify its specific lexicon. This can be learnt from a corpus, more precisely from existing learning material. Such learning material is likely to be available in the dominant language (in our case, English and, to a lesser extent, Afrikaans), from which we may want to translate for the purpose of knowledge dissemination. We would like to learn which terms are specific to a subject, and which are more specific than others, by assigning each term a subject-specific score. Once term phrases are extracted, a basic implementation of that idea would make use of term frequency–inverse document frequency (see end of post). By combining the number of times a term is seen in a document (or, alternatively, in a range of documents within a given topic) with the inverse of the number of documents (or topics) containing that term, it gives an indication of the term’s relevance to the topic.
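
To make the scoring concrete, here is a minimal sketch of that idea in Python. It treats each subject as one "document", and it assumes the term phrases have already been extracted. The toy corpus, the function name tfidf_scores and the plain logarithmic form of the inverse document frequency are illustrative assumptions, not the scoring actually used in the experiment described here.

```python
import math
from collections import Counter


def tfidf_scores(terms_by_subject):
    """Score each term per subject with a basic tf-idf.

    terms_by_subject maps a subject name to the list of term phrases
    extracted from that subject's material (hypothetical input format).
    """
    n_subjects = len(terms_by_subject)

    # Document frequency: in how many subjects does each term occur?
    df = Counter()
    for terms in terms_by_subject.values():
        df.update(set(terms))

    scores = {}
    for subject, terms in terms_by_subject.items():
        tf = Counter(terms)  # raw count of each term within this subject
        scores[subject] = {
            term: count * math.log(n_subjects / df[term])
            for term, count in tf.items()
        }
    return scores


# Toy data, invented for illustration only.
corpus = {
    "physics": ["magnetic field", "magnetic field", "electric field",
                "practical session"],
    "biology": ["cell membrane", "practical session", "practical session"],
    "geology": ["rock formation", "practical session"],
}

scores = tfidf_scores(corpus)
ranked = sorted(scores["physics"].items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranked:
    print(f"{score:.3f}  {term}")
```

With this toy data, a term that occurs in every subject gets a score of zero, because the logarithm of 1 is 0. A term seen often in one subject and nowhere else rises to the top.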

Snapshot from a Physics tutorial, in English.

Here is what I got for the subject of physics in one of my experiments, extracting noun phrases from a corpus of learning material.

Relevance rank   Term
2                magnetic field
3                Physics Department
7                home experiment
9                electric field
10               potential difference
15               science library
21               Physics PO Box
22               simple pendulum
25               practical session

We might want to exclude some of those terms (like the trigonometric functions or the units of measurement). The term “practical session” might better be ranked lower, since it could also belong to a few other subjects, such as biology or geology, and why not arts or engineering.

Now can we perform such an extraction of (monolingual) terms in another language, for instance Northern Sotho?

Let’s have a look at a first attempt at extracting terms from a single document, in the domain of labour law. Please note that this time, terms are ranked by their frequency in the document, not by their relevance to the domain.

Frequency rank   Term
2                bala ga tlaleletšo
4                diteng tša kgaolo
7                dikgokagano tša go hloka
8                lefaseng la batho
10               tlhalošo ya maleba
13               tšwetšopele ya dingangišano
16               dikabo ka diselaete
17               dikabo tša go fiwa
18               melaetša ya go ngwalwa
19               mareo a motheo
20               lefelong la mošomo
21               tšhate ya para
23               dikgokagano tša go leka
24               dithuši tša pono ya mahlo

Both of these examples were obtained using a terminology extraction pipeline we are busy building. Such a pipeline relies on components that are more or less standard and readily available (plentiful, in fact, for English), or rarer, still to be crafted and generally less effective for less-resourced languages. These include tools to automatically identify the language of a text, to split the text into sentences and words, to tag those words with a part of speech (is it a noun, a verb, an adjective?), and to apply a grammar that extracts phrases and not only single words. Then comes the statistical machinery that turns word counts into meaningful scores (such as the relevance score mentioned earlier).
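
The stages listed above can be sketched in a few lines of Python. This is a toy illustration only, not the pipeline we are building: the sentence splitter and tokenizer are naive regular expressions, the part-of-speech "tagger" is a tiny hand-written lexicon (TOY_LEXICON, a hypothetical stand-in for a trained tagger), and the grammar is a single pattern that keeps runs of adjectives and nouns ending in a noun. Language identification is left out.

```python
import re
from collections import Counter

# Hypothetical hand-made lexicon: A = adjective, N = noun.
# A real pipeline would use a trained part-of-speech tagger instead.
TOY_LEXICON = {
    "magnetic": "A", "simple": "A",
    "field": "N", "beam": "N", "pendulum": "N",
}

# Grammar: one or more adjectives/nouns followed by a noun (2+ words).
TERM_PATTERN = re.compile(r"[AN]+N")


def split_sentences(text):
    # Naive split after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased runs of word characters; punctuation is dropped.
    return re.findall(r"\w+", sentence.lower())


def tag(tokens):
    # One tag character per token; unknown words get "X".
    return "".join(TOY_LEXICON.get(tok, "X") for tok in tokens)


def extract_terms(text):
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        tags = tag(tokens)
        # Each match on the tag string maps back to a span of tokens.
        for match in TERM_PATTERN.finditer(tags):
            counts[" ".join(tokens[match.start():match.end()])] += 1
    return counts


text = ("The magnetic field bends the beam. "
        "A simple pendulum swings in the magnetic field.")
counts = extract_terms(text)
for term, freq in counts.most_common():
    print(freq, term)
```

Encoding one tag per character lets an ordinary regular expression play the role of the phrase grammar. For a less-resourced language, the lexicon and the pattern are exactly the parts that would have to be crafted.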

To go further, one could turn to translated material as a source for term extraction, this time extracting and ranking bilingual or multilingual terms. Such a resource would be precious for translation (despite being extracted from… existing translations!) or for crafting software tools for linguistic support.










