This section is designed for those who want to understand how Swahili Wordnet is built.
There are currently two methods for constructing a wordnet:
- Expand approach
This approach translates the Princeton WordNet synsets into the target language, takes over the relations, and then revises them. It has been the method of choice for building most wordnets. Translation into the target language is usually done with machine translation based on deep learning models trained on a large bilingual text corpus. This is faster than manual translation, which is slow and expensive, but manual translation is of better quality and more accurate.
- Merge approach
This approach defines synsets and relations in your own language and then aligns the resulting wordnet with the Princeton WordNet using equivalence relations. Though not used much, it has the advantage of preserving the morphological richness of the language.
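To make the difference between the two approaches concrete, here is a minimal Python sketch. The `Synset` record, the `expand` and `merge` helpers, and all synset ids (`pwn-0001`, `swa-native-0001`, and so on) are hypothetical names made up for illustration; they are not the data model of any existing wordnet tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Synset:
    """Hypothetical, simplified synset record used only for this sketch."""
    synset_id: str
    lemmas: List[str]
    # relation name -> ids of related synsets, e.g. {"hypernym": ["pwn-0002"]}
    relations: Dict[str, List[str]] = field(default_factory=dict)
    # id of the aligned Princeton synset, if any
    equivalent_to: Optional[str] = None


def expand(pwn_synset: Synset, translations: List[str]) -> Synset:
    """Expand approach: translate the lemmas and take over the relations.

    The copied relations still point at Princeton ids and are meant to be
    revised afterwards.
    """
    return Synset(
        synset_id="swa-" + pwn_synset.synset_id,
        lemmas=list(translations),
        relations={name: list(ids) for name, ids in pwn_synset.relations.items()},
        equivalent_to=pwn_synset.synset_id,
    )


def merge(native_synset: Synset, pwn_id: str) -> Synset:
    """Merge approach: the synset was defined natively; only add the alignment."""
    native_synset.equivalent_to = pwn_id
    return native_synset


# Expand: start from a Princeton synset and translate it.
pwn_tree = Synset("pwn-0001", ["tree"], {"hypernym": ["pwn-0002"]})
swa_tree = expand(pwn_tree, ["mti"])

# Merge: start from a Swahili synset and align it afterwards.
native = Synset("swa-native-0001", ["ng'oa"])
aligned = merge(native, "pwn-0003")
```

In the expand case the relations come for free from the source wordnet; in the merge case the Swahili synset keeps whatever relations were defined for it natively (none in this toy example) and only gains an equivalence link.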
For building the Swahili Wordnet, we decided to take the unconventional route of using the two approaches together. This decision was reached for the following reasons:
Our first target is to translate the 71 base synsets from EuroWordNet; under each base synset there are other synsets. While translating, we noticed that not all words have a direct Swahili translation. Rather than discarding the Swahili word that seemed closest to the word being translated, we entered it together with its definition, part-of-speech tag, synonyms, and any other words related to it, whether hypernyms or hyponyms.
We want to maintain Swahili morphological richness as much as possible.
A wordnet consists of core base synsets and other synsets. The most commonly used set is the 1310 base synsets, which serve as a foundation for building various wordnets. To keep the project feasible given our limited personnel and expertise, starting with a minimal set of 71 Base Types, as proposed by the EuroWordNet ontology, seems reasonable.
- The first level of the Base Type ontology is divided into:
Entity - concrete things (1st Order)
Concept - concepts, ideas in the mind (3rd Order)
Event - happenings that involve change (2nd Order)
State - static situations (2nd Order)
We won’t discuss the above in detail; readers are advised to follow the link to learn more about the 71 Base Types.
The figure below shows a summary of the proposed flow.
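For readers who prefer to see the first level as data, the four top-level types and their orders could be encoded as below. This is only an illustrative sketch; the `TopLevelType` enum and the `types_of_order` helper are hypothetical names, not part of EuroWordNet or any tool.

```python
from enum import Enum


class TopLevelType(Enum):
    """First level of the Base Type ontology: (label, order)."""
    ENTITY = ("Entity", 1)    # concrete things
    CONCEPT = ("Concept", 3)  # concepts, ideas in the mind
    EVENT = ("Event", 2)      # happenings that involve change
    STATE = ("State", 2)      # static situations

    def __init__(self, label, order):
        self.label = label
        self.order = order


def types_of_order(order):
    """Return the labels of all top-level types with the given order."""
    return [t.label for t in TopLevelType if t.order == order]


second_order = types_of_order(2)
```

Here `second_order` holds the two 2nd Order types, Event and State.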
We start with the 71 base synsets. We pick a word and look up its translation in bilingual dictionaries and online resources. To ensure high accuracy, we compare the words returned by the various lexical resources: if a word appears more than once, it is assumed to be the correct translation; if a word appears only once, we use our own judgment to decide whether it is the correct one. If it is, we extract its definition from local kamusi (dictionaries), together with its part-of-speech tag, usage, and noun class, and link the word to the Princeton WordNet through the Interlingual Index (ILI).
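The comparison step described above amounts to a simple vote across resources. The following is a minimal sketch of that idea, not the project's actual tooling: the function name `vote`, the resource names, and the candidate words are all made up for illustration, and it assumes each resource counts at most once per word.

```python
from collections import Counter


def normalize(word):
    """Lowercase and trim so that 'Mti' and 'mti ' count as the same word."""
    return word.strip().lower()


def vote(candidates_by_resource, min_votes=2):
    """Split candidate translations into accepted and needs-review lists.

    A word returned by at least `min_votes` different resources is accepted;
    a word returned by fewer is left for a human to judge.
    """
    counts = Counter()
    for words in candidates_by_resource.values():
        # A set ensures one resource contributes at most one vote per word.
        counts.update({normalize(w) for w in words})
    accepted = sorted(w for w, n in counts.items() if n >= min_votes)
    needs_review = sorted(w for w, n in counts.items() if n < min_votes)
    return accepted, needs_review


# Hypothetical lookups for the English word "tree".
lookups = {
    "dictionary_a": ["mti"],
    "dictionary_b": ["Mti", "mbao"],
    "online_resource": ["mti "],
}
accepted, needs_review = vote(lookups)
```

With these sample lookups, `mti` is accepted because three resources agree on it, while `mbao` appears only once and is left for manual judgment.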
If no translation is found at all, the English word is discarded. In cases where a translation exists but its meaning does not coincide with that of the English word, the Swahili word goes through the Merge approach. The same is done for other Swahili words that do not translate directly into English, such as the 24 Swahili verbs.
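The three possible outcomes for a word can be summarized as a small decision function. This is a sketch of the flow as described in this section; the `Route` enum and the `route` function are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    EXPAND = "link the Swahili word to Princeton WordNet through the ILI"
    MERGE = "define the Swahili word natively, align it later"
    DISCARD = "drop the English word"


def route(translation: Optional[str], meaning_coincides: bool) -> Route:
    """Decide what happens to an English word after the lookup step.

    translation        -- the chosen Swahili word, or None if nothing was found
    meaning_coincides  -- whether the Swahili meaning matches the English one
    """
    if translation is None:
        return Route.DISCARD
    if meaning_coincides:
        return Route.EXPAND
    return Route.MERGE


found_and_matching = route("mti", True)
found_but_different = route("ng'oa", False)
not_found = route(None, False)
```

Swahili words that have no English source word at all would enter the Merge branch directly, without passing through this lookup.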
The tool currently used to develop the Swahili Wordnet is DEBVisDic, a client-server application. It has shortcomings, however, such as not supporting words that contain an apostrophe, e.g. ng’oa, a Swahili word that means to remove an object out of something. From an English-language perspective this is understandable, because most English words with an apostrophe are contractions (reduced word forms, e.g. cannot to can’t), but Swahili has many words of this form. It is therefore in our best interest to build our own lexicon-building tool that fully caters for the Swahili language and that can also aid in building lexical resources for other local languages.
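As a rough illustration of what such a tool would need to get right, the sketch below tokenizes text so that a word-internal apostrophe stays part of the word. It is an assumption about one possible design, not a description of an existing tool; `WORD_RE`, `normalize`, and `tokenize` are hypothetical names.

```python
import re

# One or more letters, optionally followed by groups of (apostrophe + letters).
# Both the straight (') and the typographic (\u2019) apostrophe are accepted.
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def normalize(word):
    """Lowercase the word and store every apostrophe as a straight one."""
    return word.replace("\u2019", "'").lower()


def tokenize(text):
    """Split text into words, keeping word-internal apostrophes."""
    return [normalize(m.group(0)) for m in WORD_RE.finditer(text)]


tokens = tokenize("Kung\u2019oa mti, ng'ombe!")
```

Normalizing the typographic apostrophe to a straight one means that ng’oa and ng'oa are stored as the same lemma, regardless of how the word was typed.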