Arabic Chatbot

July 23, 2019
By Botpress Team

Creating a high-functioning Arabic chatbot just got a lot easier. A recent natural language processing (NLP) breakthrough makes it much easier to map out the structure of Arabic (or any language, for that matter), which is critical for improving the overall performance of Arabic NLP.

Creating a language model

The first step in creating a language model for any language is identifying the discrete units of meaning in that language. In English, for example, the discrete units could be the words in a sentence, and the delimiter in that case would be the spaces between them. The NLP term for these discrete units of meaning is tokens, and the process of parsing the text to identify them is called tokenization.
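
As a minimal sketch of this idea, the short Python snippet below tokenizes an English sentence by splitting on whitespace. The sentence is invented for illustration, and real tokenizers also handle punctuation, casing, and contractions.

    # Minimal whitespace tokenization: the delimiter is the space character.
    sentence = "I would like to buy more data for my phone"

    tokens = sentence.split()  # split on spaces to recover the discrete units
    print(tokens)
    # ['I', 'would', 'like', 'to', 'buy', 'more', 'data', 'for', 'my', 'phone']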

Once the tokens have been identified, they can be fed into an algorithm that maps out how the tokens relate to each other: which words are synonyms, which are close in meaning, which are opposite in meaning, and so on. This process of mapping out the relative meanings of the tokens is called training the language model.

The model needs to be trained on relevant text, normally called a “corpus”. Ideally, this text contains plentiful examples of vocabulary relevant to the chatbot. For example, if the chatbot is going to be used by a telco, it helps to train the model on text relevant to that industry, such as conversations about buying credits for calls and data.
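
As a rough sketch of this training step, the example below uses the open-source gensim library to train a small word-embedding model on a toy, telco-flavored corpus. The corpus and parameters are invented for illustration; a useful model would need vastly more text.

    from gensim.models import Word2Vec

    # Toy telco corpus: each sentence is already tokenized into words.
    corpus = [
        "i want to buy more data for my phone".split(),
        "how do i top up my calling credits".split(),
        "my data bundle ran out please add more credits".split(),
        "can i buy extra call minutes and data".split(),
    ]

    # Train embeddings that place tokens used in similar contexts close
    # together in vector space.
    model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=200)

    # Tokens with similar usage end up with similar vectors.
    print(model.wv.most_similar("data", topn=3))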

Chatbot AI platforms generally offer out-of-the-box language models trained on a generic corpus such as Wikipedia, but to get the best possible performance it is sometimes necessary to train the model on more industry- or task-specific text, as mentioned above.

Arabic language

Arabic, by the nature of the language, makes all of the above more challenging than it is for other popular languages. Arabic is less rigidly structured than other common languages, and the main problem this creates is that Arabic is difficult to tokenize, for the following reasons:

  • Arabic has a rich and complex grammatical structure. For example, more than one unit of meaning is often embedded in a single word (as the sketch after this list illustrates).
  • When Arabic is written, short vowels are usually omitted from the words.
  • Arabic has its own unique set of characters.
  • Arabic is much more prone to ambiguity because it allows far more freedom in how sentences are put together. Where English sentences generally follow a Subject-Verb-Object order, Arabic sentences can be arranged in almost any order.
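
To make the first point concrete, the snippet below splits an Arabic sentence on whitespace. The sentence is invented for illustration; the point is that a single whitespace-delimited token can fuse a conjunction, a preposition, the definite article, and a noun, so splitting on spaces does not recover the real units of meaning.

    # "وبالقلم" ("and with the pen") is one whitespace token, but it fuses
    # four units of meaning: و (and) + ب (with) + ال (the) + قلم (pen).
    text = "وبالقلم كتبت الرسالة"  # "and with the pen I wrote the letter"

    tokens = text.split()
    print(tokens)  # three whitespace tokens, but many more units of meaning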

Aside from the above complexity, building Arabic language models has some practical challenges. First, Arabic is written from right to left; second, it has many dialects. These dialects can to some extent be considered separate but related languages, since speakers of one dialect cannot necessarily understand the others.

All of these factors have made it very challenging to tokenize Arabic accurately, which has left Arabic NLP lagging behind the NLP for other languages. A contributing problem has also been that there has been less research effort directed at Arabic NLP than at the common Western languages, English in particular.

The difficulty of tokenizing Arabic meant that data scientists had to apply far more manual effort to tokenize the language, and even then they could not produce a truly reliable language model.

The breakthrough

The breakthrough came in late 2018, when machine learning was applied to the tokenization step itself. A model could now learn to tokenize Arabic, or any other language, automatically, making tokenization both more reliable and much easier to perform. No longer did each language need to be tokenized independently with customized models; all languages could be treated the same way. In machine learning jargon, the NLP became “language agnostic”.

This has particular significance for hard-to-tokenize languages such as Arabic.
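
The article does not name a specific algorithm, but subword tokenizers learned directly from raw text, such as the open-source SentencePiece library, are a representative example of this language-agnostic approach. The sketch below trains one on a placeholder Arabic text file; the file name and vocabulary size are assumptions.

    import sentencepiece as spm

    # Learn a subword vocabulary directly from raw Arabic text; no
    # hand-written, language-specific rules are needed, which is what
    # makes the approach language agnostic.
    spm.SentencePieceTrainer.train(
        input="arabic_corpus.txt",  # placeholder corpus file
        model_prefix="ar_sp",
        vocab_size=8000,
    )

    # Load the trained model and split words into learned subword pieces.
    sp = spm.SentencePieceProcessor()
    sp.load("ar_sp.model")
    print(sp.encode_as_pieces("وبالقلم كتبت الرسالة"))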

Of course, just because this technology is available doesn’t mean that all Arabic chatbots will instantly improve. First, Arabic chatbot AI platforms need to update their existing NLP with the new algorithms, and this is not something they will do automatically, given their investment in the previous technology.

In addition, many factors beyond the quality of the Arabic language understanding determine the quality of a chatbot. Having the best overall NLP algorithms is also critical: for example, how well the NLP disambiguates between intents of similar meaning, and how reliably it ignores out-of-scope phrases (sketched below). The NLP model should also allow designers to control sophisticated slot filling and context-driven understanding.
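
As a minimal sketch of one such behavior, the snippet below shows a common pattern for ignoring out-of-scope phrases: treat an utterance as out of scope when no intent is predicted with sufficient confidence. The intents, scores, and threshold here are hypothetical.

    # Hypothetical intent scores produced by an NLP model for one utterance.
    intent_scores = {"buy_data": 0.41, "top_up_credits": 0.38, "cancel_plan": 0.05}

    OUT_OF_SCOPE_THRESHOLD = 0.60  # illustrative value, tuned per chatbot

    best_intent = max(intent_scores, key=intent_scores.get)
    if intent_scores[best_intent] < OUT_OF_SCOPE_THRESHOLD:
        print("Out of scope: ask the user to rephrase.")
    else:
        print(f"Handle intent: {best_intent}")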

Even with the best NLP algorithms, the chatbot won’t provide a good experience for the end user unless it is well designed in other respects. The chatbot needs to implement sophisticated flows and other programmatic logic, and to integrate seamlessly with third-party systems. The designer needs tools that let them easily control how multiple languages are handled, and debug and test the underlying models and flows. The quality of the platform used to build the chatbot will determine the quality of the chatbot experience that can be achieved.

These advances will lead to greater adoption of chatbots in the Arabic-speaking world and to the development of best practices for Arabic chatbots. Expect to see rapidly improving Arabic chatbot experiences in the near future.