Jais is a large language model focused on Arabic and is currently the best open model of its kind.
Researchers from the United Arab Emirates, in collaboration with Cerebras, have introduced two new open language models: Jais and Jais-chat. The models were trained on Arabic and English text as well as code, and significantly outperform existing open-source models for Arabic.
Jais is a 13 billion parameter model pre-trained on 395 billion tokens, of which 116 billion are Arabic tokens. Jais-chat has additionally been instruction-tuned on 10 million instruction/response pairs and outperforms all existing open Arabic and multilingual chatbots.
The models are the first Arabic-centric open models of this scale.
Jais can match ChatGPT in some tasks
Arabic websites, books, news, and Wikipedia served as training data, with all data filtered before training. To compensate for the limited Arabic data available, the team added 232 billion tokens of English data from EleutherAI's The Pile, along with 46 billion code tokens.
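The reported token counts can be sanity-checked with a few lines of arithmetic; this is a minimal sketch using only the figures stated above (the sum of 394 billion is slightly below the stated 395 billion budget, presumably due to rounding):

```python
# Training-data mix reported for Jais, in billions of tokens.
# Figures are taken from the article; percentages are computed here.
mix = {"Arabic": 116, "English (The Pile)": 232, "Code": 46}

total = sum(mix.values())  # 394B, close to the stated 395B pre-training budget
for source, tokens in mix.items():
    print(f"{source}: {tokens}B tokens ({tokens / total:.0%})")
```

Roughly 29% of the pre-training data is Arabic, 59% English, and 12% code.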
In benchmarks, Jais and Jais-chat outperform existing freely available Arabic models by 11 to 15 accuracy points and are competitive with Meta's Llama 2 in English, according to the team. Commercial models such as OpenAI's ChatGPT and Anthropic's Claude still lead on average, but they are also significantly larger. For some tasks, such as writing, Jais and Jais-chat are on par with ChatGPT, the team says.
The team also equips Jais-chat with a number of safety mechanisms, such as filters and classifiers for unwanted requests and outputs.
Another special feature: the model was not trained on Nvidia GPUs but on Cerebras' CS-2 systems. The company produces a wafer-scale AI chip that powers the CS-2 systems.