OpenAI’s GPT-4 is said to be based on the Mixture of Experts architecture and to have 1.76 trillion parameters.
GPT-4 is rumored to consist of eight models with 220 billion parameters each (8 × 220 billion = 1.76 trillion), linked in a Mixture of Experts (MoE) architecture. The MoE idea is nearly 30 years old and has been used for large language models before, for example in Google’s Switch Transformer.
The MoE model is a type of ensemble learning that combines different models, called “experts,” to make a decision. In an MoE model, a gating network determines the weight of each expert’s output based on the input. This allows different experts to specialize in different parts of the input space. This architecture is particularly useful for large and complex data sets, as it can effectively partition the problem space into simpler subspaces.
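The gating mechanism described above can be illustrated with a small sketch. This is a toy example and not GPT-4's actual implementation, which has not been published: the class name `MixtureOfExperts`, the linear "experts," the random weights, and all sizes are assumptions made purely for illustration. The optional `top_k` argument shows sparse routing, where only the highest-weighted experts are evaluated; Switch Transformer, for instance, routes each token to a single expert. How GPT-4 would route inputs is not known.

```python
import math
import random


def softmax(scores):
    """Turn raw gate scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class MixtureOfExperts:
    """Toy MoE layer: every expert is a linear map from a vector to a scalar.

    Hypothetical illustration only; real MoE language models use neural
    network experts and learn the weights instead of drawing them at random.
    """

    def __init__(self, num_experts, dim, seed=0):
        rng = random.Random(seed)
        self.expert_weights = [
            [rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)
        ]
        self.gate_weights = [
            [rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)
        ]

    def gate(self, x):
        """Gating network: one weight per expert, computed from the input."""
        return softmax([dot(w, x) for w in self.gate_weights])

    def forward(self, x, top_k=None):
        weights = self.gate(x)
        if top_k is not None:
            # Sparse routing: keep the top_k experts and renormalize.
            order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
            keep = set(order[:top_k])
            kept_total = sum(weights[i] for i in keep)
            weights = [
                weights[i] / kept_total if i in keep else 0.0
                for i in range(len(weights))
            ]
        # Only experts with a non-zero weight are evaluated.
        output = sum(
            g * dot(w, x)
            for g, w in zip(weights, self.expert_weights)
            if g > 0.0
        )
        return output, weights


moe = MixtureOfExperts(num_experts=8, dim=4)
x = [0.5, -1.0, 2.0, 0.1]
dense_output, dense_weights = moe.forward(x)            # all eight experts contribute
sparse_output, sparse_weights = moe.forward(x, top_k=2)  # only two experts are evaluated
print(dense_weights)
print(sparse_weights)
```

With sparse routing, only a fraction of the total parameters is active for any given input, which is one reason the architecture is attractive for very large models.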
No statement from OpenAI, but the rumors are credible
The information about GPT-4 comes from George Hotz, founder of Comma.ai, an autonomous driving startup. Hotz is an AI expert who is also known for his hacking past: He was the first to crack the iPhone and Sony’s PlayStation 3.
Other AI experts have also commented on Hotz’s Twitter feed, saying that his information is very likely true.
i might have heard the same 😃 — I guess info like this is passed around but no one wants to say it out loud.
GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference.
Glad that Geohot said it out loud.
— Soumith Chintala (@soumithchintala) June 20, 2023
What can open-source learn from GPT-4?
The architecture may have simplified the training of GPT-4 by allowing different teams to work on different parts of the network. This would also explain why OpenAI was able to develop GPT-4’s multimodal capabilities independently of the currently available product and release them separately. In the meantime, however, GPT-4 may have been merged into a smaller model to be more efficient, speculated Soumith Chintala, one of the founders of PyTorch.
Hotz also speculated that GPT-4 does not produce its output in a single pass but iteratively generates 16 outputs, each improving on the previous one.
The open-source community could now try to replicate this architecture; the ideas and technology have been available for some time. However, GPT-4 may have shown how far the MoE architecture can go with the right training data and computational resources.