An in-depth look at MoE: the key architecture behind GPT-4 and the open-source models' secret weapon for a comeback


2025-07-12 Update, from SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)12/24 Report--

Mistral shocked the open-source community last weekend with its large MoE model. What exactly is MoE, and how does it improve the performance of large language models?

The magnet link Mistral posted last weekend stunned the open-source community: this open-source 7B×8E MoE model performs at the level of LLaMA 2 70B.

According to Jim Fan's speculation, if Mistral has internally trained 34B×8E or even 100B+×8E models, their capabilities may already be very close to GPT-4.

Earlier leaks about GPT-4's structure also mostly suggested that GPT-4 is likely an MoE made up of 8 or 16 experts.

Why is MoE a must for large, high-performance models?

MoE is a neural network architecture design that integrates layers of experts (sub-models) into the Transformer block.

As data flows through an MoE layer, each input token is dynamically routed to an expert sub-model for processing. Because each expert can specialize in a particular kind of input, this approach allows more efficient computation and better results.

MoE's most critical components:

- Expert: The MoE layer consists of many experts, which can be small MLPs or models as complex as an LLM such as Mistral 7B.

- Router: The router determines which input tokens are assigned to which experts.

There are two routing strategies: either the token chooses the router, or the router chooses the token.

The router uses a softmax gating function to model a probability distribution over experts or tokens, and then selects the top k.
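
As a rough illustration of the "token chooses the router" strategy, the sketch below applies a softmax to one token's router scores and keeps the top k experts. The function names, the example scores, and the renormalization of the selected weights are assumptions made for illustration, not details of any particular model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k most probable experts.

    The selected probabilities are renormalized to sum to 1
    (one common convention, assumed here).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router scores of a single token for 8 experts.
token_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
selected = route_top_k(token_logits, k=2)
print(selected)
```

Only the selected experts would then process the token, and their outputs would be combined using the returned weights.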

Benefits of MoE:

- Each expert can specialize in different tasks or different parts of the data.

- The MoE architecture can add learnable parameters to an LLM without a matching increase in inference cost, because only a few experts are active for each token.

- Efficient computation of sparse matrices can be exploited.

- All expert layers can be computed in parallel, making efficient use of GPU parallelism.

- It helps scale models effectively and reduces training time, giving better results at lower computational cost.
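
To make the "more parameters, similar inference cost" point concrete, here is a small back-of-the-envelope sketch. The parameter counts are made-up round numbers, not figures for any real model, and the split into "shared" and "per-expert" parameters is a simplifying assumption.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active_per_token) parameter counts for a toy MoE model.

    shared_params: parameters every token uses (attention, embeddings, ...).
    params_per_expert: parameters of one expert, summed over all MoE layers.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    num_experts=8,
    top_k=2,
)
print(f"total: {total:,}  active per token: {active:,}")
```

In this toy setting the model stores 42 billion parameters, but each token only passes through 12 billion of them.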

Large language models (LLMs) have swept the field of machine learning. Modern datasets keep growing in size and complexity, and each one contains different patterns, with distinct relationships between features and labels.

This is where MoE comes in.

Mixture of Experts (MoE) is like a teamwork technique in the world of neural networks.

Imagine breaking down a large task into smaller parts and having different experts handle each part. Then a wise judge decides which expert advice to follow, depending on the circumstances, and all of these recommendations are mixed together.

Just like you combine different flavors into a delicious dish.

Complex data sets can be divided into local subsets, and similarly, the problem to be predicted can be divided into subtasks (using domain knowledge or unsupervised clustering algorithms).

Expert models, which can be any model such as a support vector machine (SVM) or neural network, are then trained for each subset of data, each receiving the same input pattern and making predictions.

MoE also contains a gating model that interprets the predictions made by each expert and decides, based on the input, which experts to trust.

Finally, MoE requires a pooling method to make predictions based on gating models and expert outputs.
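
The three ingredients described above (experts, a gating model, and a pooling method) can be sketched in a few lines. The two toy experts and the hand-written gate below are invented for illustration; in practice both the experts and the gate would be trained models.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two toy "experts"; any model (SVM, neural network, ...) could stand here.
def expert_linear(x):
    return 2.0 * x

def expert_quadratic(x):
    return x * x

def gate(x):
    """Toy gating model: trust the quadratic expert more as x grows."""
    return softmax([-x, x])

def mixture_predict(x):
    """Pooling step: weighted sum of expert outputs using the gate weights."""
    weights = gate(x)
    outputs = [expert_linear(x), expert_quadratic(x)]
    return sum(w * y for w, y in zip(weights, outputs))

print(mixture_predict(-10.0), mixture_predict(10.0))
```

For strongly negative inputs the prediction follows the linear expert, and for strongly positive inputs it follows the quadratic one.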

For practical applications, researchers proposed the "Sparsely-Gated Mixture-of-Experts layer", an iteration on the original MoE that provides a general-purpose neural network component adaptable to different types of tasks.

The Sparsely-Gated Mixture-of-Experts layer consists of many expert networks, each a simple feedforward neural network, plus a trainable gating network. The gating network selects a sparse combination of these experts to process each input.

The emphasis here is on sparsity in the gating function: for each input instance, the gating network selects only a few experts to process it and leaves the rest inactive.

Because this sparse expert selection is made dynamically for each input, the whole process is highly flexible and adaptive, and computational efficiency improves greatly because the inactive parts of the network never need to be evaluated.

In short: faster computation, lower resource consumption, and lower cost.
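
The saving comes from never evaluating the unselected experts. The minimal sketch below makes that visible by counting how often each expert runs. The `SparseMoE` class, the scaling "experts", and the router are hypothetical stand-ins for trained networks.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class SparseMoE:
    """Toy sparse MoE layer: only the top-k experts are evaluated per input."""

    def __init__(self, experts, router, k):
        self.experts = experts            # callables: vector -> vector
        self.router = router              # callable: vector -> expert logits
        self.k = k
        self.calls = [0] * len(experts)   # how often each expert was run

    def __call__(self, x):
        probs = softmax(self.router(x))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.k]
        norm = sum(probs[i] for i in top)
        out = [0.0] * len(x)
        for i in top:                     # inactive experts are skipped entirely
            self.calls[i] += 1
            y = self.experts[i](x)
            out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]
        return out

# Four stand-in experts that simply scale their input by 1, 2, 3 and 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router = lambda x: [x[0], x[1], -x[0], -x[1]]  # hypothetical fixed router

moe = SparseMoE(experts, router, k=2)
output = moe([3.0, 1.0])
print(output, moe.calls)
```

With four experts and k=2, half of the expert computation is skipped for this input.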

MoE layers can also be stacked hierarchically: a primary MoE selects a sparsely weighted combination of "experts", and each of those experts is itself an MoE layer.

In addition, the researchers have come up with an innovative technique called Noisy Top-K Gating.

This mechanism adds tunable Gaussian noise to the gating function, keeps only the top K values, and sets the remaining values to negative infinity so that their gate values become zero after the softmax.

This approach ensures sparsity of the gating network while maintaining robustness to potential discontinuities in the output of the gating function. It also helps with load balancing across expert networks.
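
A minimal sketch of Noisy Top-K Gating follows. It assumes the clean gate scores and the per-expert noise scores have already been computed from the input; in the real layer both would come from trainable weight matrices. The softplus keeps the noise scale positive, and all example values are hypothetical.

```python
import math
import random

def softplus(v):
    return math.log1p(math.exp(v))

def noisy_top_k_gating(clean_logits, noise_logits, k, rng):
    """Return gate values with exactly k non-zero entries (barring underflow)."""
    # 1. Add Gaussian noise, scaled per expert by softplus(noise_logit).
    noisy = [c + rng.gauss(0.0, 1.0) * softplus(n)
             for c, n in zip(clean_logits, noise_logits)]
    # 2. Keep the top k values and set all others to negative infinity.
    keep = set(sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k])
    kept = [h if i in keep else -math.inf for i, h in enumerate(noisy)]
    # 3. Softmax: exp(-inf) is 0.0, so dropped experts get a gate of zero.
    m = max(kept)
    exps = [math.exp(h - m) for h in kept]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)                 # seeded only to make the run repeatable
gates = noisy_top_k_gating(
    clean_logits=[1.0, 0.5, -0.2, 0.3],
    noise_logits=[0.0, 0.0, 0.0, 0.0],
    k=2,
    rng=rng,
)
print(gates)
```

Because of the noise, experts with slightly lower clean scores are occasionally selected, which is one way the technique spreads load across experts.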

MoE and Transformer

Let's take a look at how MoE works inside the Transformer, the architecture behind today's red-hot large language models.

MoE is designed as a neural network architecture that can be integrated into the Transformer structure.

As data flows through the MoE layer, each token is dynamically routed to an expert model for computation, so that each expert can focus on a specific task and deliver better and more efficient results.

The figure shows how the Transformer encoder evolves when MoE layers are added (the decoder is modified similarly): MoE layers replace feedforward layers of the Transformer.

The left of the figure shows the encoder of a standard Transformer model, a stack of self-attention and feedforward layers interleaved with residual connections and normalization layers.

In the middle, replacing every other feedforward layer with an MoE layer gives the model structure of the MoE Transformer encoder.

The right of the figure shows what happens when the model scales to multiple devices: the MoE layer is sharded across devices, while all other layers are replicated.
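
The layout described for the figure can be sketched as plain data. The snippet assumes that every second encoder layer uses an MoE block (which of the two alternating positions is chosen is an assumption) and that experts are placed on devices round-robin; both helper functions are hypothetical.

```python
def encoder_layout(num_layers):
    """Feedforward type per encoder layer: every other layer uses an MoE block."""
    return ["moe" if i % 2 == 1 else "ffn" for i in range(num_layers)]

def shard_experts(num_experts, num_devices):
    """Place experts on devices round-robin.

    Only the MoE experts are split up; attention and dense feedforward
    layers would be copied to every device.
    """
    placement = {d: [] for d in range(num_devices)}
    for e in range(num_experts):
        placement[e % num_devices].append(e)
    return placement

print(encoder_layout(4))
print(shard_experts(num_experts=8, num_devices=4))
```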

We can see that the key components of MoE are various expert models and routing modules.

Expert models can be small MLPs or models as complex as an LLM such as Mistral 7B.

The routing module determines which input tokens to assign to which experts.

There are generally two routing strategies: the token chooses the router, or the router chooses the token. In both cases, a softmax gating function models a probability distribution over experts or tokens, and the top k are selected.
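
The second strategy, where the router chooses the tokens, can be sketched as follows: each expert picks the tokens it scores highest, up to a fixed capacity. The score matrix is a made-up example assumed to hold already-normalized token-to-expert affinities, and `expert_choice` is a hypothetical helper.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Returns {expert_index: [token indices]} where every expert selects
    its `capacity` highest-scoring tokens.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    return assignment

# Hypothetical affinities for 4 tokens and 2 experts.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.6, 0.4],
]
assignment = expert_choice(scores, capacity=2)
print(assignment)
```

Every expert receives exactly `capacity` tokens, so the load is balanced by construction; the trade-off is that a given token may be picked by several experts or by none.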

All of this shows that the MoE layer plays an important role in the Transformer.

Each expert can specialize in different tasks or different parts of the data, and MoE can add learnable parameters to an LLM without a matching increase in inference cost.

In addition, MoE can exploit efficient sparse-matrix computation, and the experts in an MoE layer can be computed in parallel, making effective use of GPU parallelism.

Finally, MoE helps reduce training time and scale models effectively, achieving better results at lower computational cost.

Before Mistral released this open-source 7B × 8E MoE, NVIDIA and Google released other fully open-source MoEs.

Fuzhao Xue, a doctoral student at the National University of Singapore who interned at NVIDIA, said his team also open-sourced an 8-billion-parameter MoE model four months ago.

Project address: github.com/XueFuzhao/OpenMoE

Data sources

- Half from RedPajama, half from The Stack Dedup.

- To improve the model's reasoning ability, a large amount of programming-related data is used.

Model architecture

- The OpenMoE model is based on ST-MoE but uses a decoder-only architecture.

Other designs

- Uses the umT5 tokenizer.

- Uses RoPE (rotary position embeddings).

- Uses the SwiGLU activation function.

- Sets the context length to 2000 tokens.
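
For readers unfamiliar with SwiGLU, the sketch below shows the general shape of a SwiGLU feedforward block: a Swish-activated gate projection multiplied elementwise with a second projection, followed by a projection back down. The tiny weight matrices are invented for illustration and this is not OpenMoE's actual code.

```python
import math

def swish(v):
    """Swish / SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [swish(g) for g in matvec(w_gate, x)]   # gating branch
    up = matvec(w_up, x)                           # linear branch
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise product
    return matvec(w_down, hidden)                  # project back to model width

# Hypothetical tiny weights: model width 2, hidden width 1.
w_gate = [[1.0, 0.0]]
w_up = [[2.0, 0.0]]
w_down = [[1.0], [1.0]]

ffn_out = swiglu_ffn([1.0, 0.0], w_gate, w_up, w_down)
print(ffn_out)
```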

BigBench evaluation

The team ran few-shot tests on BigBench-Lite, including comparisons with BIG-G, BIG-G-Sparse, and GPT-3.

The relative cost is roughly estimated from the number of parameters activated per token and the number of training tokens. The size of each dot in the chart represents the number of parameters activated per token for the corresponding model. The light gray dots represent the total number of parameters of the MoE models.
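
One plausible reading of this estimate is a simple product of the two quantities. The sketch below uses that reading with made-up model sizes and token counts; neither the formula's exact form nor the numbers come from the OpenMoE evaluation.

```python
def relative_cost(active_params_per_token, training_tokens):
    """Rough training-cost proxy: activated parameters times tokens seen."""
    return active_params_per_token * training_tokens

# Hypothetical comparison: a dense model versus an MoE model that activates
# a quarter as many parameters per token, trained on the same token count.
dense_cost = relative_cost(8_000_000_000, 100_000_000_000)
moe_cost = relative_cost(2_000_000_000, 100_000_000_000)
print(moe_cost / dense_cost)
```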

In response, Jim Fan also said that MoE is not new; it just has not received much attention...

For example, Google long ago open-sourced Switch Transformer, a T5-based MoE model.

Challenges and Opportunities

MoE infrastructure development

Because MoE has a large number of trainable parameters, an ideal software environment should support flexible combinations of expert, tensor, pipeline, and data parallelism, both within and across nodes.
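
One way to picture such a combination is a device mesh in which every GPU rank has a coordinate along each parallelism axis. The axis order and sizes below are arbitrary assumptions for illustration; real frameworks define their own layouts.

```python
def rank_to_coords(rank, axes):
    """Map a global rank to coordinates in a parallelism mesh.

    axes: list of (name, size) pairs, slowest-varying axis first.
    """
    coords = {}
    for name, size in reversed(axes):
        coords[name] = rank % size
        rank //= size
    return coords

# Hypothetical mesh: 2 x 2 x 4 x 2 = 32 GPUs in total.
axes = [("data", 2), ("pipeline", 2), ("expert", 4), ("tensor", 2)]
coords = rank_to_coords(13, axes)
print(coords)
```

GPUs that share all coordinates except one form the communication group for that kind of parallelism, for example the four ranks that differ only in their "expert" coordinate.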

In addition, it would be nice to support simple and fast activation offload and weight quantization to reduce the memory footprint of MoE weights.
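
As a small illustration of weight quantization, the sketch below stores weights as 8-bit integers plus one floating-point scale. This symmetric per-tensor scheme is just one simple option, assumed here for illustration, and the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floating-point weights."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]   # hypothetical expert weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Each weight then needs one byte instead of four (for 32-bit floats), at the price of a rounding error of at most half the scale per weight.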

MoE instruction tuning

The FLAN-MoE study suggests that, although transferring MoE performance to downstream tasks through task-specific fine-tuning is challenging, instruction tuning works well with MoE models. This demonstrates the enormous potential of MoE-based language models.

MoE evaluation

The inductive bias of the MoE model may have other effects beyond perplexity, as do other adaptive models such as Universal Transformer and AdaTape.

Hardware challenges

It is worth mentioning that GPUs face challenges in communicating across nodes, as each node can usually only be equipped with a limited number of GPUs. This makes communication a bottleneck in expert parallelism.

Fortunately, NVIDIA recently introduced the DGX GH200, which connects 256 NVIDIA Grace Hopper Superchips so that they act as a single GPU. This largely solves the communication bandwidth problem and helps with training and deploying MoE models in the open-source community.

References:

https://twitter.com/sophiamyang/status/1733505991600148892
