This edition of Sunday Paper assumes prior knowledge of the base Mixture of Experts (MoE) architecture; if you are unfamiliar with it or would like a refresher, Hugging Face has a great blog post explaining the basics of MoE. This week I wanted to explore various extensions and implementations of this architecture, as sparse models have taken the AI world by storm. Between Mistral, DBRX, DeepSeek, Qwen, and others, MoE appears to be the current state of the art for building highly performant LLMs. The extensions we will cover this week aim to squeeze further performance out of the architecture, address its weaknesses, and propose new methods for training these creatures. This topic has been particularly hot over the past nine months, so this list is far from exhaustive, but I do find it to be an important one.
Sunday: X-MoE (paper)
X-MoE aims to solve an instability within MoE architectures known as representation collapse. This occurs when the hidden states exiting the attention mechanism are pulled towards the expert embeddings, hindering performance. X-MoE mitigates this issue through two key design decisions. First, X-MoE uses a cosine-similarity routing strategy in place of the traditional dot-product strategy. Second, X-MoE uses low-dimensional expert embeddings, meaning hidden states must first be down-projected into this small routing space, which limits how much information is used for expert routing. I love this paper and I hope future work experiments with this more, as I think it could yield some valuable performance gains, even if they are slight.
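To make those two decisions concrete, here is a minimal sketch of how I picture the routing (my own simplification, not the official X-MoE code); the routing dimension, expert count, top-k, and learnable temperature are all assumptions on my part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineRouter(nn.Module):
    """Sketch of X-MoE-style routing: hidden states are down-projected into a
    small routing space and compared to expert embeddings via cosine similarity."""
    def __init__(self, d_model=768, d_route=16, n_experts=32, top_k=2):
        super().__init__()
        self.down_proj = nn.Linear(d_model, d_route, bias=False)  # low-dim routing space
        self.expert_emb = nn.Parameter(torch.randn(n_experts, d_route))
        self.temperature = nn.Parameter(torch.tensor(0.07))       # learnable scale (assumption)
        self.top_k = top_k

    def forward(self, x):                                     # x: (batch, seq, d_model)
        h = F.normalize(self.down_proj(x), dim=-1)            # unit-norm token routing vectors
        e = F.normalize(self.expert_emb, dim=-1)              # unit-norm expert embeddings
        logits = h @ e.t() / self.temperature                 # cosine similarity, scaled
        weights, indices = logits.topk(self.top_k, dim=-1)    # pick top-k experts per token
        return F.softmax(weights, dim=-1), indices

router = CosineRouter()
gates, expert_ids = router(torch.randn(2, 10, 768))
print(gates.shape, expert_ids.shape)   # both (2, 10, 2)
```

Because both the token vectors and the expert embeddings are normalized, no single expert embedding can dominate the routing scores by growing in magnitude, which is the intuition behind why this helps with representation collapse.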
Monday: MH-MoE (paper)
MH-MoE aims to incorporate multiple FFN heads into the MoE network. Similar to MHA, MH-MoE heads are created through a linear up-projection, and each head is then independently routed to a subset of experts. While the authors show increased performance with this new formulation, I actually believe the gains come from the same effect as DeepSeekMoE and its “granular” experts, something which will be covered in this week's Thursday paper.
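Here is a rough sketch of how I picture the head-splitting (again my own simplification, with made-up dimensions and a plain top-1 router rather than the authors' implementation):

```python
import torch
import torch.nn as nn

class MultiHeadSplit(nn.Module):
    """Sketch of the MH-MoE idea: project each token, split it into several
    smaller 'head' tokens, route each head independently, then merge back."""
    def __init__(self, d_model=768, n_heads=4, n_experts=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.head_proj = nn.Linear(d_model, d_model)    # projection before the head split
        self.merge_proj = nn.Linear(d_model, d_model)   # merge heads back into one token
        self.router = nn.Linear(self.d_head, n_experts)
        # tiny per-expert FFNs operating on the small head dimension
        self.experts = nn.ModuleList(nn.Sequential(
            nn.Linear(self.d_head, 2 * self.d_head), nn.GELU(),
            nn.Linear(2 * self.d_head, self.d_head)) for _ in range(n_experts))

    def forward(self, x):                                        # (batch, seq, d_model)
        b, s, _ = x.shape
        heads = self.head_proj(x).view(b, s, self.n_heads, self.d_head)
        gates = self.router(heads).softmax(dim=-1)               # per-head routing scores
        top_gate, top_idx = gates.max(dim=-1)                    # top-1 expert per head
        out = torch.zeros_like(heads)
        for e, expert in enumerate(self.experts):                # naive dispatch loop
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(heads[mask]) * top_gate[mask].unsqueeze(-1)
        return self.merge_proj(out.view(b, s, -1))

layer = MultiHeadSplit()
print(layer(torch.randn(2, 10, 768)).shape)   # torch.Size([2, 10, 768])
```

Notice that splitting into heads effectively gives you more, smaller routing decisions per token, which is why I see the gains as closely related to DeepSeekMoE's granularity argument.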
Tuesday: MoA (paper)
MoA seeks to extend the routing mechanism of MoE to the Query and Output transformations within MHA, allowing each token to create a unique attention computation through routing. It is worth noting that the Keys and Values are not routed; they are generated in the same way as in standard MHA. The authors show that this model outperforms MHA; however, they also show that performance does not increase as the number of attention experts grows, unlike in MoE. This finding makes me believe there is an issue within the MoA computation/formulation, though my thoughts on this are not yet well formed. My chief complaint is that the routing of Queries feels poorly motivated to me. If we view Queries and Keys as questions and answers, then the number of valuable questions that can be asked of a set of static answers is quite small. Combine this with the fact that the Queries and Keys are already created in an input-dependent manner, and I struggle to see the value of a sparse model here. While I would love to say more, I will likely save it for a full feature blog post where I can discuss it in depth and potentially present code and experiments.
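For concreteness, here is a simplified sketch of what a routed projection could look like (my own toy construction with a top-1 router, not the MoA authors' code); the shared Key/Value projection mirrors standard MHA.

```python
import torch
import torch.nn as nn

class RoutedProjection(nn.Module):
    """Sketch of MoA-style routing: each token picks one projection matrix
    (an 'attention expert') from a pool. In MoA this is done for the Query
    and Output projections, while Keys and Values use a single shared matrix."""
    def __init__(self, d_model=512, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.Parameter(torch.randn(n_experts, d_model, d_model) * d_model**-0.5)

    def forward(self, x):                           # (batch, seq, d_model)
        gate = self.router(x).softmax(dim=-1)       # routing scores per token
        idx = gate.argmax(dim=-1)                   # top-1 expert per token
        w = self.experts[idx]                       # (batch, seq, d_model, d_model)
        out = torch.einsum('bsd,bsde->bse', x, w)   # per-token matmul with the chosen expert
        return out * gate.gather(-1, idx.unsqueeze(-1))

q_proj = RoutedProjection()         # routed Query projection
kv_proj = nn.Linear(512, 2 * 512)   # shared Key/Value projection, as in standard MHA
x = torch.randn(2, 10, 512)
q = q_proj(x)
k, v = kv_proj(x).chunk(2, dim=-1)
print(q.shape, k.shape, v.shape)
```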
Wednesday: JetMoE (model | paper)
JetMoE combines MoE with the MoA mechanism from Tuesday's paper. While the authors claim this combination is what drives their increased performance over models such as DeepSeekMoE (which we will cover tomorrow), they also use two approaches covered in the First Edition of Sunday Paper that I believe are responsible for the gains. The first is that JetMoE uses the training data setup of MiniCPM, something they specifically cite in their paper. The second is that they use dSFT (the d stands for distilled). This approach is rather similar to the Phi models' SFT phase, where synthetic data generated by a highly performant LLM such as GPT-4 is used for fine-tuning. These two approaches increase the overall quality of the data during the later phases of training. The biggest headline, however, is the authors' claim that this model can be trained for only $80k in compute costs. If true, this would be an incredible achievement and a win for AI democratization.
Thursday: DeepSeekMoE (model | paper)
DeepSeekMoE introduces some huge breakthroughs in MoE architectures. If you read only one paper from this week, it should be this one. The paper proposes two key adaptations to the MoE framework. The first, mentioned above, is the use of “granular” experts: the authors increase the number of experts (and the number of selected experts) by some multiplicative factor while reducing the internal dimension of each expert by that same factor. This keeps overall computation the same as the base MoE architecture while allowing experts to become even more specialized. The second contribution is shared experts. These shared experts are held out from the routing mechanism and instead are fed every input, which helps to decrease expert duplication.
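A minimal sketch of both ideas (my own toy construction, not DeepSeek's code; the dimensions, expert counts, and granularity factor are illustrative, and the dispatch loop is deliberately naive):

```python
import torch
import torch.nn as nn

def ffn(d_model, d_hidden):
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

class GranularSharedMoE(nn.Module):
    """Sketch of the two DeepSeekMoE ideas: (1) 'granular' experts -- multiply the
    expert count and top-k by a factor m while dividing each expert's hidden size
    by m, keeping compute roughly constant; (2) shared experts that bypass the
    router entirely and see every token."""
    def __init__(self, d_model=512, d_ffn=2048, n_experts=8, top_k=2,
                 granularity=4, n_shared=1):
        super().__init__()
        d_small = d_ffn // granularity                        # smaller experts ...
        n_routed = n_experts * granularity                    # ... but more of them
        self.top_k = top_k * granularity                      # ... and more selected per token
        self.router = nn.Linear(d_model, n_routed)
        self.routed = nn.ModuleList(ffn(d_model, d_small) for _ in range(n_routed))
        self.shared = nn.ModuleList(ffn(d_model, d_small) for _ in range(n_shared))

    def forward(self, x):                                     # (batch, seq, d_model)
        out = sum(e(x) for e in self.shared)                  # shared experts see every token
        gate = self.router(x).softmax(dim=-1)
        top_g, top_i = gate.topk(self.top_k, dim=-1)
        for k in range(self.top_k):                           # naive per-slot dispatch
            for e, expert in enumerate(self.routed):
                mask = top_i[..., k] == e
                if mask.any():
                    out[mask] = out[mask] + expert(x[mask]) * top_g[..., k][mask].unsqueeze(-1)
        return out

moe = GranularSharedMoE()
print(moe(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])
```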
Friday: DeepSeek-V2 (model | paper)
DeepSeek-V2 builds upon DeepSeekMoE to create a new generation of LLM that can be run more efficiently at inference (with respect to the KV cache). This is done through MLA (Multi-head Latent Attention), which imposes a low-rank constraint on the attention computation. The KV cache can therefore store low-rank projections of the input and perform an up-projection at run time. The authors then show that the Keys and Values can share the same down-projection, and that the up-projections for the Keys and Values can be absorbed into the Query projection and Output projection matrices respectively, further reducing memory costs. This creates a model that performs at the level of Llama3-70B for the cost of only 21B active parameters. While this may seem like a free lunch, the model contains 236B total parameters, a hefty memory cost for the low compute overhead, but an impressive feat nonetheless. It is worth noting that not only is MLA more memory efficient, the authors claim it is also more performant than full MHA; there may even be opportunity for further computational efficiencies by combining GQA and MLA. The last feature I want to call out is that DeepSeek-V2 mentions a goal of increasing data quality for pre-training. This is a trend I continue to see and continue to mention in these Sunday Paper posts; I think it is the change no one is talking about that is most valuable for extracting performance from these foundation models.
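Here is a stripped-down sketch of the caching idea as I understand it (not DeepSeek's implementation; it omits RoPE handling and the actual weight-absorption trick, and the dimensions are made up):

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Sketch of the core MLA idea: cache one low-rank latent per token instead of
    full Keys and Values, then up-project at attention time. Real MLA also handles
    RoPE separately and absorbs the K/V up-projections into the Q/O matrices."""
    def __init__(self, d_model=1024, d_latent=128, n_heads=8):
        super().__init__()
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)   # shared K/V compression
        self.k_up = nn.Linear(d_latent, d_model, bias=False)      # absorbed into W_Q in practice
        self.v_up = nn.Linear(d_latent, d_model, bias=False)      # absorbed into W_O in practice
        self.q_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, cache=None):                      # x: (batch, seq, d_model)
        b, s, _ = x.shape
        latent = self.kv_down(x)                           # (batch, seq, d_latent) -- all we cache
        if cache is not None:
            latent = torch.cat([cache, latent], dim=1)
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, s, -1), latent   # return the new cache

mla = LatentKVCache()
y, cache = mla(torch.randn(1, 16, 1024))
print(y.shape, cache.shape)   # cache is (1, 16, 128) vs (1, 16, 2048) for a full KV cache
```

Even in this toy version, the cached tensor is 16x smaller per token than storing full Keys and Values, which is where the inference savings come from.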
Saturday: BranchTrainMix (paper)
BranchTrainMix (BTX) takes inspiration from BranchTrainMerge (a method for ensembling LLMs) but attempts to create a MoE model rather than a simple ensemble. To do this, BTX takes a seed model and trains multiple copies of it on various data subsets, creating unique fine-tunes. These fine-tunes are then combined into a MoE model, which can now be further fine-tuned. The ability to actually fine-tune this model provides a huge performance gain over BranchTrainMerge, and the authors argue that BTX's form of parallelism may even make it better than the default strategy for training MoEs.
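A rough sketch of the merge step as I read the paper (toy module names, a single block, and the subsequent router/fine-tuning stage is not shown): each copy's FFN becomes one expert, while the remaining weights are averaged across copies.

```python
import torch
import torch.nn as nn

def btx_merge(seed_copies):
    """Sketch of a BTX-style merge: attention (and other non-FFN) weights from the
    domain fine-tunes are averaged, while each copy's FFN becomes one expert of a
    new MoE layer whose router is freshly initialized and learned afterwards."""
    merged = {}
    state_dicts = [m.state_dict() for m in seed_copies]
    for name in state_dicts[0]:
        if name.startswith("ffn."):
            # keep every copy's FFN weights as a separate expert: experts.{i}.{...}
            for i, sd in enumerate(state_dicts):
                merged[f"experts.{i}.{name[len('ffn.'):]}"] = sd[name]
        else:
            # everything else (attention, norms, ...) is averaged across the copies
            merged[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
    return merged

# toy "transformer block" with an attention projection and an FFN
class Block(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.attn = nn.Linear(d, d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

copies = [Block() for _ in range(3)]          # stand-ins for three domain fine-tunes
merged_state = btx_merge(copies)
print([k for k in merged_state if k.startswith("experts.0")])
```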
Conclusion
This week provided a lot of great papers on MoEs; however, I am far from caught up on the subject. Qwen announced their own MoE model in a blog post a few months ago, and I am currently waiting with bated breath for the release of the associated technical paper. It will hopefully shed more light on their MoE seeding and pre-training procedure, something they only explain cursorily in the blog post. Additionally, papers such as DS-MoE could help decrease the memory required to run these models, as most MoE models require 2-4x the total parameters of a dense model to match its performance (granted, their activated parameters are 2-4x fewer than those dense models'). In the coming months I will be posting a programmer's guide to creating MoE models in PyTorch and will be open-sourcing any and all related code.
If you liked this post, please check out my main blog and consider subscribing for free. All of my content is free and will continue to be free. I try to post on my main blog twice a month on Mondays (this may change to Wednesdays going forward), and I will aim to post here every Sunday. I like to talk about cutting-edge AI research and AI philosophy in a manner that is easy to understand for semi-technical audiences.