This week's edition of Sunday Paper explores some alternative architectures that aim to replace attention. This is not a new topic; it has been explored in great detail with Linear Attention mechanisms such as RWKV, RetNet, and my personal favorite, Based (a model which I talk about here). Even SSMs such as Mamba have shown promise as attention alternatives. While these models are endlessly interesting, in this post I aim to explore other proposed architectures that are instead based on, or inspired by, recurrent designs. I think that understanding these novel language modelling architectures will be key as new hybrid architectures such as Jamba become more popular.
Similar to last week, the final papers this week focus instead on optimizers, a topic I wanted to explore more fully after talking about GaLore. These optimizer improvements are valuable regardless of architecture design, so they felt like something that could still provide value within this discussion. Happy reading!
Sunday: Griffin and Hawk (model | paper)
Inspired by the recent success of SSMs like Mamba, Griffin and Hawk introduce a new recurrent block, the LRU. While previous linear RNNs have failed, SSMs have achieved strong performance while remaining linear in time. The LRU is a return to linear RNNs, incorporating the diagonalization and exponentiation techniques used by SSMs. The LRU makes some changes as well, using a specialized random initialization (rather than the SSMs' deterministic one) and a linear scan for computation (rather than an associative scan). What is most puzzling is that this linear scan is actually faster in spite of its worse asymptotic complexity. While these differences do play a role, the key differentiator between the LRU and an SSM is that the proposed recurrent gate specializes in discarding useless input rather than discarding useless memory.
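The recurrence at the heart of this block can be sketched as a simple gated linear scan. This is a toy scalar version of a gated linear recurrence, not the paper's exact parameterization; the function name and gate shapes here are illustrative:

```python
import numpy as np

def linear_scan(a, b, x, h0=0.0):
    """Sequentially apply h_t = a_t * h_{t-1} + b_t * x_t.

    a, b, x are length-T arrays. The gate a_t (in [0, 1]) decides how
    much past state to keep, while b_t scales how much of the new
    input to admit -- gating the input, not just the memory.
    """
    h = h0
    out = []
    for a_t, b_t, x_t in zip(a, b, x):
        h = a_t * h + b_t * x_t
        out.append(h)
    return np.array(out)

T = 4
x = np.ones(T)
# With a == 0 the past is discarded entirely: output is just b * x.
print(linear_scan(np.zeros(T), np.ones(T), x))  # [1. 1. 1. 1.]
# With a == 1 and b == 1 the scan reduces to a running sum of the input.
print(linear_scan(np.ones(T), np.ones(T), x))   # [1. 2. 3. 4.]
```

The sequential loop is exactly the "linear scan" the paper describes: O(T) steps that cannot be parallelized across time, in contrast to the O(log T)-depth associative scan used by most SSM implementations.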
The two architectures proposed in this paper, Hawk and Griffin, are built upon this LRU block, with Hawk composed entirely of this new recurrence and Griffin instead interleaving the recurrence with sliding window attention. While Hawk achieves strong performance as a purely recurrent model, Griffin is the true showpiece of this paper, delivering Transformer-level performance at very low cost.
Monday: RecurrentGemma (model | paper)
RecurrentGemma applies the Griffin architecture within the Gemma training scheme. While this paper may seem like more of a formality than an actual research achievement, the RecurrentGemma model proves one of the claims made in the original Griffin paper: that recurrent architectures can achieve the same performance as Transformers. And while RecurrentGemma is slightly worse, it is much faster at inference, making it an ideal model for deployment.
Tuesday: TransformerFAM (paper)
TransformerFAM aims to incorporate a recurrent mechanism into attention to create a short-term/working memory that doesn't rely on a giant context window. This basic idea of incorporating some recurrent structure within attention has been floating around in recent research and shares a similar motivation with Infini-attention. Within TransformerFAM, these recurrent mechanisms are placed within each input block in the form of "feedback neurons". Future input blocks then attend to any feedback neurons within their context, theoretically allowing information to be compressed into these specialized neurons. Interestingly, the size of this recurrent memory does not directly correlate with better performance, a counter-intuitive discovery.
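The feedback loop can be sketched in a few lines of attention math. This is my own minimal sketch, not the paper's exact layout: `fam_block` and all the shapes here are illustrative assumptions, but they capture the two-way flow, where block tokens attend over the memory and the memory then compresses the block's outputs:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Plain scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fam_block(block, fam):
    """One block step with feedback memory (a toy sketch).

    The block's queries attend over its own tokens *plus* the current
    feedback activations `fam`; the feedback activations then attend
    over the block's outputs, compressing them into the next memory state.
    """
    ctx = np.concatenate([block, fam], axis=0)  # tokens see the memory
    out = attend(block, ctx, ctx)               # updated block representation
    new_fam = attend(fam, out, out)             # memory compresses the block
    return out, new_fam

rng = np.random.default_rng(0)
d, block_len, fam_len = 8, 4, 2
fam = rng.normal(size=(fam_len, d))
for _ in range(3):                              # process 3 blocks in sequence
    block = rng.normal(size=(block_len, d))
    out, fam = fam_block(block, fam)
print(out.shape, fam.shape)                     # (4, 8) (2, 8)
```

Note that the memory stays a fixed size (`fam_len` rows) no matter how many blocks are processed, which is exactly why it acts as a compressed working memory rather than a growing context.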
Wednesday: KAN (paper | code)
KANs (Kolmogorov-Arnold Networks) have been a hot topic in ML and AI. KANs aim to substitute the Universal Approximation Theorem used to design MLPs with the Kolmogorov-Arnold representation theorem, generating networks with learnable B-spline activation functions along network edges. Empirical evidence shows that parameter-efficient KANs are great at learning functions within a bounded region. The main drawback, however, is that KANs do not receive the same GPU hardware speedups as MLPs. While this discovery is exciting, the AI community will need to see how this technique performs on MNIST/CIFAR-type tasks as well as within ResNet architectures before it can truly be adopted at the cutting edge of AI. Overall it is very cool and will likely have applications within various scientific fields, but it is likely 1.5 years and 3-4 papers away from improving large AI models (if it ever gets there). I think KANs are worth paying attention to for now, but I will not personally be spending my time trying to rigorously understand and implement them any time soon.
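The structural difference from an MLP is easy to show in code. The sketch below simplifies the learnable edge functions to piecewise-linear interpolation on a fixed grid (real KANs use B-splines with learnable coefficients); `EdgeFunction` and `KANLayer` are names I made up for illustration:

```python
import numpy as np

class EdgeFunction:
    """A learnable 1-D activation living on a single edge, simplified
    here to a piecewise-linear function on a fixed knot grid."""
    def __init__(self, grid_min=-1.0, grid_max=1.0, n_knots=8, rng=None):
        rng = rng or np.random.default_rng(0)
        self.grid = np.linspace(grid_min, grid_max, n_knots)
        self.values = rng.normal(scale=0.1, size=n_knots)  # learnable knot heights

    def __call__(self, x):
        return np.interp(x, self.grid, self.values)

class KANLayer:
    """Maps n_in -> n_out by summing one learned function per edge:
    y_j = sum_i f_ij(x_i). Unlike an MLP there is no weight matrix and
    no fixed nonlinearity -- the functions on the edges ARE the layer."""
    def __init__(self, n_in, n_out, rng=None):
        rng = rng or np.random.default_rng(0)
        self.edges = [[EdgeFunction(rng=rng) for _ in range(n_in)]
                      for _ in range(n_out)]

    def __call__(self, x):
        return np.array([sum(f(xi) for f, xi in zip(row, x))
                         for row in self.edges])

layer = KANLayer(n_in=3, n_out=2)
y = layer(np.array([0.1, -0.4, 0.7]))
print(y.shape)  # (2,)
```

The bounded grid is also why KANs shine on functions within a bounded region: outside the knot range the learned function has no resolution left.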
I think generative modeling is particularly subtle because these two goals are not necessarily aligned: (1) fit score function well; (2) being able to generalize. My intuition is that KANs are good at (1) but not necessarily (2) (which is the true goal of generative modeling).
-Ziming Liu, first author of the KAN paper
Thursday: xLSTM (paper)
xLSTM explores whether LSTMs can compete with modern Transformer architectures, a feat the authors show is truly possible. They achieve this with two key modifications introduced within two new LSTM blocks, the sLSTM and the mLSTM. The first key change is a simple exponential gating mechanism that allows greater revision of LSTM memory; this change is used within both new blocks. The second change is the matrix memory used by the mLSTM block, an adjustment to the standard scalar memory of prior LSTM architectures that enhances the overall memory capacity. Many sources claim this architecture is parallelizable, and while the mLSTM block is, the sLSTM block is not and is in fact 1.5 times slower than the parallel mLSTM block. This still makes xLSTM a fast-training LSTM, but it is 4 times slower than FlashAttention or the associative scans used by Transformers and SSMs respectively.
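Both changes show up in a single mLSTM update step. The sketch below is a simplified scalar-gated version (the real block computes gates from the input and handles stabilization more carefully), but the exponential gates and the rank-1 matrix memory update are the genuine ideas:

```python
import numpy as np

def mlstm_step(C, n, k, v, q, f, i):
    """One simplified mLSTM memory update.

    C : (d, d) matrix memory, n : (d,) normalizer state
    k, v, q : key, value, query vectors; f, i : scalar forget/input gates.
    Exponential gating lets a strong new key-value pair largely
    overwrite old memory rather than only blend with it.
    """
    f_g, i_g = np.exp(f), np.exp(i)       # exponential gating
    C = f_g * C + i_g * np.outer(v, k)    # matrix memory: rank-1 update
    n = f_g * n + i_g * k                 # normalizer tracks key mass
    h = C @ q / max(abs(n @ q), 1.0)      # normalized retrieval
    return C, n, h

d = 4
C, n = np.zeros((d, d)), np.zeros(d)
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0, 0.0])
# Store the pair (k, v) with the forget gate fully closed, then query
# with k again: the matrix memory returns the stored value v.
C, n, h = mlstm_step(C, n, k, v, q=k, f=-np.inf, i=0.0)
print(h)  # [0. 2. 0. 0.]
```

The reason this block parallelizes is that, unrolled over time, each output depends on a weighted sum of outer products, which can be computed for all timesteps at once, much like attention; the sLSTM's memory mixing has no such form, hence the sequential bottleneck.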
Friday: DeepSpeed-ZeRO (paper)
Similar to last week's paper, GaLore, ZeRO seeks to optimize the optimizer. While GaLore did this through a low-rank optimizer, ZeRO reduces the memory footprint of the optimizer states through de-duplication and the removal of redundancies. This improvement targets multi-GPU setups, where these redundancies appear, and shows incredible speed-ups when training a 100B-parameter model. This is incredibly valuable as our models continue to scale, since it means we can continue to grow these models in either parameter count or batch size.
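The savings are easy to sketch with the paper's own accounting: mixed-precision Adam keeps 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master params, momentum, variance) per parameter, and ZeRO's three stages progressively shard states, then gradients, then parameters across GPUs. The helper below is my own back-of-the-envelope function, not DeepSpeed code:

```python
def zero_memory_per_gpu(params, n_gpus, stage):
    """Rough per-GPU training memory in bytes for mixed-precision Adam,
    following the ZeRO paper's 2 + 2 + 12 bytes-per-parameter accounting.
    Stage 1 shards optimizer states, stage 2 adds gradients, stage 3
    adds the parameters themselves."""
    p, g, os = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        os /= n_gpus
    if stage >= 2:
        g /= n_gpus
    if stage >= 3:
        p /= n_gpus
    return p + g + os

# The paper's running example: a 7.5B-parameter model on 64 GPUs.
params, n_gpus = 7.5e9, 64
for stage in range(4):
    print(f"stage {stage}: {zero_memory_per_gpu(params, n_gpus, stage) / 1e9:.2f} GB")
```

This reproduces the roughly 120 GB baseline shrinking to about 31 GB, 17 GB, and finally under 2 GB per GPU, which is the "de-duplication" in concrete numbers: every GPU was holding a full copy of state that only one of them actually needed.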
Saturday: DeepSpeed-ZeRO++ (paper)
ZeRO++ further improves upon ZeRO by introducing three techniques designed to reduce the communication required between GPUs in a training cluster: block-based model quantization, maintaining a full model copy on each machine, and quantized gradient communication and recovery. These adjustments are shown to reduce the communication overhead of the original ZeRO by 4x, a vital improvement in low-bandwidth or small-batch-size training environments. This huge improvement continues the push to make training LLMs that much easier and more democratized.
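The quantization idea behind the first and third techniques can be sketched simply: split a tensor into small blocks and give each block its own scale, so one outlier value cannot destroy the precision of the whole tensor. This is a toy symmetric int8 version I wrote for illustration, not ZeRO++'s actual kernels:

```python
import numpy as np

def block_quantize(x, block_size=4):
    """Symmetric int8 quantization with one scale per block, a toy
    version of the block-based scheme applied before communication."""
    x = x.reshape(-1, block_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def block_dequantize(q, scale):
    """Recover an approximate fp32 tensor on the receiving side."""
    return (q.astype(np.float32) * scale).reshape(-1)

grads = np.random.default_rng(0).normal(size=16).astype(np.float32)
q, scale = block_quantize(grads)
recovered = block_dequantize(q, scale)
# The int8 payload is 4x smaller than fp32, and per-block scales keep
# the round-trip error small.
print(q.nbytes, grads.nbytes, np.abs(grads - recovered).max())
```

Sending `q` plus one scale per block instead of full-precision values is where the communication-volume reduction comes from; the "recovery" step is the dequantization on arrival.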
Conclusion
While last week I came away with some overall feelings about the ways in which we can improve LLM training, this week I find myself without a solid moral to the story. However, I feel no less impacted by the findings of these papers, specifically the LRU proposed in Sunday's reading and maybe even the mLSTM proposed in Thursday's. While it is not research I seek to do myself, I feel these blocks may find value within a hybrid-style architecture like we saw in Jamba, where Mamba blocks and Transformer blocks were woven together.
If you enjoyed the discussion about optimizers this week and last, I think it is worth investigating the entire body of DeepSpeed work. It is amazing and incredibly helpful for increasing access to AI for everyone in the field.
If you liked this post, please check out my main blog and consider subscribing for free. All of my content is free and will continue to be free. I try to post on my main blog twice a month on Mondays (this may change to Wednesdays going forward), and I will aim to post here every Sunday. I like to talk about cutting-edge AI research and AI philosophy in a manner that is easy to understand for semi-technical audiences.