Beyond the Transformer: The Elements of Future AI Architectures
An analysis of what it will take to replace the Transformer, and a look at the current research that could lead to a replacement.
Attention has been all AI has needed for almost 6 years now, and since the Transformer made its debut in NLP, many new model architectures have made an attempt at the throne, seeking to become the next keystone architecture in AI. While the Transformer architecture is far from perfect, its ability to efficiently scale input volume and model size while avoiding the exploding/vanishing gradient problem makes it an incredibly flexible architecture that has been nearly impossible to replace. This seeming invincibility has made it easier for research to augment the existing Transformer architecture than to replace it outright. The result is 6 years of research on architectural improvements and hardware optimizations, further cementing the Transformer as the best general sequence model architecture available. With the explosion in popularity of LLMs as the current peak of AI, any architecture that wants to replace the Transformer will have to perform at gigantic parameter scales, approaching tens or even hundreds of billions of parameters. And this is the ultimate barrier to entry: how can any new architecture show enough promise at the hundred-million to few-billion parameter scale to earn a test run at a level competitive with LLMs?
In this article we will analyze the value of the Transformer architecture as a general sequence model and LLM backbone, as well as some of the properties that new models will seek to improve upon. While this article mentions some of the new architectural challengers, the focus is on the properties the next great general sequence model will need to have, and on the work that could serve as a foundation for a Transformer replacement, or perhaps just a future Transformer augmentation.
The Glaring Weakness
While Transformer architectures exhibit notable strengths which have made them the keystone architecture for sequence modelling tasks, they also harbor certain limitations that researchers aim to address. Chief among these limitations is the Transformer’s quadratic inference cost with respect to sequence length. This cost arises from the fact that each element in the input sequence attends to every other element, leading to a quadratic growth in the total number of pairwise interactions as the sequence length increases.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33524e8-a8cf-4aae-8810-4d9253c91d4e_621x291.png)
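To make the quadratic term concrete, here is a minimal single-head self-attention in NumPy. It is a sketch only: learned Q/K/V projections, masking, and multiple heads are all omitted. The point is the `scores` matrix, whose size grows as the square of the sequence length.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention; the (n, n) score matrix is the quadratic term."""
    n, d = x.shape
    q, k, v = x, x, x                         # learned projections omitted for brevity
    scores = q @ k.T / np.sqrt(d)             # shape (n, n): every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                        # shape (n, d)

x = np.random.randn(512, 64)
print(self_attention(x).shape)                # (512, 64); `scores` held 512 * 512 entries
```

Doubling the sequence length quadruples the size of `scores`, and with it the compute and memory the layer needs.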
This stands in stark contrast to Recurrent Neural Networks (RNNs), whose cost is linear in sequence length because each step only needs the previous hidden state and the current input token. An RNN therefore only needs to save the output of the previous step, while a Transformer needs to save the entire output of all previous steps. Resolving the quadratic inference cost would enable longer context windows and larger-scale models, as well as cheaper model deployment and faster result generation.
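A toy comparison of the two memory footprints, again as a hedged sketch with made-up weights rather than a real implementation:

```python
import numpy as np

d = 64
W_h, W_x = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1

def rnn_step(h, x_t):
    """One recurrent step: the next state depends only on the previous state
    and the current input, so memory stays constant however long the sequence."""
    return np.tanh(h @ W_h + x_t @ W_x)

h = np.zeros(d)      # the RNN's entire memory: one d-dimensional vector
kv_cache = []        # what a Transformer must keep instead: every past step

for t in range(10_000):
    x_t = np.random.randn(d)
    h = rnn_step(h, x_t)
    kv_cache.append(x_t)   # grows by one entry per token

print(h.shape, len(kv_cache))   # (64,) vs 10,000 cached entries
```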
Any model that aims to replace the Transformer must achieve sub-quadratic inference. The extension of context windows, however, may offer an additional benefit related to long-range dependencies in text processing and understanding. If the attention window can be expanded or removed entirely, language models could handle entire documents or books, significantly enhancing their memorization and reasoning capabilities. This would markedly improve text embeddings and summaries, as the model would be able to “remember” far more of the processed text.
In defense of the Transformer
On the positive side, Transformer architectures exhibit several compelling advantages. Their parallelization capability allows for efficient training on specialized hardware like GPUs, reducing overall training times and allowing the model to scale both the number of training tokens and the overall model size. This is in stark contrast to the RNN, which must be trained sequentially, making it incredibly difficult to feed enough data into the model. On top of this, RNNs suffer from the exploding/vanishing gradient problem because they must carry information from earlier in the sequence through every step; Transformers sidestep this by attending directly to all tokens in the sequence, making scalability in both data and parameters easy. A sketch of this training-time difference follows the figure below.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b7dd3c-52f3-4e41-a081-4003a0ae0437_800x505.png)
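Here is that difference as a rough sketch (projections, gradients, and everything else stripped away): the masked attention pass produces every position's output from one batched matrix multiplication, while the RNN is stuck in a loop where each step waits on the last.

```python
import numpy as np

n, d = 512, 64
x = np.random.randn(n, d)

# Transformer training: one masked matmul computes every position at once.
scores = x @ x.T / np.sqrt(d)
scores[np.triu_indices(n, k=1)] = -np.inf     # causal mask: no peeking at future tokens
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
parallel_out = weights @ x                    # all n outputs in one GPU-friendly pass

# RNN training: step t cannot begin until step t - 1 has finished.
W = np.random.randn(d, d) * 0.1
h = np.zeros(d)
for t in range(n):
    h = np.tanh(h @ W + x[t])                 # an unavoidable sequential chain
```

The matmul maps directly onto GPU hardware; the loop does not, no matter how many GPUs you throw at it.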
The scalability of Transformers enables effective modeling of relationships across a wide variety of sequence types. This flexibility is a notable asset, as Transformers can be applied to diverse tasks with minimal task-specific modifications, and it is responsible for the Foundation Model (FM) revolution. All of this has led to state-of-the-art performance in the natural language, natural image, and speech domains.
Another key benefit of the Transformer is its capacity for Associative Recall (AR): the ability to remember unique pairs of tokens from the distant past. A simple example would be a first-last name pairing such as Albert Einstein. A Transformer only needs to see this pairing a few times to begin accurately predicting that Einstein will appear after Albert. While this may not seem like an incredibly powerful feature, Transformer challengers such as RWKV and RetNet struggle to memorize long-range AR pairs.
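A toy version of the kind of synthetic probe used to measure AR (the token names here are made up for illustration):

```python
import random

# Show each key-value pairing once, then query a key after a long gap of
# distractor tokens. Strong AR means recovering the value despite the gap.
pairs = {"Albert": "Einstein", "Marie": "Curie", "Isaac": "Newton"}

sequence = []
for first, last in pairs.items():
    sequence += [first, last]                          # each pairing appears once
sequence += [f"filler_{i}" for i in range(1_000)]      # long-range gap
query = random.choice(list(pairs))
sequence.append(query)

# A model with strong AR should now predict pairs[query] as the next token.
print(f"after {len(sequence)} tokens, '{query}' should be followed by '{pairs[query]}'")
```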
The next challengers
There has been a litany of challengers to the Transformer in recent years alone. Models such as RetNet, RWKV, Hyena, Linear Transformers, and a few others have all set their sights on becoming the next Transformer, but none have yet succeeded. They have, however, shown what the next Transformer will need to look like in order to be successful. It will have to maintain existing training parallelization, excel at AR, and run sub-quadratically at inference, all while scaling to billions of parameters; and the truth is that all of this together may still not be enough to unseat the Transformer. Because of the massive amount of technical development invested in Transformers, the next keystone architecture will likely also have to either integrate seamlessly with existing hardware or bring a novel capability that the Transformer does not already possess.
Two current models have a line of research that may actually have a chance to wear the crown: Mamba and BASED. While neither model has been explored at the hundred-billion parameter scale, both have shown empirical promise at a couple billion parameters, often out-competing Transformer architectures that have billions more parameters.
The Mamba architecture was developed on the back of state space model (SSM) research, with S4 marking the first breakthrough in accurate sequence modelling using an SSM. While the underlying SSM in Mamba (S6) has evolved quite drastically since its inception, the model is built around the idea of maintaining and memorizing long-range dependencies, something Transformers cannot do outside of their context window. Couple this with parallel training, strong AR, and sub-quadratic (linear-time) inference, and Mamba is a promising competitor. The Mamba model, however, appears to have two key drawbacks, the first of which is its misfit with modern hardware. Mamba trains using a parallel scan, an efficient operation but one that does not map onto GPUs as naturally as the Transformer's matrix multiplications. On top of this, the backpropagation step for Mamba is rather complicated, as intermediate states must be recomputed in order to avoid memory bottlenecks. While neither of these drawbacks spells the end of the Mamba architecture and its potential, future research will likely need to solve these issues for Mamba to overtake the Transformer. A loose sketch of the recurrence at its core follows.
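This is only a one-channel caricature of the selective recurrence, with every parameter name (`W_B`, `W_C`, `W_dt`) invented for illustration; the real model runs this per channel with a hardware-aware parallel scan at training time. What it shows is the core idea: a fixed-size state updated with input-dependent dynamics.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_dt):
    """Loose one-channel sketch of a selective state-space recurrence (S6-style).

    The hidden state h has a fixed size, so per-token inference cost is constant;
    B, C, and the step size dt all depend on the input, which is the 'selective'
    part that lets the model choose what to keep and what to forget.
    """
    h = np.zeros(A.shape[0])
    y = np.zeros(x.shape[0])
    for t in range(x.shape[0]):
        dt = np.log1p(np.exp(W_dt * x[t]))   # softplus keeps the step size positive
        A_bar = np.exp(dt * A)               # discretized diagonal state transition
        B_t, C_t = W_B * x[t], W_C * x[t]    # input-dependent (selective) projections
        h = A_bar * h + dt * B_t * x[t]      # fixed-size state update
        y[t] = C_t @ h                       # input-dependent readout
    return y

state_dim = 16
A = -np.exp(np.random.randn(state_dim))      # negative entries keep the state stable
y = selective_ssm(np.random.randn(1_000), A,
                  W_B=np.random.randn(state_dim),
                  W_C=np.random.randn(state_dim), W_dt=0.5)
print(y.shape)                               # (1000,): linear time, constant memory
```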
The BASED architecture, on the other hand, was built with a focus on AR, the key pitfall of all Transformer challengers until now. To do this, BASED combines a short-range convolution with a long-range linear attention whose feature map is a Taylor-series approximation of softmax attention. In addition to strong AR, BASED parallelizes easily at training time and runs sub-quadratically at inference, all while running well on GPUs using traditional Transformer computation methods. This makes the BASED architecture incredibly attractive, as it currently looks like a better version of the Transformer, needing only to be tested on large-scale LLMs in order to empirically verify its performance. A sketch of the Taylor-series trick follows.
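The idea, hedged as a non-causal sketch (the causal version keeps running prefix sums, and BASED pairs this with the short convolution mentioned above): replace exp(q·k) with its second-order Taylor expansion, which factors into a feature map and lets the key/value statistics be summed once and reused for every query.

```python
import numpy as np

def taylor_feature_map(x):
    """Feature map phi such that phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2,
    the second-order Taylor expansion of exp(q.k)."""
    n, d = x.shape
    second_order = np.einsum('ni,nj->nij', x, x).reshape(n, d * d) / np.sqrt(2)
    return np.concatenate([np.ones((n, 1)), x, second_order], axis=-1)

def linear_attention(q, k, v):
    """Attention through the feature map: key/value statistics are summed once
    and reused for every query, so no (n, n) score matrix is ever formed."""
    phi_q, phi_k = taylor_feature_map(q), taylor_feature_map(k)
    kv = phi_k.T @ v                        # (features, d_v): one pass over the keys
    z = phi_k.sum(axis=0)                   # (features,): normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

n, d = 1_000, 16
q, k, v = (np.random.randn(n, d) / d ** 0.25 for _ in range(3))
print(linear_attention(q, k, v).shape)      # (1000, 16): cost linear in n
```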
This may make it seem like BASED is the clear winner with Mamba a close second; however, the long-range dependency strengths of Mamba will be the true wild-card. For tasks such as document summarization or embedding, where long-range dependencies are vital, Mamba is likely the preferred architecture. Additionally, any comparison of the two models is likely premature, as neither has been tested at LLM-like scales and both will require ample research before we understand them the way we understand modern Transformer architectures.
Conclusion
While Mamba and BASED are both incredibly promising, I wouldn't expect the Transformer to leave its throne any time soon. The current LLM craze around ChatGPT and Gemini was built on the back of Transformers, and the cost of training a new LLM around either of these challengers is likely to outweigh the benefits at this moment. Combine this with the large amount of research still needed to intuitively understand how these architectures work, and it may not be until the next generation of LLMs that we see a new architecture adopted. All of that said, we know what properties the next Transformer will need to have, chief among them sub-quadratic inference.