I wanted to start a new section of this blog dedicated to the jungle of AI research. I have challenged myself to read a paper each day until the end of July, and I want to post the papers I choose here with short descriptions and my personal key takeaways. These posts will be much less flashy than a full-blown blog post, but they will help show “how the sausage is made”, i.e. how I go about learning new material in the field of AI.
Each Sunday I plan to post a schedule with seven papers that I have found interesting enough to dedicate real time to reading and understanding. My discussion of these papers will be rather limited within these posts, and I will likely save more groundbreaking thoughts and feelings for the main feed, as I would like these posts to first and foremost serve as a source of inspiring research. Finding good research papers has been one of the hardest skills for me to develop, so I want to share what I find with those who follow my content.
For the first Sunday Paper I wanted to surface some research around data-efficient training of LLMs. These papers are part of a trend I have seen where researchers are beginning to analyze the dynamics of pre-training and its cumbersome nature. Removing the assumption that pre-training requires access to large corpora of unknown-quality language data and thousands of hours of compute is invaluable for the democratization of AI research. It is my opinion that some of the findings shown in these papers are what has created such great performance increases within models such as Llama3 (a technical report which will surely show up in a future Sunday Paper when it’s released). Even small models such as JetMoE (discussed in brief below) seem to use these findings to train highly performant models cheaply. While most of these papers focus on the data portion of pre-training, the last paper I wanted to read is a bit of a wild card: a new memory-efficient optimizer that I feel fits in with some of the views the other papers pose on training phases. But let’s get to the papers.
Monday: Phi-1 (model | paper)
Phi-1 is a relatively small (~1B param.) language model designed for code generation and code question answering. The paper’s big claim is that high-quality training data can create “above parameter” performance. The major hurdle for this paper is the curation of that high-quality dataset. The authors use a high-performance language model (likely a GPT-3.5 or GPT-4 model) to generate synthetic data for training. To create diversity in this synthetic dataset, the authors inject random vocabulary constraints into each generation prompt. While I wish there were more details on this generation, the authors want to keep that information somewhat secret at the moment.
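Since the exact prompts are under wraps, here is a rough sketch of what I imagine the diversity trick could look like in practice. The seed word list, the prompt template, and the teacher-model call mentioned in the comments are all my own assumptions, not details from the paper.

```python
import random

# Hypothetical sketch of prompt-level vocabulary constraints for synthetic data
# diversity. The Phi-1 paper does not publish its prompts, so the word list,
# template, and the generate() call referenced below are illustrative only.

SEED_WORDS = [
    "queue", "recursion", "palindrome", "inventory", "matrix",
    "scheduler", "checksum", "tokenizer", "cache", "graph",
]

def build_prompt(n_constraints: int = 2) -> str:
    """Sample a few random words and require the teacher model to use them,
    nudging each generated 'textbook' exercise toward a different topic."""
    words = random.sample(SEED_WORDS, k=n_constraints)
    return (
        "Write a short, self-contained Python tutorial section with a worked "
        f"code example. It must naturally involve the concepts: {', '.join(words)}."
    )

# Each prompt would then be sent to the teacher model (a generate() call to
# whatever API is being used) to produce one synthetic training document.
for prompt in (build_prompt() for _ in range(3)):
    print(prompt)
```

Randomizing the constraints keeps the teacher from collapsing onto a handful of canned exercises, which is presumably where much of the dataset’s diversity comes from.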
Tuesday: Phi-1.5 (model | paper)
Phi-1.5 serves as an extension of Phi-1, applying the same methodology discussed above to the domain of language understanding. This involves the creation of a new synthetic dataset for model training, following prompting techniques similar to those above. The authors additionally posit that this synthetic dataset generation can be viewed as a form of “soft” knowledge distillation.
Wednesday: Phi-2 (model | blog)
Phi-2 serves as another extension of the Phi model family, but rather than extending Phi to a new domain, the authors scale up both the parameter count and the dataset size. Again the model trains on synthetic data, and again we see strong performance at a small scale.
Thursday: Phi-3 (model | paper)
Phi-3 (I promise this is the last Phi model) serves as a continuation of Phi-2, once again increasing the parameter count and dataset size. The interesting piece of this research is that it seems to confirm the improved scaling of these high-quality-data models: not only do these models show great performance at small scales, but their performance also grows faster with parameter count than that of traditional language models trained on copious amounts of mixed-quality data.
Friday: Rho-1 (model | paper)
Finally, a new model! Rho-1 introduces a new mechanism for computing the loss, something the authors call Selective Language Modeling (SLM). SLM aims to increase the effective quality of the pre-training dataset by using a reference model trained on high-quality data to score the importance of individual tokens during training. Tokens on which the training model lags the reference model are kept in the loss while all other tokens are discarded (it is worth noting that this selection is dynamic throughout training even though the reference model remains frozen, since the score also depends on the training model’s current loss). This selection mechanism artificially increases the quality of the training signal and yields similar gains in training efficiency.
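To make the mechanism concrete, here is a minimal PyTorch sketch of a selective token loss in this spirit. The keep ratio, the excess-loss score, and the tensor shapes are my own simplifications rather than the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(student_logits, reference_logits, targets, keep_ratio=0.6):
    """Sketch of a Rho-1-style selective token loss.

    Tokens are scored by how much worse the training model does than a frozen
    reference model trained on high-quality data ("excess loss"); only the
    top-scoring tokens contribute to the loss. keep_ratio and the scoring rule
    are illustrative assumptions.
    """
    vocab = student_logits.size(-1)
    student_ce = F.cross_entropy(
        student_logits.view(-1, vocab), targets.view(-1), reduction="none"
    )
    with torch.no_grad():
        ref_ce = F.cross_entropy(
            reference_logits.view(-1, vocab), targets.view(-1), reduction="none"
        )
    excess = (student_ce - ref_ce).detach()     # score only; carries no gradient
    k = max(1, int(keep_ratio * excess.numel()))
    _, keep_idx = torch.topk(excess, k)         # tokens the model most needs to learn
    return student_ce[keep_idx].mean()          # loss over the selected tokens only

# Usage with dummy tensors (batch=2, seq=8, vocab=100).
B, T, V = 2, 8, 100
student_logits = torch.randn(B, T, V, requires_grad=True)
reference_logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
loss = selective_lm_loss(student_logits, reference_logits, targets)
loss.backward()
```

The key point is that gradients only flow through the tokens the score keeps, so low-value tokens never influence the update even though they still pass through the model.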
Saturday: MiniCPM (model | paper)
MiniCPM introduces a two-stage pre-training pipeline. The first stage looks similar to traditional pre-training, while the second stage is what the authors call the “annealing” phase, in which high-quality data is introduced to improve model performance. Whereas the papers discussed above seek to remove the need for mixed-quality data altogether, MiniCPM recognizes that mixed-quality data still contains valuable training signal. The two-stage setup lets the model learn language basics from large volumes of mixed-quality data before using high-quality data to truly master language, theoretically reducing the volume of high-quality data necessary during pre-training. This work has been validated by other research such as JetMoE (model | paper), a work which I will likely discuss again in future Sunday Papers; according to the JetMoE authors, a high-performance model can be trained for a fraction of the usual cost using this two-phase training technique.
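As a toy illustration of the two-stage idea, here is a small sketch of a data schedule that switches from mixed-quality web data to a mostly high-quality mix in the final stretch of training. The phase boundary, mixing ratio, and helper names are purely illustrative and are not MiniCPM’s actual values.

```python
import random

# Toy sketch of a two-stage pre-training data schedule in the spirit of
# MiniCPM's annealing phase. All numbers and names here are assumptions.

def sample_batch(source, batch_size=4):
    """Stand-in for drawing a batch of documents from a corpus."""
    return [f"{source}-doc-{random.randint(0, 10_000)}" for _ in range(batch_size)]

def data_for_step(step, total_steps, anneal_frac=0.1, hq_mix=0.7):
    """Stage 1: mixed-quality web data only.
    Stage 2 (final anneal_frac of training): mostly high-quality, curated data,
    typically paired with a decaying learning rate."""
    in_anneal = step >= int((1 - anneal_frac) * total_steps)
    if in_anneal and random.random() < hq_mix:
        return sample_batch("high_quality_curated")
    return sample_batch("mixed_quality_web")

total_steps = 1_000
for step in (0, 850, 950):                 # spot-check a few steps
    print(step, data_for_step(step, total_steps)[:1])
```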
Sunday: GaLore (code | paper)
While the previous papers focused on the data used to train the model, GaLore builds on the ideas behind LoRA (paper) to create a memory-efficient optimizer. Where LoRA applies a rank decomposition to the model weights themselves, GaLore performs the rank decomposition on the gradient matrices, keeping the optimizer state in a low-rank subspace. While it is natural to assume that this technique can only be used for fine-tuning, the authors show that it can even be used to pre-train billion-parameter LLMs on a single consumer GPU without severe performance degradation. Even now there is ongoing work to make GaLore more compute efficient, a truly exciting prospect.
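Here is a very small sketch of the core idea as I understand it: project the gradient onto its top singular directions, keep Adam-style moments at that low rank, and project the update back to full shape. The rank, learning rate, and per-step projector refresh below are simplifications (the actual method refreshes the projector only every few hundred steps and adds more machinery), so treat this as an illustration rather than the paper’s algorithm.

```python
import torch

def galore_like_step(weight, grad, state, rank=4, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative optimizer step with low-rank gradient projection."""
    # Build a projector from the gradient's top singular directions.
    # (A real implementation would reuse this projector for many steps.)
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                              # (m, rank)

    g_low = P.T @ grad                           # gradient in low-rank space: (rank, n)
    if "m1" not in state:                        # optimizer state lives at low rank,
        state["m1"] = torch.zeros_like(g_low)    # which is where the memory saving comes from
        state["m2"] = torch.zeros_like(g_low)
    state["m1"].mul_(beta1).add_(g_low, alpha=1 - beta1)
    state["m2"].mul_(beta2).addcmul_(g_low, g_low, value=1 - beta2)
    update_low = state["m1"] / (state["m2"].sqrt() + eps)

    weight -= lr * (P @ update_low)              # project the update back to full shape
    return weight

# Usage on a dummy 64x32 weight matrix.
w = torch.randn(64, 32)
g = torch.randn(64, 32)
state = {}
w = galore_like_step(w, g, state)
```

Because the moments are stored at shape (rank, n) instead of the full (m, n), the optimizer memory shrinks substantially, which is what makes single-GPU pre-training plausible.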
Conclusion
The increased attention on AI research has been a blessing for the industry, accelerating progress to a rate I have never seen before. It has also led to calls for increased democratization, something I alluded to earlier in this post. I think work like this is instrumental to that goal, and it is a topic I will likely revisit in a few short weeks. There is more work in the realm of optimizers like GaLore that continues to decrease the compute resources necessary to train these models. In my opinion these works point toward a new training technique/paradigm, which I am inclined to explore further in a full blog post in the coming months. Feel free to leave a comment with your thoughts on these papers, as I would love to discuss them further.
Going forward I hope to create a post like this one every Sunday. While this one was heavily driven by a single topic, I expect future weeks to turn into a mashup of cool papers covering a whole host of different topics. As a quick sneak peek, next week’s Sunday Paper is aimed at some alternative language model architectures.
If you liked this post please check out my main blog and consider subscribing for free. All of my content is free and will continue to be free. I try to post on my main blog twice a month on Mondays (this may change to Wednesdays going forward) and I will aim to post here every Sunday. I like to talk about cutting-edge AI research and AI philosophy in a manner that is easy to understand for semi-technical audiences.