Tailoring Intelligence Part 2: Model merging

Daniel Porras Reyes
9 min read · Jun 18, 2024

Takeaways

  • Model merging is an exciting, fast-growing technique that shows great promise for allowing companies to ingest new knowledge into a model in a cost-efficient and scalable way.
  • Model merging complements fine-tuning, as you can merge fine-tuned models and bring the benefits of each into one model.
  • Evolutionary model merging is potentially a game-changer, as it removes much of the complexity and guesswork from the merging process.
  • There is exciting potential for model merging in multi-modality by merging vision and language models.
  • Despite its potential, model merging remains an emerging field. Proven effectiveness in production settings will be a crucial driver of increased adoption. The field requires technical developments in areas such as merging models with different architectures and sizes. For founders interested in model merging, the most practical advice is to focus on techniques that have already been validated.

Introduction to model merging

In Part 1 of our Tailoring Models series, we discussed the crucial role of fine-tuning for companies, emphasizing the significant benefits it offers in improving performance for specific tasks. However, there are many instances where companies want to embed new knowledge into the model that may not be related to the original training data. Fine-tuning processes like LoRA are less capable of adapting to domains that differ substantially from the original training data. The alternative in such circumstances is to pre-train from scratch, but this is too expensive, requires extensive data, and is too complex for most companies. To address these challenges, an emerging field shows great promise: model merging.

Model merging is the process of combining the weights and layers of different models into a single, unified model without requiring additional training or fine-tuning. By merging models, developers can retain essential knowledge while integrating new information. This technique allows them to leverage the strengths of each individual model, providing a cost-effective approach to developing new models that is often achievable using just a CPU. Interestingly, when you merge two models that excel at separate tasks, you not only end up with a model capable of performing both tasks, but it often outperforms the original models on each individual task.
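To make the idea concrete, here is a minimal sketch of the simplest form of merging: a linear weighted average of two checkpoints that share the same architecture. The model names and the 0.5 mixing ratio are placeholders; in practice you would reach for a dedicated tool like Mergekit rather than hand-rolling this.

```python
# Minimal sketch of a linear merge (weighted average) of two checkpoints that
# share the same architecture. Model names and the mixing ratio are placeholders.
import torch
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("org/math-specialist-7b")    # hypothetical
model_b = AutoModelForCausalLM.from_pretrained("org/coding-specialist-7b")  # hypothetical

state_a, state_b = model_a.state_dict(), model_b.state_dict()
alpha = 0.5  # mixing ratio; tuning this is part of the merging hyperparameter search

merged_state = {}
for name, tensor_a in state_a.items():
    if torch.is_floating_point(tensor_a):
        merged_state[name] = alpha * tensor_a + (1.0 - alpha) * state_b[name]
    else:
        merged_state[name] = tensor_a  # copy non-float buffers unchanged

# Load the blended weights into one of the architectures and save the result.
model_a.load_state_dict(merged_state)
model_a.save_pretrained("./merged-model")
```

Note that no gradient updates are involved, which is why a merge like this is often feasible on a CPU.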

One of the primary advantages of model merging is its ability to mitigate catastrophic forgetting, which occurs when a model loses previously learned information as new data is introduced. The merging process allows you to achieve high performance in a specific domain while maintaining the broad capabilities of the base model (merging back to a specific checkpoint). It does so with a much lower computational budget compared to full fine-tuning and replaying datasets, which can be very expensive, especially for large datasets.

Model merging has been gaining attention thanks to Arcee.ai (a Flybridge Portfolio company) and Charles Goddard, creator of Mergekit (an open-source toolkit for merging pre-trained language models that supports most of the popular merging methods through a simple YAML file). Some of the most popular methods can be found in Appendix B. In a conversation I had with Charles, he mentioned:

“Merging techniques are scalable in a way traditional approaches aren’t. You can independently fine-tune on different domains and then combine them for downstream applications without retraining over the entire set.” — Charles Goddard
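To give a feel for the workflow, below is a rough sketch of driving Mergekit from a script: you describe the merge in a YAML recipe and hand it to the mergekit-yaml command. The model names, merge method, and parameter values are illustrative only; check the Mergekit documentation for the exact schema each method expects.

```python
# Rough sketch of a Mergekit workflow: write a YAML merge recipe and run the
# mergekit-yaml CLI on it. Model names, method, and parameters are illustrative;
# consult the Mergekit docs for the exact options each merge method supports.
import subprocess
from pathlib import Path

merge_recipe = """\
merge_method: ties                      # one of the methods listed in Appendix B
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: org/medical-finetune-7b      # hypothetical fine-tune
    parameters:
      density: 0.5
      weight: 0.5
  - model: org/legal-finetune-7b        # hypothetical fine-tune
    parameters:
      density: 0.5
      weight: 0.5
dtype: float16
"""

Path("merge_config.yml").write_text(merge_recipe)

# Usage: mergekit-yaml <config> <output_dir>; merging typically runs fine on a CPU.
subprocess.run(["mergekit-yaml", "merge_config.yml", "./merged-model"], check=True)
```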

It’s important to note that you can merge fine-tuned models, so merging and fine-tuning are complementary toolsets for founders and builders. Another benefit of merging is that it can help reduce the dilution effect that comes with fine-tuning or pre-training a model. For example, you can merge a model trained on your raw data completions into an instruction model. Then, when a new instruction model with the same base comes along and surpasses the previous one’s capabilities, you can merge your completion model with the new instruction model, thereby eliminating the dilution effect.

Source: Omar Sanseviero

Adoption challenges and evolutionary model merging

Despite the potential benefits of model merging, it has not yet been widely adopted in production settings for several reasons. First, model merging is a newly emerging field, and most companies focus on proven methods rather than experimental techniques. Another significant challenge was that, until recently, model merging required a highly manual and experimental process: testing different merges by hand and working out how various merging parameters and hyperparameters affect the final model’s performance. This created a barrier to entry, as not everyone has the knowledge required to run these experiments effectively. Additionally, many people tried model merging once and, if the result was unsatisfactory, assumed that was the technique’s ceiling. In reality, a first attempt is unlikely to yield optimal results, as the nature of model merging requires trying different combinations to find the best approach. To this point, Charles shared:

“There’s a reputation that model merging is just for gaming leaderboards, which discourages serious adoption. But with the right hyperparameter tuning, it’s a powerful tool for real-world applications.” — Charles Goddard

Earlier this year, Sakana AI released a groundbreaking paper on evolutionary merging, in which they applied an evolutionary algorithm (Covariance Matrix Adaptation Evolution Strategy, or CMA-ES) to optimize the parameter choices in the merging process. This automates the combination of model weights and layers by iterating, evaluating, and merging models against defined criteria. The core concept is that if you can measure how well a model performs at a specific task, you can use that measurement to guide the optimization. You provide a list of evaluations you want to optimize for, and candidate merges go through repeated cycles of merging and evaluation until the search converges on the best-performing combination.
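As a hedged sketch of what that loop can look like in code, the open-source `cma` package can drive the search: CMA-ES proposes candidate mixing coefficients, each candidate is merged and scored on the evaluations you care about, and the scores steer the next generation. The `merge_with_weights` and `evaluate_on_task` helpers are hypothetical stand-ins for your own merging and evaluation code; this is not Sakana’s implementation.

```python
# Minimal sketch of evolutionary merging in parameter space using CMA-ES via the
# open-source `cma` package. `merge_with_weights` and `evaluate_on_task` are
# hypothetical stand-ins for your own merging and evaluation code; this illustrates
# the loop, not Sakana AI's implementation.
import cma
import numpy as np

def merge_with_weights(weights: np.ndarray):
    """Build a merged model whose per-model (or per-layer) mixing ratios are `weights`."""
    raise NotImplementedError  # e.g. a linear or TIES merge parameterized by `weights`

def evaluate_on_task(model) -> float:
    """Score the candidate model on the benchmark you care about (higher is better)."""
    raise NotImplementedError  # e.g. accuracy on a held-out evaluation set

n_coeffs = 8  # number of mixing coefficients being searched
es = cma.CMAEvolutionStrategy(n_coeffs * [0.5], 0.2)  # initial guess and step size

while not es.stop():
    candidates = es.ask()                                   # propose mixing coefficients
    scores = [evaluate_on_task(merge_with_weights(np.clip(x, 0.0, 1.0)))
              for x in candidates]
    es.tell(candidates, [-s for s in scores])               # CMA-ES minimizes, so negate

best_coeffs = es.result.xbest  # the best merging combination found by the search
```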

In the paper, they shared two ways to merge models. The first is merging in the data flow space (layers), in which you search for the best combination of layers from the different models to build a new model. The second is merging in the parameter space (weights), in which you mix the weights of the different models in varying proportions (akin to mixing colors). The two spaces can also be combined, and the paper’s results showed the largest accuracy gains when both methods were used together. It’s important to note that you can apply the evolutionary process on top of the different merging methods in Appendix B. For those who want to learn more, I enjoyed this walkthrough of Evolutionary Model Merging by Oxen.

Merging models in both data flow space and parameter space (Source)
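For intuition, here is a toy sketch of the two spaces applied to two same-architecture Hugging Face models (Llama/Mistral-style models expose their decoder blocks as `model.model.layers`). The layer recipe and mixing ratios below are invented for illustration; in evolutionary merging, these are exactly the choices the search would discover.

```python
# Toy sketch of parameter-space and (simplified) data-flow-space merging on two
# same-architecture Hugging Face models. The recipe below is invented for illustration.
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("org/model-a-7b")  # hypothetical
model_b = AutoModelForCausalLM.from_pretrained("org/model-b-7b")  # hypothetical, same architecture
n_layers = len(model_a.model.layers)

# Parameter space: blend each layer's weights with a per-layer ratio (like mixing colors).
mix = [0.75 if i < n_layers // 2 else 0.25 for i in range(n_layers)]  # made-up schedule
for i in range(n_layers):
    sd_a = model_a.model.layers[i].state_dict()
    sd_b = model_b.model.layers[i].state_dict()
    model_a.model.layers[i].load_state_dict(
        {name: mix[i] * sd_a[name] + (1 - mix[i]) * sd_b[name] for name in sd_a}
    )

# Data flow space (heavily simplified): decide which model's block occupies certain slots.
# The paper's full approach can also reorder and repeat layers, which needs extra
# bookkeeping (layer indices, config updates) omitted here.
for i in (5, 11, 17):  # illustrative slots handed wholesale to model B
    model_a.model.layers[i].load_state_dict(model_b.model.layers[i].state_dict())

model_a.save_pretrained("./evo-style-merged-model")
```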

Leveraging the evolutionary merging process, they introduced a series of models such as EvoLLM-JP, an LLM that can solve math problems in Japanese. To generate it, they created 128 random combinations and selected the best-performing ones. Amazingly, the process took just a couple of days on 16 GPUs. They also released one of the first merged models combining vision and language capabilities, built by merging a Japanese LLM (Shisa Gamma 7B) with a vision-language model (LLaVA 1.6 Mistral 7B).

This approach is highly novel and still has a long way to go, but it appears to be heading in the right direction for wider adoption. It does, however, require more computational resources than a regular merge, and the merged models inherit some of the same limitations as their base models.

Arcee: a pioneer in the model merging space

Arcee, a Flybridge portfolio company, is a pioneer in leveraging model merging as part of its SLM adaptation system. Arcee’s approach to model development emphasizes extending the capabilities of base models like Llama-2-base or Mistral-7B-base through domain-specific fine-tuning and continual pre-training on proprietary client data. It then uses model merging to combine the strengths of multiple pre-trained models into a single, versatile checkpoint that balances domain-specific expertise with general-purpose functionality. Because the strategy focuses on adapting smaller language models for specific domains, it offers substantial cost savings compared to training LLMs. Arcee has validated this approach in two case studies, for legal and medical use cases, in which the merged model outperformed the base models. In the medical case study, for example, the linear-merge model demonstrated superior performance on the PubMedQA benchmark, scoring 75.6 compared to the base Llama 2 7B Chat’s 73.4.

Jacob Solawetz, one of the founders of Arcee, emphasizes how they enable customers to own their models and the importance of privacy in their approach:

“With Arcee’s end-to-end VPC service, we enable customers to leverage the power of smaller, specialized AI models, while protecting the privacy and ownership of their data.”

Challenges, opportunities, and future areas of research

As we mentioned at the beginning, model merging is still in its early stages, and despite the advances in evolutionary model merging, adoption remains relatively low. We expect that as more novel research from companies like Arcee and Sakana emerges, adoption will grow and model merging will become a core part of the AI stack.

Current limitations, and interesting areas for future research, include merging models with different sizes and base architectures: for example, combining a 3B model from one family with a 7B Llama, or training a 1B-parameter model and merging it with a 70B-parameter model. This has been partially explored but remains very experimental and unproven.

Challenges with merging models of different architectures and sizes include, among others:

  1. Handling Cross-Attention Mechanisms: Challenges include managing how different models process and integrate attention across various segments or batches of data. The focus is on ensuring that cross-attention mechanisms, which allow models to share contextual information across different data groups, are compatible and effectively integrated.
  2. Integration of Different Attention Configurations: Integrating models with different configurations of attention heads presents a substantial challenge. These heads, crucial for determining the focus points within the data, vary significantly across models in both architecture and function. Harmonizing these differences is key to successful model merging.

Overcoming these challenges and successfully merging models of different sizes and architectures could lead to more powerful and versatile language models, making this an exciting area for future research and development.

Another exciting promise of model merging, initiated by Sakana Research, is the impact it can have on multi-modality, as it allows for the combination of capabilities from vision and language models. As more companies realize the potential of multi-modal use cases, we expect this to become an increasingly important area of research in the merging space.

Two practical pieces of advice for founders and operators are:

  • Stick to the practices that have been validated so far in model merging, such as merging models of the same base and size. Leave the experimental aspects of merging to research-focused companies like Arcee and Sakana, as that investment may not be the best allocation of resources.
  • As with fine-tuning, when you merge models, have a clear idea of the goal you want to achieve and how you will measure whether you reached it.

A key factor in driving wide adoption of model merging will be validating the results merged models can achieve in company production settings. The merging process needs to prove itself not just on academic and experimental datasets but also on company data. That is how it can cross the chasm from early adopters to the mainstream: no longer seen as experimental, but as something that delivers results and ROI, which is what companies care about.


Appendix A: Useful Sources

Appendix B: Model merging techniques

Table made by leveraging the following sources, which I encourage readers to explore: Deci’s comparison of model merging methods, the MergeKit merge methods documentation, and Julien Simon’s deep dive into model merging.

Link to table
