Model Merging by Output-Space Projection

Abstract

Model merging combines fine-tuned checkpoints into a single multi-task model without retraining. Existing methods, such as task arithmetic, model soups, TIES, and DARE, are computationally efficient and empirically successful, but they rely on heuristic design choices and lack formal optimality guarantees. We show that merging can be formulated as a convex quadratic programme over residual updates, yielding weights that minimise a squared-output calibration objective using calibration inputs and fine-tuned model outputs, and subsuming existing methods as special cases. Our framework yields a closed-form diagnostic, the fraction of residual energy captured by a chosen basis, that predicts downstream merge quality using only the calibration set. Empirically, the QP matches or outperforms existing methods in the single-layer setting, and we characterise when the optimal basis provides significant gains over the cheaper diagonal QP. We extend to multi-layer merging via a sequential layer-wise algorithm and demonstrate consistent gains across language and vision benchmarks.
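The abstract does not spell out the programme itself, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's algorithm. It assumes a single linear layer, one scalar coefficient per task update, and no constraints, which reduces the squared-output calibration objective to ordinary least squares. The names `fit_merge_coefficients`, `output_loss`, and `captured_energy` are hypothetical, and the data is random.

```python
import numpy as np

def fit_merge_coefficients(deltas, calib_inputs):
    """Least-squares coefficients for merged = base + sum_k alpha[k] * deltas[k].

    deltas[k]       : (d_out, d_in) residual update of fine-tuned model k (assumed layout)
    calib_inputs[t] : (d_in, n) calibration inputs for task t (assumed layout)
    The target on task t is the fine-tuned residual output deltas[t] @ calib_inputs[t].
    """
    T = len(deltas)
    # Column k holds the output of update k on every calibration set, flattened.
    A = np.stack(
        [np.concatenate([(deltas[k] @ X).ravel() for X in calib_inputs])
         for k in range(T)],
        axis=1,
    )
    # Target: each task's own residual output on its own calibration inputs.
    b = np.concatenate([(deltas[t] @ calib_inputs[t]).ravel() for t in range(T)])
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha, A, b

def output_loss(alpha, A, b):
    """Squared output error of the merged layer on the calibration data."""
    r = A @ alpha - b
    return float(r @ r)

def captured_energy(alpha, A, b):
    """Fraction of residual output energy reproduced by the merged update."""
    return 1.0 - output_loss(alpha, A, b) / float(b @ b)

# Toy data: 3 tasks, an 8 -> 6 linear layer, 20 calibration inputs per task.
rng = np.random.default_rng(0)
T, d_in, d_out, n = 3, 8, 6, 20
deltas = [rng.normal(size=(d_out, d_in)) for _ in range(T)]
calib_inputs = [rng.normal(size=(d_in, n)) for _ in range(T)]

alpha, A, b = fit_merge_coefficients(deltas, calib_inputs)
uniform = np.full(T, 1.0 / T)   # plain weight averaging of the updates
summed = np.ones(T)             # task-arithmetic-style sum with scale 1
print("coefficients:", alpha)
print("loss (fitted, average, sum):",
      output_loss(alpha, A, b), output_loss(uniform, A, b), output_loss(summed, A, b))
print("captured energy:", captured_energy(alpha, A, b))
```

In this simplified setting, uniform averaging and a scaled sum of updates are particular choices of the coefficient vector, so the fitted coefficients cannot do worse than either on the calibration objective. The captured-energy value here is only an analogue of the diagnostic mentioned in the abstract; the paper's basis construction is not reproduced.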


