Model Merging by Output-Space Projection
Abstract
Model merging combines fine-tuned checkpoints into a single multi-task model without retraining. Existing methods, such as task arithmetic, model soups, TIES, and DARE, are computationally efficient and empirically successful, but rely on heuristic design choices and lack formal optimality guarantees. We show that merging can be formulated as a convex quadratic programme over residual updates, yielding weights that minimise a squared-output calibration objective using calibration inputs and fine-tuned model outputs, and subsuming existing methods as special cases. Our framework yields a closed-form diagnostic (the fraction of residual energy captured by a chosen basis) that predicts downstream merge quality using only the calibration set. Empirically, the QP matches or outperforms existing methods in the single-layer setting, and we characterise when the optimal basis provides significant gains over the cheaper diagonal QP. We extend to multi-layer merging via a sequential layer-wise algorithm and demonstrate consistent gains across language and vision benchmarks.
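To make the calibration objective concrete, here is a minimal NumPy sketch for a single linear layer. It is not the paper's algorithm. It assumes the simplest possible basis (one scalar coefficient per fine-tuned model, applied to that model's residual update) and no constraints, so the quadratic programme reduces to ordinary least squares. The function name `merge_coefficients`, the toy shapes, and the way the "fraction of residual energy captured" is computed are all illustrative assumptions based on the abstract's wording.

```python
import numpy as np


def merge_coefficients(W0, task_weights, calib_inputs):
    """Fit alpha so that W0 + sum_s alpha[s] * (W_s - W0) matches, in squared
    output error, each fine-tuned model on its own calibration inputs.

    Assumption (not from the source): one scalar per model, unconstrained.
    """
    deltas = [W - W0 for W in task_weights]  # residual updates (task vectors)
    # Column s stacks, over all tasks t, the output change X_t @ delta_s.
    A = np.column_stack([
        np.concatenate([(X @ d).ravel() for X in calib_inputs])
        for d in deltas
    ])
    # Target: the output residual of each fine-tuned model on its own inputs.
    b = np.concatenate([(X @ d).ravel() for X, d in zip(calib_inputs, deltas)])
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Assumed diagnostic: share of target output energy explained by the fit.
    captured = 1.0 - np.sum((A @ alpha - b) ** 2) / np.sum(b ** 2)
    return alpha, captured


# Toy usage with random data (hypothetical shapes: 8 inputs, 4 outputs).
rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 4))
task_weights = [W0 + 0.1 * rng.normal(size=(8, 4)) for _ in range(2)]
calib_inputs = [rng.normal(size=(32, 8)) for _ in range(2)]

alpha, captured = merge_coefficients(W0, task_weights, calib_inputs)
W_merged = W0 + sum(a * (W - W0) for a, W in zip(alpha, task_weights))

# Sanity check: with a single model the fit recovers it exactly.
alpha_single, captured_single = merge_coefficients(
    W0, task_weights[:1], calib_inputs[:1]
)
```

In this simplified setting, fixing `alpha` to a constant vector corresponds to task-arithmetic-style scaling, which is one way to read the abstract's claim that existing methods arise as special cases. A richer basis or added constraints would change the solver but not the shape of the objective.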
Relevance assessment
Low. The paper is primarily about model merging in machine learning, not core information theory or communications. It is labeled cs.IT and uses projection, quadratic programming, and spectral/eigendecomposition ideas, which are adjacent technical methods, but the application domain is not a direct fit.