OVA-IB: One vs All Information Bottleneck for Multi-Modal Alignment

Abstract

Contrastive learning is effective for aligning paired views or modalities, but alignment beyond two modalities remains non-trivial and comparatively underexplored. Pairwise CLIP-style losses decompose multi-modal alignment into independent two-way comparisons and therefore do not explicitly model higher-order dependencies among multiple modalities. Recent beyond-pairwise objectives approach this problem from statistical or geometric perspectives, but arbitrary-modality alignment still lacks a principled criterion for defining what each modality should preserve and compress relative to the others. We revisit arbitrary-modality alignment through the Information Bottleneck principle. In multi-modal learning, sufficiency should preserve information predictable from the remaining modalities, while minimality should compress modality-specific information not supported by them. This naturally leads to a One-vs-All view, where each modality is characterized with respect to the remaining modalities. We propose OVA-IB, an Information Bottleneck framework for arbitrary-modality alignment. OVA-IB optimizes a tractable One-vs-All contrastive lower bound for sufficiency connected to a Dual Total Correlation-style objective, uses a parameter-free geometry-aware projection score, and derives a tractable upper-bound regularizer for minimality by bounding each representation's dependence on its own input with representation distributions induced by the remaining modalities. Experiments on classification, regression, modality-agnostic evaluation, and cross-modal retrieval benchmarks demonstrate strong and robust performance.

Core Problem and Main Methods

Core Problem

How can more than two modalities be aligned under a principled sufficiency-minimality criterion that goes beyond pairwise contrastive losses?

Setting: Arbitrary-modality multi-modal representation learning with modality-specific encoders and shared embeddings.

Main Methods

Defines modality-wise sufficiency and minimality relative to all remaining modalities rather than to a single paired counterpart.

Uses a one-vs-all InfoNCE-style objective to optimize a tractable lower bound on the dependence between each modality embedding and the tuple of remaining modality embeddings.

Connects the summed one-vs-all sufficiency terms to a Dual Total Correlation-style dependence measure using entropy inequalities.

Scores alignment by projecting a modality embedding onto the span of the other modality embeddings, avoiding a learnable concatenation-to-embedding projector.

Uses a KL upper-bound surrogate for minimality, then assumes isotropic Gaussian representation distributions to obtain a tractable squared-distance-style regularizer.
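The projection score and the one-vs-all contrastive objective described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `projection_scores` and `ova_infonce`, the ridge term `eps`, the `temperature` value, and the choice of the normalized projection length as the score are all hypothetical, since the summary does not give the exact formulas.

```python
import numpy as np


def projection_scores(anchor, others, eps=1e-6):
    """Score every anchor against the span of every candidate's other modalities.

    anchor: (B, d) embeddings of one modality.
    others: (B, M-1, d) embeddings of the remaining modalities per sample.
    Returns (B, B) with scores[b, c] = ||P_c anchor[b]|| / ||anchor[b]||,
    where P_c is the orthogonal projector onto span(others[c]).
    The score definition is an assumption made for this sketch.
    """
    # Gram matrix of each candidate's basis: (B, M-1, M-1); ridge keeps it invertible.
    gram = others @ others.transpose(0, 2, 1)
    gram = gram + eps * np.eye(gram.shape[-1])
    gram_inv = np.linalg.inv(gram)
    # Inner products of each anchor with each candidate's basis vectors: (B, B, M-1).
    coeffs = np.einsum("bd,cmd->bcm", anchor, others)
    # ||P z||^2 = (A z)^T (A A^T)^{-1} (A z); only (M-1)x(M-1) systems are solved.
    solved = np.einsum("cmn,bcn->bcm", gram_inv, coeffs)
    proj_sq = np.clip(np.einsum("bcm,bcm->bc", coeffs, solved), 0.0, None)
    norms = np.linalg.norm(anchor, axis=1, keepdims=True)
    return np.sqrt(proj_sq) / (norms + eps)


def ova_infonce(embeddings, temperature=0.1):
    """Average one-vs-all InfoNCE loss over modalities.

    embeddings: (M, B, d). For modality i, the positive for sample b is the
    span of sample b's remaining modalities; other samples act as negatives.
    """
    num_modalities = embeddings.shape[0]
    total = 0.0
    for i in range(num_modalities):
        anchor = embeddings[i]
        # Drop modality i and move the batch axis first: (B, M-1, d).
        others = np.delete(embeddings, i, axis=0).transpose(1, 0, 2)
        logits = projection_scores(anchor, others) / temperature
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -np.mean(np.diag(log_prob))
    return total / num_modalities


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(3, 8, 32))  # M=3 modalities, B=8 samples, d=32
    print("one-vs-all loss on random embeddings:", ova_infonce(demo))
```

In this sketch the only matrices inverted have size (M-1) x (M-1), which is one way to read the stated cost advantage when the embedding dimension is much larger than the number of modalities.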

Key Contributions and Further Reading

Key Contributions

Introduces OVA-IB, a one-vs-all Information Bottleneck framework for aligning an arbitrary number of modalities with modality-specific encoders and shared embeddings.

Derives a sufficiency objective in which each modality is aligned against the complementary evidence from all other modalities, rather than summing independent pairwise objectives.

Links the sufficiency objective to Dual Total Correlation, giving the method a specific information-theoretic target distinct from total-correlation-based methods such as Symile.

Provides a closed-form, parameter-free, geometry-aware projection score based on the span of the remaining modality embeddings, with a stated computational advantage over an MLP projector when the embedding dimension d is much larger than the number of modalities M.

Derives a tractable one-vs-all minimality regularizer that suppresses modality-specific nuisance information by bounding each representation's dependence on its own input using distributions induced by the remaining modalities.
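For orientation, the relation between summed one-vs-all mutual information, Total Correlation (TC), and Dual Total Correlation (DTC) can be written out with standard entropy identities. The notation Z_i for the embedding of modality i and Z_{-i} for the tuple of remaining embeddings is an assumption of this sketch, the statements are given for discrete variables, and whether the final inequality is the exact bound proved in the paper's appendix has not been verified.

```latex
\begin{align}
\mathrm{TC}(Z_1,\dots,Z_M)  &= \sum_{i=1}^{M} H(Z_i) - H(Z_1,\dots,Z_M), \\
\mathrm{DTC}(Z_1,\dots,Z_M) &= H(Z_1,\dots,Z_M) - \sum_{i=1}^{M} H(Z_i \mid Z_{-i}), \\
\sum_{i=1}^{M} I(Z_i; Z_{-i})
  &= \sum_{i=1}^{M} \bigl[ H(Z_i) - H(Z_i \mid Z_{-i}) \bigr]
   = \mathrm{TC} + \mathrm{DTC}.
\end{align}
% Han's inequalities, TC/(M-1) <= DTC <= (M-1) TC, then give
\begin{equation}
\frac{M}{M-1}\,\mathrm{DTC} \;\le\; \sum_{i=1}^{M} I(Z_i; Z_{-i}) \;\le\; M\,\mathrm{DTC}.
\end{equation}
```

For M = 2 both TC and DTC reduce to I(Z_1; Z_2), so the sum equals 2 I(Z_1; Z_2) and both inequalities hold with equality.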

Research Questions

Do the appendix proofs justify the DTC sandwich bound and the InfoNCE lower-bound connection without hidden assumptions beyond those stated in the excerpt?

How large are the absolute improvements in the main tables, especially where the text only says that OVA-IB is competitive or that the retrieval margin is modest?

How sensitive is the method to the isotropic Gaussian approximation used for the closed-form minimality regularizer?

Can the one-vs-all objective be adapted to missing modalities during pretraining without changing the theoretical criterion?
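To make the isotropic Gaussian question concrete, the following sketch compares the general KL divergence between diagonal Gaussians with the shared-variance isotropic special case, which collapses to a scaled squared distance between means. The function names and the idea of treating each representation distribution as a Gaussian around an embedding are illustrative assumptions; the paper's actual regularizer may differ in constants and in how the distributions are defined.

```python
import numpy as np


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) for 1-D arrays."""
    mu_p, var_p, mu_q, var_q = (
        np.asarray(a, dtype=float) for a in (mu_p, var_p, mu_q, var_q)
    )
    return 0.5 * np.sum(
        var_p / var_q + (mu_q - mu_p) ** 2 / var_q - 1.0 + np.log(var_q / var_p)
    )


def kl_isotropic_shared_variance(mu_p, mu_q, sigma2):
    """Special case with both covariances equal to sigma2 * I.

    The variance and log-determinant terms cancel, leaving
    ||mu_p - mu_q||^2 / (2 * sigma2).
    """
    diff = np.asarray(mu_p, dtype=float) - np.asarray(mu_q, dtype=float)
    return float(np.sum(diff ** 2) / (2.0 * sigma2))


def mean_independent_gap(var_p, var_q):
    """Part of the general KL that the squared-distance form ignores.

    It depends only on the variances, is nonnegative, and is zero
    exactly when var_p equals var_q elementwise.
    """
    var_p = np.asarray(var_p, dtype=float)
    var_q = np.asarray(var_q, dtype=float)
    return 0.5 * np.sum(var_p / var_q - 1.0 + np.log(var_q / var_p))


if __name__ == "__main__":
    mu_p, mu_q = np.array([1.0, 2.0, 3.0]), np.zeros(3)
    print(kl_isotropic_shared_variance(mu_p, mu_q, sigma2=0.5))
    print(kl_diag_gaussians(mu_p, np.full(3, 0.5), mu_q, np.full(3, 0.5)))
```

When the two distributions have different or anisotropic variances, the squared-distance term is reweighted per dimension and the variance-only gap no longer vanishes, which is one concrete way the approximation could matter.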

Limitations and Uncertainties

This summary draws only on the abstract and the paper's structure, so the correctness of the derivations and the strength of the empirical results have not been independently verified.

The method assumes that all modalities are available during pretraining, which may limit its practical impact.

Experiments are described as moderately sized, with encoders trained from scratch, which makes the work less urgent than foundation-model-scale studies.

Source Information

Body-text records: 1

References: 38

Last updated: 2026-05-30 13:21
