Diffusion Models Are Statistically Optimal for Learning Low-Dimensional Multi-Modal Distributions

Abstract

Score-based diffusion models have demonstrated remarkable empirical success in learning high-dimensional distributions, particularly those exhibiting low-dimensional and multi-modal structures. However, theoretical understanding of their statistical efficiency remains limited. Existing theories typically rely on strong regularity assumptions, such as uniformly bounded densities or globally smooth score functions, which fail to capture such intrinsic structures. In this work, we study the sample complexity of diffusion models for learning distributions supported on a union of low-dimensional subspaces. Assuming that the data distribution within each subspace is subgaussian, we show that diffusion models require at most $\widetilde{O}(\varepsilon^{-(k \vee 2)})$ samples to achieve $\varepsilon$ error in 1-Wasserstein distance, where $k$ is the intrinsic dimension. This near-optimal convergence rate depends only on the intrinsic dimension and significantly improves upon prior theoretical guarantees that suffer from the curse of dimensionality. Notably, our analysis applies to a broad collection of distributions without imposing smoothness, bounded-density, or log-concavity assumptions. Overall, our results show that diffusion models can statistically adapt to intrinsic low-dimensional structure while naturally accommodating multi-modal data, offering a rigorous theoretical justification for their success in complex high-dimensional learning tasks.

Core Problem and Main Methods

Core Problem

Sample complexity and statistical optimality of diffusion models for low-dimensional multi-modal data.

Setting: score-based diffusion on distributions supported on a finite union of low-dimensional subspaces, where the restriction of the distribution to each subspace is subgaussian.
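To make the setting concrete, here is a minimal sketch of a sampler for a distribution supported on a finite union of linear subspaces. Everything named here is an assumption for illustration and is not taken from the paper: the helper names `random_orthonormal_basis` and `sample_union_of_subspaces`, the mixing weights `pis`, and the choice of uniform latent coordinates on $[-1, 1]^{k_i}$ (bounded, hence subgaussian, and deliberately non-Gaussian).

```python
import numpy as np


def random_orthonormal_basis(d, k, rng):
    """Return a d x k matrix with orthonormal columns (hypothetical helper)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q


def sample_union_of_subspaces(n, bases, pis, rng):
    """Draw n points in R^d lying on a union of linear subspaces.

    bases : list of d x k_i orthonormal bases, one per subspace
    pis   : mixing weights over the subspaces (assumed, sums to 1)
    Latent coordinates are uniform on [-1, 1]^{k_i}: an illustrative
    subgaussian choice, not the paper's.
    """
    labels = rng.choice(len(bases), size=n, p=pis)
    d = bases[0].shape[0]
    X = np.zeros((n, d))
    for i, U in enumerate(bases):
        idx = np.flatnonzero(labels == i)
        Z = rng.uniform(-1.0, 1.0, size=(idx.size, U.shape[1]))
        X[idx] = Z @ U.T  # embed k_i-dimensional latents into R^d
    return X, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bases = [random_orthonormal_basis(10, 2, rng),
             random_orthonormal_basis(10, 3, rng)]
    X, labels = sample_union_of_subspaces(500, bases, [0.5, 0.5], rng)
    print(X.shape, np.bincount(labels))
```

Each sample has ambient dimension $d$ but only $k_i$ degrees of freedom, which is the structure the intrinsic-dimension rate is meant to exploit.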

Main Methods

Uses Gaussian smoothing (variance-exploding score estimation) as the analytical target, then transfers the score estimates to the Ornstein-Uhlenbeck (OU) reverse-diffusion setting. Decomposes the smoothed score over the union-of-subspaces components via posterior-like mixture weights $w_t(i,x)$ and component scores $s_t(i,x)$. Exploits a normal-tangent decomposition: the normal component has a closed Gaussian form, while the tangent component reduces to estimating a $k_i$-dimensional smoothed score. Builds low-dimensional Gaussian kernel density score estimators with thresholding and clipping to control instability in low-density regions. Uses sample splitting: $n_0$ samples for subspace recovery/classification and $N = n - n_0$ samples for score estimation. Converts finite-sample $L^2$ score error into $W_1$ sampling error through a reverse-diffusion Wasserstein stability bound.
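The mixture and normal-tangent decompositions can be sketched numerically. Assuming Gaussian smoothing with variance $t$ and an orthonormal basis $U_i$ for the $i$-th subspace (both notational assumptions here), the component score splits as $s_t(i,x) = -(I - U_i U_i^\top)x/t + U_i \nabla \log q_t^{(i)}(U_i^\top x)$, where $q_t^{(i)}$ is the smoothed $k_i$-dimensional latent density, and the full score is $\sum_i w_t(i,x)\, s_t(i,x)$ with $w_t(i,x) \propto \pi_i\, p_t^{(i)}(x)$. The code below is a simplified stand-in rather than the paper's estimator: it assumes the subspaces and labels are already recovered, uses a plain Gaussian kernel score in the tangent coordinates, and the names `density_floor` and `clip` are hypothetical placeholders for the thresholding and clipping steps.

```python
import numpy as np


def component_log_density_and_score(x, U, z, t, density_floor=1e-300, clip=np.inf):
    """Smoothed log-density and score of one subspace component at x.

    U : d x k orthonormal basis, z : N x k latent samples, t : noise variance.
    density_floor and clip are hypothetical regularization knobs.
    """
    d, k = U.shape
    y = U.T @ x          # tangent coordinates of x
    r = x - U @ y        # normal residual of x
    # Normal part: closed Gaussian form N(r; 0, t I_{d-k}).
    log_normal = -0.5 * (r @ r) / t - 0.5 * (d - k) * np.log(2 * np.pi * t)
    normal_score = -r / t
    # Tangent part: k-dimensional Gaussian kernel density and its score.
    diffs = z - y
    log_kern = -0.5 * np.sum(diffs**2, axis=1) / t - 0.5 * k * np.log(2 * np.pi * t)
    m = log_kern.max()
    log_q = m + np.log(np.mean(np.exp(log_kern - m)))
    if log_q < np.log(density_floor):
        tangent = np.zeros(k)            # threshold: zero score in low density
    else:
        w = np.exp(log_kern - m)
        w /= w.sum()
        tangent = (w @ diffs) / t
        norm = np.linalg.norm(tangent)
        if norm > clip:                  # clip overly large score values
            tangent *= clip / norm
    return log_normal + log_q, normal_score + U @ tangent


def mixture_score(x, bases, latents, pis, t):
    """Return posterior-like weights w_t(i, x) and the combined score at x."""
    logs, scores = zip(*(component_log_density_and_score(x, U, z, t)
                         for U, z in zip(bases, latents)))
    logits = np.log(np.asarray(pis)) + np.array(logs)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w, sum(wi * si for wi, si in zip(w, scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U1 = np.linalg.qr(rng.standard_normal((8, 2)))[0]
    U2 = np.linalg.qr(rng.standard_normal((8, 3)))[0]
    z1 = rng.uniform(-1, 1, size=(100, 2))
    z2 = rng.uniform(-1, 1, size=(100, 3))
    w, s = mixture_score(rng.standard_normal(8), [U1, U2], [z1, z2], [0.5, 0.5], 0.3)
    print(w, s.shape)
```

With the threshold and clip inactive, this coincides with the score of the Gaussian-smoothed empirical measure computed directly in $\mathbb{R}^d$, but every kernel evaluation happens in $k_i$ dimensions; that reduction is what the decomposition is designed to provide.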

Key Contributions and Further Reading

Key Contributions

Establishes an intrinsic-dimension sample-complexity guarantee for diffusion sampling on multi-modal union-of-subspaces distributions, with $\widetilde{O}(\varepsilon^{-(k \vee 2)})$ samples for $\varepsilon$ error in $W_1$. Shows that diffusion score estimation can avoid ambient-dimensional rates by reducing component score estimation to $k_i$-dimensional smoothed distributions after subspace recovery. Extends low-dimensional diffusion theory beyond single-subspace/manifold settings to heterogeneous multi-modal structure without requiring smooth densities, densities bounded away from zero, log-concavity, or Gaussian components. Provides a finite-sample $L^2$ score estimation analysis for a regularized kernel estimator and links it to end-to-end continuous-time reverse diffusion sampling guarantees. Positions the rate as near-minimax optimal up to logarithmic factors by comparing against intrinsic $k$-dimensional minimax learning rates.
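A quick back-of-the-envelope calculation shows what the $\varepsilon^{-(k \vee 2)}$ scaling means in practice. The sketch below drops all constants and logarithmic factors, so the numbers are orders of magnitude only; the function names are hypothetical, and plugging the ambient dimension into the same formula is an illustrative assumption used purely for contrast, not a rate quoted from the source.

```python
import math


def samples_needed(eps, dim):
    """Order of samples for W1 error eps at dimension dim: eps^{-(dim v 2)}.

    Constants and log factors are ignored (illustrative only).
    """
    return eps ** (-max(dim, 2))


def error_reached(n, dim):
    """Inverse relation: order of W1 error with n samples, n^{-1/(dim v 2)}."""
    return n ** (-1.0 / max(dim, 2))


if __name__ == "__main__":
    eps = 0.1
    for k in (1, 2, 3, 5):
        print(f"intrinsic k={k}: ~{samples_needed(eps, k):.0e} samples")
    # Hypothetical ambient-dimension scaling, for contrast only.
    print(f"ambient d=50: ~{samples_needed(eps, 50):.0e} samples")
    print(f"n=1e4, k=2: error ~{error_reached(1e4, 2):.3f}")
```

Because of the $k \vee 2$ exponent, intrinsic dimensions 1 and 2 give the same order, and the requirement only starts to grow with $k$ from $k = 3$ onward.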

Research Questions

How restrictive are the exact subspace recovery and separation/identifiability conditions needed by the subspace clustering step? Does the linear ambient-dimension prefactor in the W1 bound disappear under sharper analysis, or is some ambient dependence unavoidable for this target class? Can the kernel score estimator's normal-tangent decomposition be converted into a neural-network approximation result with comparable intrinsic-dimension sample complexity? How robust are the guarantees when the data are only near a union of subspaces rather than exactly supported on one?

Limitations and Uncertainties

Main setting is idealized: noiseless finite union of linear subspaces with subgaussian components. Exact subspace recovery assumption may limit practical applicability of the strongest bound. Kernel score estimator appears mainly proof-oriented, so implications for neural diffusion practice may be indirect.

Source Information

Full-text records: 1

References: 43

Last updated: 2026-05-30 13:21
