How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

摘要

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show both that common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This also includes determinantal (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that uniformly at random fixed-size subsets, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.

核心问题与主要方法

核心问题

How to measure and optimize the value of a training dataset subset for downstream learning performance

场景：Submodular data-appraisal over finite subsets using PSD kernel or linear-kernel embeddings, evaluated under fixed-size, class-balanced, and fixed-training-budget regimes

主要方法

Recasts dataset appraisal objectives as set functions over finite subsets and uses diminishing returns/submodularity to justify greedy optimization guarantees. Defines matrix spectral functions by applying a scalar spectral function phi to eigenvalues of a PSD subset kernel matrix and summing via trace. Uses the condition that -phi' is matrix monotone to obtain submodularity; relaxes this with weak matrix monotonicity to obtain zeta-weakly DR submodularity. Uses rank-1 eigenvalue updates via the secular equation, plus deflation and Householder-reflector handling for repeated or zero eigenvalues, to avoid repeated full eigendecompositions during greedy search. Empirically spans low-to-high objective values through direct maximization, heuristic minimization, random selection, and indirect proxy sampling to test whether appraisal scores predict downstream accuracy.

关键贡献与后续阅读

关键贡献

Shows that several neural scaling-law data-value formulations become submodular set functions when loss is converted to value, highlighting that these laws value data mostly through cardinality or domain counts. Places the Shannon/q=1 log Vendi score in a matrix spectral submodular framework alongside log-determinantal objectives, clarifying the shared spectral structure and the difference between Vendi entropy and DPP log-volume scoring. Introduces weakly matrix monotone functions as a sufficient route to weakly DR submodular matrix spectral objectives, extending usable objectives beyond fully matrix-monotone cases. Develops a secular-equation-based greedy evaluation scheme for matrix spectral objectives, giving an O(m) factor improvement per marginal query relative to repeated eigendecomposition in m-dimensional embeddings. Provides empirical evidence that direct Vendi maximization at high objective values can fail as a downstream-performance proxy, while facility location is more reliable under fixed-size, class-balanced, and fixed-budget regimes.

研究启发

How sensitive are the empirical conclusions to the embedding choice, especially normalized versus unnormalized gradients and modern foundation-model features? Can the Renyi-order Vendi scores be proven submodular or weakly DR submodular, or do counterexamples exist? Can the secular-update strategy be generalized to nonlinear kernels without losing the practical scaling advantage? Does facility location remain the best downstream proxy on larger language-model pretraining corpora where O(n^2) similarity storage is prohibitive?

限制与不确定性

Primary category is cs.LG and the main practical target is dataset valuation, so information-theory centrality is medium rather than high. Some important extensions are unresolved, including non-Shannon Renyi Vendi orders and general nonlinear-kernel scalability. The strongest empirical result favors facility location over the proposed spectral variants, which may reduce urgency for deep technical follow-up unless dataset appraisal is a priority.

参考文献

101 条

[1] P. K. Agarwal, S. Har-Peled, K. R. Varadarajan, et al. (2005) Geometric approximation via coresets . Combinatorial and computational geometry 52 ( 1 ), pp. 1–30 . Cited by: Appendix B .
[2] K. Audenaert, F. Hiai, and D. Petz (2010) Strongly subadditive functions . Acta Mathematica Hungarica 128 ( 4 ), pp. 386–394 . Cited by: Appendix F , Theorem 3.1 , Theorem 3.2 , §3 .
[3] S. A. Azcoitia and N. Laoutaris (2022) Try before you buy: a practical data purchasing algorithm for real-world data marketplaces . In Proceedings of the 18th International Conference on Emerging Networking Experiments and Technologies , Cited by: Appendix C .
[4] O. Bachem, M. Lucic, and A. Krause (2017) Practical coreset constructions for machine learning . arXiv preprint arXiv:1703.06476 . Cited by: Appendix C .
[5] J. Bendat and S. Sherman (1955) Monotone and convex operator functions . Transactions of the American Mathematical Society 79 ( 1 ), pp. 58–71 . Note: Good description of Loewner’s result and gives an easy test, from Loewner, to find a counterexample for function being matrix monotone Cited by: Appendix F , §3 .
[6] C. H. Bennett and P. W. Shor (2002) Quantum information theory . IEEE transactions on information theory 44 ( 6 ), pp. 2724–2742 . Cited by: Appendix J , §1 .
[7] R. Bhatia and T. Sano (2009) Loewner matrices and operator convexity . Mathematische Annalen 344 ( 3 ), pp. 703–716 . Note: Another very good description of Loewner’s result and gives an easy test, from Loewner, to find a counterexample for function being matrix monotone Cited by: Theorem F.1 , Theorem 3.2 .
[8] R. Bhatia (1997) Matrix analysis . Springer Science & Business Media . Cited by: Theorem 3.2 , §3 , §3 .
[9] R. Bhatia (2009) Positive definite matrices . In Positive Definite Matrices , Cited by: Appendix F , §3 , §3 .
[10] A. A. Bian, J. M. Buhmann, A. Krause, and S. Tschiatschek (2017) Guarantees for greedy maximization of non-submodular functions with applications . In International conference on machine learning , pp. 498–507 . Cited by: Appendix B .
[11] J. Bilmes (2022) Submodularity in machine learning and artificial intelligence . arXiv preprint arXiv:2202.00132 . Cited by: §K.7 , Appendix B , Appendix B , Appendix D , §1 , §3.2 .
[12] J. Bilmes (2026) Submarine: SUBModularity for ARtificial INtelligencE and machine learning . Note: Online Software System https://submarine.page Cited by: Appendix B .
[13] J. Bilmes and W. Bai (2017) Deep submodular functions . arXiv preprint arXiv:1701.08939 . Cited by: Appendix D .
[14] I. Bogunovic, J. Zhao, and V. Cevher (2018) Robust maximization of non-submodular objectives . In International Conference on Artificial Intelligence and Statistics , pp. 890–899 . Cited by: Theorem B.1 .
[15] L. Bottou (2007) Large-scale kernel machines . MIT press . Cited by: Appendix F , §3.2 .
[16] F. Brechenmacher (2014) Lagrange and the secular equation . Lettera Matematica 2 ( 1 ), pp. 79–91 . Cited by: §H.1.1 .
[17] J. R. Bunch, C. P. Nielsen, and D. C. Sorensen (1978) Rank-one modification of the symmetric eigenproblem . Numerische Mathematik 31 ( 1 ), pp. 31–48 . Cited by: §H.1.1 , §H.2 , §3.3 .
[18] L. Chen, G. Zhang, and E. Zhou (2018) Fast greedy map inference for determinantal point process to improve recommendation diversity . Advances in neural information processing systems 31 . Cited by: Appendix N , §3.3 .
[19] A. Das and D. Kempe (2011) Submodular meets spectral: greedy algorithms for subset selection, sparse approximation and dictionary selection . In Proceedings of the 28th International Conference on International Conference on Machine Learning , pp. 1057–1064 . Cited by: Definition B.1 , Appendix B .
[20] A. Das and D. Kempe (2018) Approximate submodularity and its applications: subset selection, sparse approximation and dictionary selection . Journal of Machine Learning Research 19 ( 3 ), pp. 1–34 . Cited by: Appendix B .
[21] J. W. Demmel (1997) Applied numerical linear algebra . SIAM . Cited by: §H.1.1 , §H.1.1 , §H.1.2 , §H.1.2 , §H.1.2 , §H.2 , Theorem H.2 , Theorem H.3 , §3.3 .
[22] N. Dey, G. Gosal, Zhiming, Chen, H. Khachane, W. Marshall, R. Pathria, M. Tom, and J. Hestness (2023) Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster . External Links: 2304.03208 , Link Cited by: §2 .
[23] A. Dharmasiri, W. Yang, P. Kirichenko, L. T. Liu, and O. Russakovsky (2025) The impact of coreset selection on spurious correlations and group robustness . In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track , Cited by: Appendix C .
[24] J. Ding and A. Zhou (2007) Eigenvalues of rank-one updated matrices with some applications . Applied Mathematics Letters 20 ( 12 ), pp. 1223–1226 . Cited by: §H.1.1 .
[25] M. El Halabi, F. Bach, and V. Cevher (2018) Combinatorial penalties: which structures are preserved by convex relaxations? . In International Conference on Artificial Intelligence and Statistics , pp. 1551–1560 . Cited by: Theorem B.1 .
[26] M. El Halabi, F. Bach, and V. Cevher (2018-09–11 Apr) Combinatorial penalties: which structures are preserved by convex relaxations? . In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics , A. Storkey and F. Perez-Cruz (Eds.) , Proceedings of Machine Learning Research , Vol. 84 , pp. 1551–1560 . External Links: Link Cited by: Appendix B .
[27] M. El Halabi and S. Jegelka (2020) Optimal approximation for unconstrained non-submodular minimization . In International Conference on Machine Learning , pp. 3961–3972 . Cited by: Appendix B , Definition 3.1 .
[28] S. Friedland and S. Gaubert (2011) Submodular spectral functions of principal submatrices of a hermitian matrix, extensions and applications . Linear Algebra and its Applications . Cited by: Appendix F , Theorem 3.1 , Theorem 3.2 , §3 .
[29] D. Friedman and A. B. Dieng (2022) The vendi score: a diversity evaluation metric for machine learning . arXiv preprint arXiv:2210.02410 . Cited by: §3.1 .
[30] D. Friedman and A. B. Dieng (2023) The vendi score: a diversity evaluation metric for machine learning . Transactions on Machine Learning Research . Cited by: Appendix E , §1 , §3.1 .