Matching Rates and Optimal Allocation for Federated Probe-Logit Distillation under Heterogeneous Bandwidth Budgets

摘要

In federated language modeling, $K$ nodes each hold $n$ samples but cannot pool data or exchange full-precision gradients or weights. We study the minimax rate at which a conditional distribution over $V$ tokens can be estimated when each node may upload at most $B$ bits per query in a public probe set. In federated probe-logit distillation (FPLD), each node transmits a scalar-quantized logit vector on the probe set, and an aggregator distills a global parametric student. Prior work (Dubey and Huo, 2026) establishes a high-probability KL rate $O(d/(Kn) + ρ\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ plus optimization slack, with the bandwidth term in its trace-sharpened form. Whether this bandwidth-term rate is tight, and how the upper bound generalizes to heterogeneous per-node bandwidths, are left open. We close both gaps. First, the dithered FPLD construction has a matching single-round lower bound $Ω(K^{-1} \cdot 2^{-2B/V})$ under non-degeneracy, pinning the bandwidth-axis rate at $Θ(K^{-1} \cdot 2^{-2B/V})$. $T$-round sequential refinement with nested/scaled residual quantizers achieves $O(K^{-1} \cdot 2^{-2TB/V})$; vanilla FPLD's $T$-independent bandwidth term is suboptimal for every $T > 1$. Second, we establish a heterogeneous-bandwidth upper bound for per-node budgets $B_i$, paired with a closed-form optimal allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$, a log-tilted water-filling rule that is the per-node analogue of reverse water-filling for distortion-rate optimization. A plug-in adaptive variant estimates the weights from a short warm-up phase and attains $1 + O(\sqrt{\log(K/δ)/(m T_0)})$ relative suboptimality. Synthetic n-gram simulations confirm that empirical KL is bracketed by the upper and lower bounds and that the optimal allocation strictly dominates uniform and inverse-weighted baselines under heterogeneous clipping.

核心问题与主要方法

核心问题

Estimate a token-level conditional distribution in federated language modeling under per-node uplink bit budgets, and characterize optimal bit allocation

场景：Federated probe-logit distillation on K nodes with n local samples, public probe set size m, vocabulary size V, and scalar-quantized logits under homogeneous or heterogeneous bandwidth budgets

主要方法

Uses the softmax-KL cumulant expansion together with the softmax-Hessian trace bound tr(H_soft) <= 1 to avoid a loose vocabulary-size factor in the bandwidth term. Models uplink communication as per-coordinate clipped scalar quantization with subtractive dithering, so independent mean-zero quantization errors average across K nodes and yield the K^-1 factor. Derives the single-round lower bound by composing dither variance with a lower trace condition under non-degeneracy, matching the trace-sharpened upper bound for the construction. Improves the multi-round rate through nested/scaled residual quantizers, effectively turning B/V bits per coordinate per round into TB/V effective bits after T rounds. Solves heterogeneous bit allocation by minimizing a weighted sum of exponential distortion terms, yielding a Lagrangian log-tilted water-filling allocation. Builds an adaptive plug-in variant by estimating per-node second-moment weights from warm-up quantized probe logits and applying concentration plus a relative-error transfer argument.

关键贡献与后续阅读

关键贡献

Pins down the single-round bandwidth-axis rate for dithered FPLD as Theta(K^-1 2^-2B/V) by matching the prior trace-sharpened upper bound with a construction-level lower bound. Shows that vanilla multi-round FPLD is bandwidth-suboptimal when T > 1 and that nested/scaled residual quantization can improve the bandwidth term to O(K^-1 2^-2TB/V). Extends the homogeneous FPLD KL upper bound to heterogeneous per-node bandwidths with bandwidth contribution (c3/K^2) sum_i 2^-2B_i/V. Derives a closed-form optimal allocation rule B_i* = B_tot/K + (V/2) log2(w_i / w_g) for weighted heterogeneous quantization distortion. Provides a two-stage adaptive allocation procedure that estimates weights from warm-up probe-logit transmissions and achieves near-optimal relative bandwidth distortion under the stated concentration assumptions.

研究启发

Does the lower bound extend beyond the dithered scalar-quantized FPLD protocol class to arbitrary B-bit encoders, vector quantizers, or interactive protocols? How sensitive is the water-filling allocation to misspecified or non-stationary weights in realistic federated language-model deployments? Do the rate improvements from sequential residual quantization survive real local training noise, non-i.i.d. nodes, and imperfect distillation fits? Can the trace-sharpened constants and small-error conditions be verified or estimated in practical LM settings?

限制与不确定性

Lower bound is for the dithered FPLD construction rather than an unconditional minimax lower bound over all B-bit protocols. Empirical validation is limited to synthetic n-gram simulations, with real language-model evaluation and robustness/privacy outside scope. Importance depends partly on how central FPLD becomes, since the result is specialized to a particular federated distillation setting.

参考文献

8 条

J. Acharya, C. L. Canonne, and H. Tyagi (2020) Inference under information constraints i: lower bounds from chi-square contraction . IEEE Transactions on Information Theory 66 ( 12 ), pp. 7835–7855 . External Links: Document Cited by: §1 , §7 .
P. Dubey and X. Huo (2026) Federated language models under bandwidth budgets: distillation rates and conformal coverage . External Links: 2605.09986 , Link Cited by: §B.5 , Appendix B , §1 , §1 , §3 .
A. Garg, T. Ma, and H. L. Nguyễn (2014) On communication cost of distributed statistical estimation and dimensionality . In Advances in Neural Information Processing Systems , Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger (Eds.) , Vol. 27 . External Links: Link Cited by: §1 , §7 .
C. Huang and X. Huo (2019) A distributed one-step estimator . Mathematical Programming 174 ( 1–2 ), pp. 41–76 . External Links: Document Cited by: §1 , §7 .
Y. Li, X. Wang, W. Xu, H. Wang, Y. Qi, J. Dong, and R. Li (2025) Feature distillation is the better choice for model-heterogeneous federated learning . In Advances in Neural Information Processing Systems , D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen (Eds.) , Vol. 38 , pp. 104726–104744 . External Links: Link Cited by: §7 .
T. Lin, L. Kong, S. U. Stich, and M. Jaggi (2020) Ensemble distillation for robust model fusion in federated learning . In Advances in Neural Information Processing Systems , H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.) , Vol. 33 , pp. 2351–2363 . External Links: Link Cited by: §7 .
Y. Zhang, J. C. Duchi, and M. J. Wainwright (2013) Communication-efficient algorithms for statistical optimization . Journal of Machine Learning Research 14 ( 104 ), pp. 3321–3363 . External Links: Link Cited by: §1 , §7 .
Supplementary Material