Model Merging Scaling Laws in Large Language Models


Model Merging · Scaling Laws · Large Language Models
Apr 30, 2026
Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang
Model merging scaling law across Average, Task Arithmetic, TIES, and DARE (one fitted panel per method). Dots are measured results; dotted lines are unified-law fits.


Abstract

This work studies empirical scaling laws for language model merging, measured by cross-entropy. Although model merging is widely used in practice, it lacks a quantitative rule that predicts returns as more experts are added or as the base model scales.

The paper identifies a compact power law that links model size and expert count. The size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, fits measured curves across diverse architectures and methods, and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included.

The resulting law enables predictive planning: estimating how many experts are needed to reach a target loss, deciding when to stop adding experts, and trading off scaling the base model versus adding experts under a fixed budget.

Introduction

Large language models are often specialized by fine-tuning on different domains, producing multiple domain experts. Model merging combines these experts in weight space to synthesize a single model without retraining. It supports modular pipelines, can approximate joint training at a fraction of the cost, and enables composition under privacy or compute constraints.

However, merging is still largely empirical. Practitioners experiment with expert subsets, orders, and normalization rules, often at substantial computational expense. Unlike pretraining, where scaling laws guide tradeoffs among model size, data, and compute, merging has lacked a quantitative account that predicts convergence as experts are added.

This paper introduces a predictive merging scaling law that couples base model size N with the number of merged experts k:

\mathbb{E}[L\mid N,k] = \underbrace{L_* + B N^{-\beta}}_{\text{floor }L_\infty(N)} + \underbrace{\frac{A_0 N^{-\gamma}}{k+b}}_{\text{merging tail}}

Larger base models lower the size-dependent floor and shrink the tail amplitude. Adding experts yields steep early improvements that taper roughly as 1/(k+b).
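As a toy illustration of how the two terms interact, the law can be evaluated directly; the constants below are hypothetical placeholders, not the paper's fitted values.

```python
def expected_loss(N, k, L_star=0.4, B=0.3, beta=0.35, A0=0.1, gamma=0.1, b=0.25):
    """Floor-plus-tail law E[L | N, k] = L* + B*N^(-beta) + A0*N^(-gamma)/(k + b).
    All constants are illustrative placeholders, not fitted values."""
    floor = L_star + B * N ** -beta        # size-dependent floor
    tail = A0 * N ** -gamma / (k + b)      # merging tail in expert count
    return floor + tail
```

With any such constants, scaling N lowers both terms, while adding experts only shrinks the tail toward the floor.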

Contributions

  • Unified scaling law: a compact floor-plus-tail law links base size and expert count, and applies consistently in in-domain and cross-domain settings.
  • Large-scale validation: the law is validated over 10,866 merged models, seven Qwen2.5 sizes from 0.5B to 72B, nine domains, and four merging methods.
  • Theory: a leading-order inverse-k tail and variance contraction are derived under equal-normalized composition of effective updates.
  • Operational recipe: a lightweight three-point fitting procedure predicts the full merge curve and recommends an efficient expert count for budget-aware planning.

Background and Setup

Model Merging

Model merging integrates independently trained models into a single model by aggregating parameters. The paper studies:

  • Average: direct equal-weight averaging of task vectors;
  • Task Arithmetic (TA): scaled task-vector composition;
  • TIES: trimming, electing, and disjoint merging to reduce interference;
  • DARE: random masking and rescaling of task vectors.

The unified view is:

\theta = \theta_0 + \sum_{i\in K}\alpha_{i,k}\Psi(v_i), \qquad \sum_{i\in K}\alpha_{i,k}=c

Here, θ₀ is the base model, v_i is the task vector for expert i, K is a selected subset of k experts, and Ψ is the method-specific preprocessing map.
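A minimal sketch of this unified update in code, with experts represented as plain parameter lists and an identity preprocessing map by default (names are illustrative, not from the paper's implementation):

```python
def merge(theta0, task_vectors, c=1.0, psi=lambda v: v):
    """Unified merge: theta = theta0 + sum_i alpha_i * psi(v_i),
    with equal-normalized weights alpha_i = c / k summing to c."""
    k = len(task_vectors)
    alpha = c / k
    theta = list(theta0)
    for v in task_vectors:
        pv = psi(v)
        for j in range(len(theta)):
            theta[j] += alpha * pv[j]
    return theta
```

With c = 1 this is plain averaging of task vectors; c = 0.8 recovers the Task Arithmetic scaling from the paper's recipe table.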

Expert Models and Data

The controlled experiments start from Qwen2.5 base models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. The authors train nine domain specialists using data from Mixture-of-Thoughts and OpenScience:

  • mathematics: algebra, analysis, discrete mathematics and combinatorics, geometry and topology, number theory;
  • science: biology, physics, chemistry;
  • code.

Evaluation uses token-level cross-entropy. For each domain, 30M held-out tokens are scored and averaged. For each k, the expected merge loss is computed over all possible k-expert subsets when feasible, or over a large uniform sample for larger models.

Scaling Law Results

Expected Loss Construction

For a fixed base size N and expert count k, there are C(M, k) possible expert subsets. Each subset can yield a distinct merged model. The expected merge loss is:

\widehat{\mathbb{E}}[L\mid N,k] = \frac{1}{S_{N,k}}\sum_{s=1}^{S_{N,k}} L(N,k,s)

Individual subset losses vary, but the per-k mean forms a smooth curve with diminishing returns.
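The per-k expectation can be reproduced in a few lines; `loss_of_subset` stands in for the expensive merge-and-evaluate step and is purely hypothetical here.

```python
from itertools import combinations

def expected_merge_loss(experts, k, loss_of_subset):
    """Average the merge loss over all C(M, k) expert subsets of size k."""
    subsets = list(combinations(experts, k))
    return sum(loss_of_subset(s) for s in subsets) / len(subsets)

# Toy stand-in for measuring a merged model's cross-entropy (hypothetical).
def toy_loss(subset):
    return 0.5 + 0.2 / len(subset)
```

For larger models, the paper replaces the exhaustive enumeration with a uniform sample of subsets.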


Empirical construction of expected loss and in-domain scaling on Qwen2.5 models.

Unified Empirical Law

The expected loss follows:

\mathbb{E}[L\mid N,k] = L_\infty(N)+\frac{A(N)}{k+b}, \qquad b\ge 0

The model-size dependencies are:

L_\infty(N)=L_*+B N^{-\beta}, \qquad A(N)=A_0 N^{-\gamma}

This gives a simple interpretation:

  • bigger models lower the asymptotic floor;
  • bigger models shrink the remaining tail;
  • adding experts gives early gains that rapidly diminish.

Across methods and settings, the paper reports near-unity fits, with R² > 0.98 over fitted points.

Merging Versus Multitask SFT


Merging approaches multitask SFT performance while using negligible GPU-hours.

The paper directly compares merging with multitask SFT under normalized loss and GPU-hours. On a 72B model with nine domains, multitask SFT costs roughly 1300 H800 GPU hours, while merging costs less than one GPU-hour for the reported methods.

Theory

The paper explains the inverse-k tail with an average-case argument. Under equal normalization, merging corresponds to averaging task update vectors. As k increases, the variance of the averaged update shrinks as 1/k. A second-order Taylor expansion of the loss converts this variance reduction into an expected-loss improvement of the same order.

For each fixed model size N, the theorem states:

\mathbb{E}[L\mid N,k] = L_\infty(N)+\frac{A(N)}{k}+\mathcal{O}_N(k^{-3/2})

with:

L_\infty(N) = L(\theta_0;N)+c\,g^\top\mu+\frac{1}{2}c^2\mu^\top H\mu

and:

A(N) = \frac{1}{2}c^2\,\mathrm{Tr}(H\Sigma)

Here, μ and Σ are the mean and covariance of task updates in the merged subspace. For TIES and DARE, the same analysis is applied to effective updates after method-specific preprocessing.

The variance result is:

\mathrm{Var}(L(\theta_0+\Delta\theta_k;N)) = \Theta\!\left(\frac{1}{k}\right), \qquad \mathrm{sd} = \mathcal{O}\!\left(\frac{1}{\sqrt{k}}\right)

This explains why merging more experts improves not only average performance but also reliability.
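The 1/k contraction is easy to check numerically: the across-subset variance of an equal-weight average of k i.i.d. scalar "updates" falls by roughly 4× when k quadruples. This is a toy simulation, not the paper's experiment.

```python
import random
import statistics

def averaged_update_variance(k, trials=20000, seed=0):
    """Variance, across sampled subsets, of the equal-weight average
    of k i.i.d. unit-variance scalar task updates."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, 1.0) for _ in range(k)) / k
             for _ in range(trials)]
    return statistics.variance(means)
```

The paper's argument then pushes this variance through a second-order Taylor expansion to obtain the A(N)/k expected-loss tail.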

Core Findings

Larger Models Are Easier to Merge


Per-domain floors and tail amplitudes as functions of model size.

At fixed k, larger models have lower CE and need fewer experts to approach the floor. At k = 9, domain-averaged CE drops from 0.739 at 0.5B to 0.430 at 32B, a 41.9% reduction.


Most gains arrive early: k=5 and k=6 cross the 85% and 90% return thresholds.

The paper finds that k = 5 and k = 6 reach about 85% and 90% of the measured improvement. Roughly 60% of the nine-expert pool is enough to recover over 90% of the gains.

Mixing Domains Helps Generalization

Cross-domain merging follows the same law as in-domain merging. Gains are monotone in k, steep early, and flatten into a 1/(k+b) tail. Diverse donor domains reduce domain-specific bias and help pooled generalization under the same scaling form.

Method Gaps Shrink at Scale

Panels: mean CE versus k at 32B; merge-to-merge variance versus k at 32B; mean CE versus model size at k = 3.

Method sensitivity diminishes at scale: small early-k gaps narrow quickly, while variance contracts for all methods.

At larger k and N, method gaps compress quickly. Early advantages for TA or TIES at small k shrink to a tight band by k ≈ 8, with differences below about 2%. Variance shows similar convergence.

Practical Recipe

Three Points Predict the Curve

The paper fits:

L(k)=L_\infty(N)+\frac{A(N)}{k+b}

using only three early points: k ∈ {1, 2, 4}. This forecast closely tracks the full k ∈ {1, …, 9} trajectory.


Fitting on k=1,2,4 closely predicts the full k-curve.


Forecast errors stay low and recommended k concentrates around small values.

This turns merge planning into a lightweight measurement problem: evaluate a few early expert counts, fit the curve, then decide when additional experts no longer justify their cost.
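With the specific points k = 1, 2, 4, the three-parameter curve even has a closed-form solution, so no iterative fitting is needed; this is a sketch of one possible fitter, not necessarily the paper's procedure.

```python
def fit_three_points(y1, y2, y4):
    """Closed-form fit of L(k) = L_inf + A/(k + b) through measurements
    at k = 1, 2, 4. Uses the difference ratio
    (y1 - y2)/(y2 - y4) = (4 + b) / (2 * (1 + b)) to isolate b."""
    r = (y1 - y2) / (y2 - y4)
    b = (4 - 2 * r) / (2 * r - 1)
    A = (y1 - y2) * (1 + b) * (2 + b)
    L_inf = y1 - A / (1 + b)
    return L_inf, A, b

def predict(L_inf, A, b, k):
    return L_inf + A / (k + b)
```

The ratio step works because y1 − y2 = A/((1+b)(2+b)) and y2 − y4 = 2A/((2+b)(4+b)), so their ratio depends only on b.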

Merge Order Matters Less as k Grows

Panels: CE distribution across merge orders; across-order standard deviation heatmap; worst-best range at representative sizes.

Order sensitivity contracts rapidly as k increases under DARE.

Under DARE, across-order dispersion shrinks rapidly with k. Once k ≥ 6, the spread from merge order is small compared with early method gaps and the scaling-law floor.

The Law Transfers Across Backbones

Panels: LLaMA macro CE versus k; LLaMA marginal gain versus k.

The same inverse-tail law holds on LLaMA-3.2 3B and LLaMA-3 8B.

On LLaMA-3.2 3B and LLaMA-3 8B, macro CE again follows the same inverse-tail law. The paper reports R² = 0.999 for LLaMA-3.2 3B and R² = 0.995 for LLaMA-3 8B.

Appendix: Reproducibility

All models and datasets used in the study are publicly available. The paper describes the data sources, methodological choices, and evaluation protocol in the main setup section, then gives extra implementation details and hyperparameters in the appendix. The complete source code is provided as supplementary material, and the checkpoints are planned for release.

Statement of LLM Usage

The authors report using an LLM only as an editing tool for syntax correction and stylistic polishing. The LLM was not used to generate or revise the central research ideas, design experiments, or organize the paper.

Model Merging Recipes

The paper represents all merging methods with:

\theta = \theta_0 + \sum_{i\in K}\alpha_{i,k}\Psi(v_i), \qquad \sum_{i\in K}\alpha_{i,k}=c
Method             Effective update Ψ(v)          c     α     Extra parameters
Average            v                              1     1/k   none
Task Arithmetic    v                              0.8   1/k   none
TIES               trim, elect, disjoint merge    1     1/k   density d = 1.0
DARE               m ⊙ v / (1 − p)                1     1/k   drop rate p = 0.2

This table is the implementation-level bridge between the empirical law and specific merging algorithms: Average and TA differ mainly by scale, TIES modifies the effective update by sparsifying and resolving signs, and DARE uses random masking plus rescaling before the normalized merge.
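The two nontrivial Ψ maps can be sketched in a few lines; the second helper implements only TIES's trim stage (electing and disjoint merging are omitted), and both are illustrative rather than the paper's implementation.

```python
import random

def dare_psi(v, p=0.2, seed=0):
    """DARE preprocessing: drop each coordinate with rate p and rescale
    the survivors by 1/(1 - p), keeping the update unbiased in expectation."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in v]

def ties_trim(v, density=1.0):
    """TIES trim stage (sketch): keep the top-`density` fraction of
    coordinates by magnitude; density 1.0, as in the table, keeps all."""
    keep = max(1, round(density * len(v)))
    cutoff = sorted((abs(x) for x in v), reverse=True)[keep - 1]
    return [x if abs(x) >= cutoff else 0.0 for x in v]
```

Either map is then passed through the same equal-normalized combination as Average and TA.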

Detailed Theory

The appendix proves the inverse-tail law for a fixed model size N. The proof assumes the loss is twice continuously differentiable near the base model, the Hessian is Lipschitz, task vectors have finite sixth moment, and equal-normalized weights satisfy α_{i,k} = c/k.

Let:

\Delta\theta_k(S)=\sum_{i\in S}\frac{c}{k}v_i=c\mu+\varepsilon_k(S), \qquad \varepsilon_k(S)=\frac{c}{k}\sum_{i\in S}(v_i-\mu)

The mean-corrected step has:

\mathbb{E}[\varepsilon_k]=0, \qquad \mathbb{E}[\varepsilon_k\varepsilon_k^\top]=\frac{c^2}{k}\Sigma, \qquad \mathbb{E}\|\varepsilon_k\|^3=\mathcal{O}(k^{-3/2})

Taylor expanding at θ₀ + cμ gives:

L(\theta_0+c\mu+\delta) = L(\theta_0+c\mu)+a^\top\delta+\frac{1}{2}\delta^\top H_S\delta+R_S(\delta), \qquad |R_S(\delta)|\le\frac{M}{6}\|\delta\|^3

Substituting δ = ε_k(S) and taking the expectation removes the linear term and leaves:

\mathbb{E}[L(\theta_k(S))] = L(\theta_0+c\mu)+\frac{1}{2}c^2\,\mathrm{Tr}(H_S\Sigma)\frac{1}{k}+\mathcal{O}(k^{-3/2})

The main text writes the intercept and tail at the base point using a PSD curvature surrogate H:

L_\infty(N)=L(\theta_0)+c\,g^\top\mu+\frac{1}{2}c^2\mu^\top H\mu, \qquad A(N)=\frac{1}{2}c^2\,\mathrm{Tr}(H\Sigma)

The appendix then bounds the base-point approximation error by the Hessian Lipschitz constant, ‖μ‖, and the curvature-surrogate mismatch. At the empirical granularity used in the paper, these are absorbed into the fitted L∞(N) and A(N), yielding:

\mathbb{E}[L\mid N,k] = L_\infty(N)+\frac{A(N)}{k}+\mathcal{O}_N(k^{-3/2})

For variance, the same expansion decomposes:

L(\theta_k(S))=C+a^\top\varepsilon_k+\frac{1}{2}\varepsilon_k^\top H_S\varepsilon_k+R_S(\varepsilon_k)

The leading linear term contributes:

\mathrm{Var}(a^\top\varepsilon_k)=\frac{c^2}{k}\,a^\top\Sigma a

The quadratic term contributes O(k⁻²), the remainder contributes O(k⁻³), and the covariance terms are lower order. Thus, in the non-degenerate case:

\mathrm{Var}(L(\theta_k(S)))=\Theta(1/k), \qquad \mathrm{sd}(L(\theta_k(S)))=\mathcal{O}(k^{-1/2})

If the linear term degenerates, the proof gives a tighter O(k⁻²) variance bound, with tightness when the quadratic fluctuation is nonzero on the covariance subspace.

Expert Model Details

The expert models are trained with a shared recipe so that the scaling study isolates merging behavior rather than changing expert capacity. The hyperparameters are:

Hyperparameter            Value
Batch size                16
Learning rate             1 × 10⁻⁵
Warmup ratio              0.05
Epochs                    2
Maximum sequence length   16,384
Optimizer                 Adam with offloading
Precision                 bfloat16
Gradient checkpointing    enabled
ZeRO stage                3

Evaluation uses token-level cross-entropy. For each domain, 30M validation tokens are sampled, and the overall loss is the token-weighted average negative log-likelihood across domains:

\mathcal{L}_{\text{overall}} = -\frac{1}{\sum_{i\in\mathcal{M}}T_i}\sum_{i\in\mathcal{M}}\sum_{t=1}^{T_i}\log p_\theta(x_t\mid x_{t-1},\ldots,x_1)

For every k, there are C(|M|, k) possible expert selections, and each selection may produce a different merged model. The expected loss is therefore computed over all subsets when feasible, and by sampling when the model size makes exhaustive merging too expensive.
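The overall score is then a token-weighted pool of per-domain CE values; a tiny helper makes the weighting explicit (names are hypothetical).

```python
def overall_loss(domain_losses, token_counts):
    """Token-weighted pooling across domains:
    L_overall = sum_i T_i * L_i / sum_i T_i."""
    total = sum(token_counts)
    return sum(L * T for L, T in zip(domain_losses, token_counts)) / total
```

A domain with more validation tokens moves the pooled CE proportionally more.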

The authors explicitly distinguish this weight-space merging study from model-fusion methods such as InfiGFusion and InfiFPO, which require data and additional training.

Sampling Algorithm

For large models, the paper samples diverse merge permutations instead of enumerating every possible subset/order. The algorithm initializes with the canonical sequence [1, …, 9], adds the reverse sequence when k ≥ 2, then repeatedly samples 1000 random candidate permutations and keeps the one maximizing its minimum Hamming distance to the already selected set.

import random

def diverse_permutations(k, s=tuple(range(1, 10))):
    # Start from the canonical order [1, ..., 9]; add its reverse when k >= 2.
    selected = [list(s)]
    if k >= 2:
        selected.append(list(reversed(s)))
    # Greedily add the sampled candidate whose minimum Hamming distance
    # to the already selected permutations is largest.
    hamming = lambda p, q: sum(a != b for a, b in zip(p, q))
    while len(selected) < k:
        candidates = [random.sample(s, len(s)) for _ in range(1000)]
        selected.append(max(candidates, key=lambda p: min(hamming(p, q) for q in selected)))
    return selected

The sampled curves closely match full merging combinations on the 0.5B model.

Expert Post-Training Scaling

The paper also studies the scaling behavior of the domain expert models before merging. It relates expert loss to model size, training tokens, and compute budget. Larger models and larger post-training compute generally improve expert performance, consistent with standard language-model scaling laws.


Expert post-training scaling: loss improves with model size and compute, but domains have different intrinsic loss levels.

The appendix highlights a domain-specific difference: Biology has substantially higher loss than Geometry under comparable training conditions, suggesting that different domains have different pre-existing knowledge reserves and may create heterogeneous merge dynamics.

Empirical Construction Details

The expected merge curve is built by plotting individual subset losses as light points and the per-k mean as the fitted curve. As k grows, individual losses still vary by subset, but the scatter narrows and the average remains smooth.

Panels: Average merging empirical construction at 0.5B and 32B; DARE merging empirical construction at 0.5B and 32B.

Representative empirical construction cases for Average and DARE at small and large base-model sizes.

In-Domain Fit Details

For each domain d, the appendix fits:

\mathbb{E}[L_d\mid N,k] = L_{*,d}+B_d N^{-\beta_d}+\frac{A_{0,d} N^{-\gamma_d}}{k+b_d}

The floor term is summarized as L_{∞,d}(N) = L_{*,d} + B_d·N^(−β_d), and the tail as A_d(N) = A_{0,d}·N^(−γ_d). Floors show tight power-law fits, with exponents clustered around 0.33 to 0.42 and R² near 0.98 to 0.99. Tails are smaller and noisier, with Code showing the clearest decay with model size.

Domain      b̂      Â₀      γ̂       R²(A)   L̂*      B̂       β̂      R²(L)
algebra     0.000   0.0460  -0.004  -0.002  0.1724  0.1248  0.379  0.983
analysis    0.000   0.0462  +0.009  +0.009  0.1793  0.1255  0.417  0.990
biology     0.125   0.1741  -0.006  +0.007  0.6227  0.6338  0.362  0.988
chemistry   0.075   0.1317  -0.006  +0.008  0.4924  0.5639  0.331  0.988
code        0.250   0.0682  +0.115  0.556   0.2705  0.2238  0.378  0.986

At k = 9 under Average merging, macro CE decreases from 0.739 at 0.5B to 0.430 at 32B, a 41.9% reduction. The k_ε calculation illustrates why model size and tail amplitude should be considered separately: Code needs about eight experts at 0.5B and five experts at 32B for ε = 0.01, while Biology has a nearly flat tail and needs roughly 18 experts despite the floor improving with N.
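Inverting the tail gives the expert budget directly: k_ε is the smallest k with A(N)/(k + b) ≤ ε, i.e. k ≥ A(N)/ε − b. Plugging in the Code-domain fits from the in-domain table (and assuming N is measured in billions of parameters, which is an assumption about the fit's units) reproduces the quoted budgets.

```python
import math

def k_eps(A0, gamma, b, N, eps):
    """Smallest k whose merging tail A(N)/(k + b) falls below eps,
    with A(N) = A0 * N**(-gamma). N is assumed to be in billions."""
    A = A0 * N ** (-gamma)
    return math.ceil(A / eps - b)
```

With the Code fits (Â₀ = 0.0682, γ̂ = 0.115, b̂ = 0.250), this yields eight experts at 0.5B and five at 32B for ε = 0.01.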

Fractional Return and Expert Budget

The appendix computes the fraction of realized improvement:

R(N,d,k)=\frac{L(N,d,1)-L(N,d,k)}{L(N,d,1)-L(N,d,k_{\max})}

using a monotone envelope of the measured CE curve. Median return reaches 85% at k = 5 and 90% at k = 6, with k₉₀ concentrated in {5, 6} across domains and sizes.
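The envelope-and-threshold computation is mechanical; this sketch mirrors the definition with a running-minimum envelope (not the paper's exact code).

```python
def fractional_return(losses):
    """R(k) on a monotone running-minimum envelope of the CE curve,
    for losses indexed k = 1 .. k_max."""
    env, best = [], float("inf")
    for L in losses:
        best = min(best, L)
        env.append(best)
    total = env[0] - env[-1]
    return [(env[0] - L) / total for L in env]

def k_at_return(losses, target=0.90):
    """Smallest k whose fractional return reaches `target`."""
    return next(i + 1 for i, r in enumerate(fractional_return(losses)) if r >= target)
```

On a synthetic floor-tail curve with b = 0.5, the 85% and 90% thresholds land at k = 5 and k = 6, matching the reported medians.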

Panels: median fractional return with IQR band; k₉₀ heatmap across domains and sizes.

Most of the gain comes from the first few experts; k=6 recovers about 90% of the realized improvement.

The diminishing return follows from the floor-tail law: the marginal gain scales approximately as A(N)/[(k+b)(k+1+b)], so returns decay roughly as k⁻² after the first few experts.
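The marginal-gain formula is an exact algebraic identity for the fitted form; it can be checked with rational arithmetic (the constants here are illustrative, not fitted values).

```python
from fractions import Fraction

# Illustrative floor-tail parameters as exact rationals.
L_inf, A, b = Fraction(1, 2), Fraction(3, 10), Fraction(1, 4)

def L(k):
    return L_inf + A / (k + b)

# Differencing the floor-tail law gives the marginal gain exactly:
#   L(k) - L(k+1) = A / ((k + b) * (k + 1 + b))
gains = [L(k) - L(k + 1) for k in range(1, 9)]
```

Each successive gain shrinks by roughly the ratio (k+b)/(k+2+b), so the sequence decays on the order of k⁻².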

Cross-Domain Fit Details

The cross-domain appendix fits the same functional form to pooled CE and to method-specific curves. Average, TA, TIES, and DARE all follow:

L(N,k)=L_\infty(N)+\frac{A(N)}{k+b}

For TIES, the appendix notes that the strongest nonlinear cases may be better captured with a small bounded interference term:

D(N)\,\frac{k}{k+q}

The reported headline pattern remains unchanged: most pooled improvement is achieved by k ≤ 6, method differences narrow as k increases, and scaling N lowers both the floor and the tail.

The appendix also analyzes whether a larger candidate pool helps. With 7- and 8-domain pools, the fitted floor and tail remain stable and the qualitative law is unchanged.

Panels: floor and tail for candidate pools of size 7 and size 8.

Candidate-pool variants keep the same floor-plus-tail behavior.

Downstream Metrics

To test whether CE trends transfer to task quality, the paper evaluates merged checkpoints on math, reasoning, multilingual, coding, and safety benchmarks. For each backbone and merge method, it evaluates all expert subsets for k ∈ {1, …, 5}, normalizes metrics so larger is better, and reports mean accuracy averaged across benchmarks and subsets.

Backbone        Method   k=1     k=2     k=3     k=4     k=5
LLaMA-3.1 8B    TA       0.411   0.443   0.456   0.462   0.469
LLaMA-3.2 3B    TA       0.375   0.386   0.388   0.389   0.388
Gemma-2 2B      TA       0.492   0.503   0.506   0.507   0.507
LLaMA-3.1 8B    TIES     0.388   0.414   0.426   0.436   0.436

The downstream trend mirrors the CE law: performance improves as more experts are merged, then saturates. LLaMA-3.1 8B with TA rises from 0.411 to 0.469 from k = 1 to k = 5, while LLaMA-3.2 3B shows a shallower tail and small fluctuations near the plateau. The appendix attributes those fluctuations to benchmark variance rather than a systematic collapse.

The detailed downstream tables cover five specialized experts: math, code, multilingual, safety, and instruction following. Benchmarks include MATH-500, GSM8K, MBPP+, HumanEval+, IFEval, ARC, HellaSwag, MMLU, multilingual overall, and safety.

Scaling With 16 Domains

The appendix extends the cross-domain experiment to a 16-domain pool on the LLaMA3-3B-Instruct backbone. It starts with the original nine domains and adds Japanese, medical, house-arrangement, Korean, emotion, elementary school mathematics, and Java code.

For k ∈ {2, 4, 6, 8, 10, 12, 14, 16}, the paper samples random TA merges and evaluates CE, variance, and standard deviation across subsets. The macro-average keeps the same pattern:

Statistic          k=2     k=4     k=6     k=8     k=10    k=12    k=14    k=16
Overall CE         0.7774  0.7331  0.7051  0.6874  0.6685  0.6603  0.6509  0.6437
Overall variance   0.0009  0.0017  0.0021  0.0018  0.0012  0.0006  0.0024  —
Overall std        0.0310  0.0418  0.0461  0.0424  0.0357  0.0251  0.0156  —

At k = 16 the pool has only a single subset, so across-subset dispersion is not defined.

The larger domain pool does not change the qualitative behavior: CE drops as more experts are merged, gains flatten at larger kk, and subset variability generally contracts.

Cross-Domain Synergy

The paper measures donor-receiver interactions by adding one expert at a time and recording the marginal CE change on every evaluation domain. This yields a 9×9 synergy matrix S_{d→e}.
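Assembling the matrix from measured marginal changes is straightforward; the interaction function below is a toy stand-in for the measured CE deltas, with made-up numbers that merely echo the reported block structure.

```python
def synergy_matrix(domains, ce_delta):
    """S[d][e]: marginal CE improvement on evaluation domain e when donor
    expert d is added to the merge. `ce_delta` maps (donor, eval) pairs
    to measured deltas; here it is a hypothetical toy function."""
    return {d: {e: ce_delta(d, e) for e in domains} for d in domains}

# Toy pattern: same-block donors help, cross-block transfer is mildly negative.
blocks = {"biology": "science", "physics": "science", "algebra": "math"}

def toy_delta(d, e):
    if d == e:
        return 0.0
    return 0.07 if blocks[d] == blocks[e] else -0.02
```

Row and column scans of such a matrix are what drive the donor-selection advice in this section.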

Panels: cross-domain synergy heatmap under DARE at 32B; top positive and negative donor-receiver pairs.

DARE at 32B shows structured synergy: science helps science, math helps math, and cross-block transfers are weaker or mildly negative.

Under DARE at 32B, science-to-science pairs are strongly positive, math-to-math pairs are moderately positive, and cross-block interactions are weakly negative. Code mildly helps Discrete and Geometry. The strongest off-diagonal positives are Biology to Chemistry (+0.076), Physics to Biology (+0.074), Physics to Chemistry (+0.068), Chemistry to Biology (+0.066), and Biology to Physics (+0.054). The largest negatives include Algebra to Physics (-0.026), Geometry to Chemistry (-0.020), Discrete to Chemistry (-0.018), Algebra to Biology (-0.016), and Number Theory to Biology (-0.015).

For practical selection, the appendix suggests prioritizing Physics, Biology, and Chemistry donors for science targets, and staying within the math block or including Code for math targets.

Additional Order and Backbone Details

Across-order dispersion is measured from DARE merge permutations. The paper fits:

\mathrm{Std}_{\text{order}}(N,k) = c_0(N)+\frac{c_1(N)}{k+b}

Order sensitivity drops rapidly from k = 1 to k = 8. For example, at 32B, mean CE changes from 0.5207 at k = 1 to 0.4634 at k = 8, while the across-order std drops from 0.0313 to 0.0060 and the range drops from 0.0865 to 0.0148.

On LLaMA backbones, the appendix fits:

Backbone        R²       b        L∞       A        L(k=1)   L(k=9)
LLaMA-3.2 3B    0.9989   0.6875   0.7137   0.0783   0.7599   0.7221
LLaMA-3 8B      0.9955   0.0000   0.7252   0.0573   0.7837   0.7325

This supports the claim that the same inverse-tail law transfers beyond Qwen2.5.
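As a sanity check, the fitted parameters in the table reproduce the tabulated endpoint losses to within about 10⁻³:

```python
def L(k, L_inf, A, b):
    """Floor-tail curve L(k) = L_inf + A / (k + b)."""
    return L_inf + A / (k + b)

# Fitted parameters and measured endpoints from the LLaMA table.
fits = {
    "LLaMA-3.2 3B": ((0.7137, 0.0783, 0.6875), (0.7599, 0.7221)),
    "LLaMA-3 8B":   ((0.7252, 0.0573, 0.0000), (0.7837, 0.7325)),
}
for name, ((L_inf, A, b), (y1, y9)) in fits.items():
    assert abs(L(1, L_inf, A, b) - y1) < 2e-3
    assert abs(L(9, L_inf, A, b) - y9) < 2e-3
```

Residuals at this level are consistent with the reported R² values.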

Additional Method-Size Slices

The appendix includes method comparisons across several fixed expert counts. These show the same compression of method gaps as model size and expert count increase.

Panels: method comparison by model size at k = 1, 2, 4, and 8.

Method comparison slices at fixed k.

Impact Statement

This work advances understanding of model merging by characterizing how performance evolves with model size and expert count. By making expert merging more predictable, the scaling law can reduce unnecessary computation in large-scale model development. The techniques operate on trained models and do not introduce new objectives or data sources, so they inherit the same ethical considerations as existing LLMs.

More predictable merging may reduce the cost of deploying specialized capabilities, but it may also make powerful models more accessible. The paper does not identify negative ethical consequences unique to this method beyond those already associated with large-scale machine learning systems.

Limitations

The study centers on cross-entropy and equal-normalized composition. Extending the law to other objectives and adaptive weighting is an important next step.

The empirical evidence is broad across tested datasets, methods, and backbones, but it still does not cover extreme scales, additional modalities, or the full range of downstream metrics. Robustness, safety, and calibration remain open dimensions.

Expert capacity is controlled rather than treated as a third scaling axis. Changing LoRA rank, adapter width, training tokens, or expert quality should alter the effective-update statistics and therefore the fitted floor and tail parameters.

On the theoretical side, the paper leaves open a sharper connection between floor/tail parameters, curvature anisotropy, and domain dispersion. Better selection and ordering policies that exploit these quantities could tighten predictions and automate practical merging at scale.

BibTeX

@inproceedings{wang-2026-merging-scaling-law,
      title={Model Merging Scaling Laws in Large Language Models},
      author={Yuanyi Wang and Yanggan Gu and Yiming Zhang and Qi Zhou and Zhaoyi Yan and Congkai Xie and Xinyao Wang and Jianbo Yuan and Hongxia Yang},
      booktitle={Proceedings of the 43rd International Conference on Machine Learning},
      year={2026},
      url={https://github.com/InfiXAI/Merging-Scaling-Law},
}