PR #1840

closed

Non-Record: Information-Theoretic Decorrelation Regularizer (val_bpb=1.1413)

by gketronDS
val_bpb: 1.1413
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.8 MB

Training Techniques

Regularization
  • correlation decorrelation regularizer: {"lambda":0.01,"penalizes":"mean squared off-diagonal correlations in hidden-state correlation matrices"}
  • weight decay: {"matrix_wd":0.04,"adam_wd":0.04}
Evaluation
  • sliding window eval: {"stride":64}
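One common way to realize sliding-window evaluation with a small stride is to score only the tokens each window adds, with full left context up to the training length. A minimal sketch, assuming a window equal to the eval length of 2048 and that `stride` means how far the scored region advances; the PR's exact windowing may differ, and the function name is illustrative:

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Return (ctx_start, score_start, end) tuples: tokens in
    [score_start, end) contribute to the loss, conditioned on left
    context back to ctx_start. Each token is scored exactly once."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        # First window scores everything; later windows score `stride` tokens.
        end = min(score_start + (window if score_start == 0 else stride), n_tokens)
        ctx_start = max(0, end - window)  # cap context at the window size
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans
```

With a stride this small, nearly every scored token sees close to the full 2048 tokens of context, at the cost of many more forward passes than a non-overlapping eval.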
Sequence Length
  • train_length: 2048
  • eval_length: 2048
Weight Averaging
  • SWA: {"frac":0.5,"every":200}
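The SWA configuration above can be sketched as a running average of checkpoints. This is a minimal sketch assuming `frac` is the final fraction of training that is averaged and `every` is the sampling interval in steps; the class and method names are hypothetical:

```python
import numpy as np

class SWA:
    """Average parameter snapshots taken every `every` steps during
    the last `frac` of training (assumed semantics of the PR's config)."""

    def __init__(self, total_steps, frac=0.5, every=200):
        self.start = int(total_steps * (1 - frac))  # first eligible step
        self.every = every
        self.avg = None  # dict of averaged arrays
        self.n = 0       # number of snapshots averaged so far

    def maybe_update(self, step, params):
        if step < self.start or step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                # Incremental mean: avg += (v - avg) / n
                self.avg[k] += (v - self.avg[k]) / self.n
```

At eval time the averaged weights would be loaded in place of the final-step weights.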
Optimizer
  • Muon: weight_decay 0.04, momentum 0.99
  • other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500}
LR Schedule
  • warmdown: {"warmdown_iters":3000}

Novel Contributions

  • Information-theoretic correlation decorrelation regularizer inspired by Partial Information Decomposition theory
  • Uses a lightweight Frobenius-norm proxy for total correlation by penalizing off-diagonal hidden-state correlations
  • Applies the regularizer selectively to middle layers, skipping input/output boundary layers
  • Demonstrates near-identical BPB to the baseline while showing better matched-step convergence after mid-training
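The regularizer described above can be sketched as follows. This is a minimal NumPy sketch, assuming the penalty is lambda times the mean squared off-diagonal entry of the feature correlation matrix of a layer's hidden states; the PR presumably computes this on-device inside the training loop and may aggregate across the selected middle layers differently:

```python
import numpy as np

def decorrelation_penalty(h, lam=0.01, eps=1e-8):
    """Penalty on hidden states h of shape (n_tokens, d_model):
    lam * mean squared off-diagonal correlation, a cheap
    Frobenius-norm-style proxy for total correlation."""
    hc = h - h.mean(axis=0, keepdims=True)          # center each feature
    z = hc / (hc.std(axis=0, keepdims=True) + eps)  # unit variance
    corr = z.T @ z / h.shape[0]                     # (d, d) correlation matrix
    d = corr.shape[0]
    off = corr - np.diag(np.diag(corr))             # zero out the diagonal
    return lam * float((off ** 2).sum()) / (d * (d - 1))
```

In training this term would be added to the cross-entropy loss for the middle layers only, per the selective application noted above; fully decorrelated features drive it to zero, while redundant features are penalized.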