PR #1841 (open)
Non-Record: Information-Theoretic Decorrelation Regularizer (val_bpb=1.1413)
by gketronDS
val_bpb: 1.1413
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.8 MB
Training Techniques
Regularization
decorrelation regularizer (a code sketch follows this section)
parameters: {"lambda":0.01,"penalty":"mean squared off-diagonal correlations"}
weight decay
parameters: {"adam_wd":0.04,"muon_wd":0.04}
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
train_length: 2048
eval_length: 2048
Weight Averaging
SWA
parameters: {"every":200,"frac":0.5}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Other
Skip the first encoder layer and the last decoder layer when applying the regularizer, so that only the middle layers contribute.
parameters: {"skip_first_layer":true,"skip_last_layer":true}
Novel Contributions
- Information-theoretic decorrelation regularizer motivated by Partial Information Decomposition theory
- Uses a lightweight correlation Frobenius proxy for total correlation during training (a possible derivation is sketched after this list)
- Applies the regularizer to hidden-state correlation matrices to encourage statistical independence across hidden dimensions
- Targets improved post-training quantization compressibility while staying under the 16MB artifact cap
- Shows convergence gains at matched step counts despite a small throughput overhead
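One way to read the "correlation Frobenius proxy": for jointly Gaussian hidden dimensions with correlation matrix R (unit diagonal), total correlation reduces to a log-determinant, and a second-order expansion around R = I yields the squared Frobenius norm of the off-diagonal correlations. This derivation is our reading of the claim, not spelled out in the PR:

```latex
\mathrm{TC}(X) = \sum_{i=1}^{d} H(X_i) - H(X_1,\dots,X_d)
  \overset{\text{Gaussian}}{=} -\tfrac{1}{2}\log\det R
  \approx \tfrac{1}{4}\,\lVert R - I \rVert_F^2
```

Here ‖R − I‖_F² is exactly the sum of squared off-diagonal correlations, so the mean-squared-off-diagonal penalty matches the proxy up to a constant scale.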