PR #1841 (open)
Non-Record: Information-Theoretic Decorrelation Regularizer (val_bpb=1.1413)
by gketronDS
val_bpb: 1.1413
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.8 MB
Training Techniques
Regularization
decorrelation regularizer (a code sketch follows this section)
parameters: {"lambda":0.01,"penalty":"mean squared off-diagonal correlations"}
weight decay
parameters: {"adam_wd":0.04,"muon_wd":0.04}
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
train_length: 2048
eval_length: 2048
Weight Averaging
SWA
parameters: {"every":200,"frac":0.5}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Other
Skip the first encoder layer and the last decoder layer when applying the regularizer, so that only the middle layers contribute.
parameters: {"skip_first_layer":true,"skip_last_layer":true}
Novel Contributions
- Information-theoretic decorrelation regularizer motivated by Partial Information Decomposition theory
- Uses a lightweight correlation Frobenius proxy for total correlation during training (a possible derivation is sketched after this list)
- Applies the regularizer to hidden-state correlation matrices to encourage statistical independence across hidden dimensions
- Targets improved post-training quantization compressibility while staying under the 16MB artifact cap
- Shows convergence gains at matched step counts despite a small throughput overhead
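One way to read the "correlation Frobenius proxy": for jointly Gaussian hidden dimensions with correlation matrix R (unit diagonal), total correlation reduces to a log-determinant, and a second-order expansion around R = I yields the squared Frobenius norm of the off-diagonal correlations. This derivation is our reading of the claim, not spelled out in the PR:

```latex
\mathrm{TC}(X) = \sum_{i=1}^{d} H(X_i) - H(X_1,\dots,X_d)
  \overset{\text{Gaussian}}{=} -\tfrac{1}{2}\log\det R
  \approx \tfrac{1}{4}\,\lVert R - I \rVert_F^2
```

Here ‖R − I‖_F² is exactly the sum of squared off-diagonal correlations, so the mean-squared-off-diagonal penalty matches the proxy up to a constant scale.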