PR #1840

closed

Non-Record: Information-Theoretic Decorrelation Regularizer (val_bpb=1.1413)

by gketronDS
val_bpb: 1.1413
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.8 MB

Training Techniques

Regularization
  • correlation decorrelation regularizer: {"lambda":0.01,"penalizes":"mean squared off-diagonal correlations in hidden-state correlation matrices"}
  • weight decay: {"matrix_wd":0.04,"adam_wd":0.04}
Evaluation
  • sliding window eval: {"stride":64}
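One common way to realize sliding-window evaluation with a small stride is to score only the tokens each window adds, with full left context up to the training length. A minimal sketch, assuming a window equal to the eval length of 2048 and that `stride` means how far the scored region advances; the PR's exact windowing may differ, and the function name is illustrative:

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Return (ctx_start, score_start, end) tuples: tokens in
    [score_start, end) contribute to the loss, conditioned on left
    context back to ctx_start. Each token is scored exactly once."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        # First window scores everything; later windows score `stride` tokens.
        end = min(score_start + (window if score_start == 0 else stride), n_tokens)
        ctx_start = max(0, end - window)  # cap context at the window size
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans
```

With a stride this small, nearly every scored token sees close to the full 2048 tokens of context, at the cost of many more forward passes than a non-overlapping eval.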
Sequence Length
  • train_length: 2048
  • eval_length: 2048
Weight Averaging
  • SWA: {"frac":0.5,"every":200}
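The SWA configuration above can be sketched as a running average of checkpoints. This is a minimal sketch assuming `frac` is the final fraction of training that is averaged and `every` is the sampling interval in steps; the class and method names are hypothetical:

```python
import numpy as np

class SWA:
    """Average parameter snapshots taken every `every` steps during
    the last `frac` of training (assumed semantics of the PR's config)."""

    def __init__(self, total_steps, frac=0.5, every=200):
        self.start = int(total_steps * (1 - frac))  # first eligible step
        self.every = every
        self.avg = None  # dict of averaged arrays
        self.n = 0       # number of snapshots averaged so far

    def maybe_update(self, step, params):
        if step < self.start or step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                # Incremental mean: avg += (v - avg) / n
                self.avg[k] += (v - self.avg[k]) / self.n
```

At eval time the averaged weights would be loaded in place of the final-step weights.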
Optimizer
  • Muon: weight_decay 0.04, momentum 0.99
  • other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500}
LR Schedule
  • warmdown: {"warmdown_iters":3000}

Novel Contributions

  • Information-theoretic correlation decorrelation regularizer inspired by Partial Information Decomposition theory
  • Uses a lightweight Frobenius-norm proxy for total correlation by penalizing off-diagonal hidden-state correlations
  • Applies the regularizer selectively to middle layers, skipping input/output boundary layers
  • Demonstrates near-identical BPB to the baseline while showing better matched-step convergence after mid-training
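The regularizer described above can be sketched as follows. This is a minimal NumPy sketch, assuming the penalty is lambda times the mean squared off-diagonal entry of the feature correlation matrix of a layer's hidden states; the PR presumably computes this on-device inside the training loop and may aggregate across the selected middle layers differently:

```python
import numpy as np

def decorrelation_penalty(h, lam=0.01, eps=1e-8):
    """Penalty on hidden states h of shape (n_tokens, d_model):
    lam * mean squared off-diagonal correlation, a cheap
    Frobenius-norm-style proxy for total correlation."""
    hc = h - h.mean(axis=0, keepdims=True)          # center each feature
    z = hc / (hc.std(axis=0, keepdims=True) + eps)  # unit variance
    corr = z.T @ z / h.shape[0]                     # (d, d) correlation matrix
    d = corr.shape[0]
    off = corr - np.diag(np.diag(corr))             # zero out the diagonal
    return lam * float((off ** 2).sum()) / (d * (d - 1))
```

In training this term would be added to the cross-entropy loss for the middle layers only, per the selective application noted above; fully decorrelated features drive it to zero, while redundant features are penalized.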