val_bpb: 1.2167
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16 MB

Training Techniques
Architecture
- energy refinement loop: adds a learnable quadratic energy over the final hidden state and performs K analytic gradient-descent refinement steps before projecting to logits (sketched below).
  parameters: {"K_train": 8, "K_eval": 8, "rank": 32}
Optimizer
- Muon
  weight_decay: null
  momentum: null
  other_params: {"matrix_params": "A.weight"}
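
A hedged sketch of the parameter routing implied by other_params: 2-D weight matrices (including A.weight) go to Muon, everything else to a companion optimizer. The `muon` import path, the Muon constructor arguments, the AdamW choice, and the learning rates are all assumptions; `model` stands for the Transformer above.

```python
import torch
from muon import Muon  # assumed import; adjust to wherever your Muon implementation lives

matrix_params = [p for p in model.parameters() if p.ndim == 2]  # includes A.weight
other_params  = [p for p in model.parameters() if p.ndim != 2]  # embeddings, norms, b, ...

optimizers = [
    Muon(matrix_params, lr=0.02),             # lr illustrative; record lists no momentum/weight_decay overrides
    torch.optim.AdamW(other_params, lr=3e-4), # assumed companion optimizer for non-matrix params
]
```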
Regularization
- label smoothing
  parameters: {"auxiliary_ce_on_h0": true, "aux_loss_weight": 0.3}
- weight decay
  parameters: null
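
A sketch of how these regularizers might combine into one training loss: label-smoothed CE on the final logits plus an auxiliary CE on logits decoded from the pre-refinement hidden state h0. aux_weight = 0.3 comes from the record; the additive combination and the smoothing value 0.1 are assumptions.

```python
import torch
import torch.nn.functional as F

def training_loss(logits, aux_logits_h0, targets,
                  smoothing: float = 0.1, aux_weight: float = 0.3) -> torch.Tensor:
    """Label-smoothed main CE plus auxiliary CE on h0-decoded logits."""
    vocab = logits.size(-1)
    main = F.cross_entropy(logits.view(-1, vocab), targets.view(-1),
                           label_smoothing=smoothing)
    # Auxiliary CE on logits produced from the initial hidden state h0,
    # intended to keep the pre-refinement representation from collapsing.
    aux = F.cross_entropy(aux_logits_h0.view(-1, vocab), targets.view(-1))
    return main + aux_weight * aux
```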
Other
- optional Gaussian noise added to the initial hidden state during training.
  parameters: {"h0_noise_std": 0.05}
Quantization
- int8
  bits: 8
  scope: post-training weights
Compression
- zlib
  level: null
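
A sketch of the artifact pipeline suggested by the Quantization and Compression entries: int8 post-training weight quantization followed by zlib. Symmetric per-tensor scaling and the default zlib level are assumptions (the record leaves level: null), and a real packer would also store per-tensor scales and shapes for reconstruction.

```python
import zlib
import numpy as np

def pack_weights(state_dict) -> bytes:
    """Post-training int8 quantization (assumed symmetric, per-tensor) + zlib."""
    blobs = []
    for name, w in state_dict.items():
        w = w.detach().float().cpu().numpy()
        scale = np.abs(w).max() / 127.0 + 1e-12                # per-tensor scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blobs.append(q.tobytes())                              # scales/shapes omitted here
    return zlib.compress(b"".join(blobs))                      # default compression level
```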
Sequence Length
- train_length: 2048
- eval_length: 2048
LR Schedule
- warmdown
  parameters: {"warmdown_iters": 1200}
Novel Contributions
- Energy-Based Transformer variant with a learnable quadratic energy on the final hidden state
- Closed-form analytic refinement steps using A^T A h + b instead of autograd-based inner optimization (see the update equations after this list)
- Auxiliary CE on the initial hidden state to reduce collapse
- Diagnostic evaluation of iso-step, compute-trade, and K_eval ablations
- Identification of a DDP find_unused_parameters overhead trap that caused a false-positive baseline slowdown
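
For reference, one consistent reading of the closed-form refinement (the quadratic form of the energy is an assumption; only the gradient expression AᵀAh + b appears above):

$$E(h) = \tfrac{1}{2}\, h^\top A^\top A\, h + b^\top h, \qquad \nabla_h E(h) = A^\top A\, h + b, \qquad h_{k+1} = h_k - \eta\,\bigl(A^\top A\, h_k + b\bigr), \quad k = 0, \dots, K-1.$$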