val_bpb: 1.2167
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16 MB

Training Techniques
Architecture
- energy refinement loop: adds a learnable quadratic energy over the final hidden state and performs K analytic gradient-descent refinement steps before projecting to logits (sketched below).
  parameters: {"K_train": 8, "K_eval": 8, "rank": 32}
Optimizer
- Muon
  weight_decay: null
  momentum: null
  other_params: {"matrix_params": "A.weight"}
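
A hedged sketch of the parameter routing implied by other_params: 2-D weight matrices (including A.weight) go to Muon, everything else to a companion optimizer. The `muon` import path, the Muon constructor arguments, the AdamW choice, and the learning rates are all assumptions; `model` stands for the Transformer above.

```python
import torch
from muon import Muon  # assumed import; adjust to wherever your Muon implementation lives

matrix_params = [p for p in model.parameters() if p.ndim == 2]  # includes A.weight
other_params  = [p for p in model.parameters() if p.ndim != 2]  # embeddings, norms, b, ...

optimizers = [
    Muon(matrix_params, lr=0.02),             # lr illustrative; record lists no momentum/weight_decay overrides
    torch.optim.AdamW(other_params, lr=3e-4), # assumed companion optimizer for non-matrix params
]
```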
Regularization
- label smoothing
  parameters: {"auxiliary_ce_on_h0": true, "aux_loss_weight": 0.3}
- weight decay
  parameters: null
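
A sketch of how these regularizers might combine into one training loss: label-smoothed CE on the final logits plus an auxiliary CE on logits decoded from the pre-refinement hidden state h0. aux_weight = 0.3 comes from the record; the additive combination and the smoothing value 0.1 are assumptions.

```python
import torch
import torch.nn.functional as F

def training_loss(logits, aux_logits_h0, targets,
                  smoothing: float = 0.1, aux_weight: float = 0.3) -> torch.Tensor:
    """Label-smoothed main CE plus auxiliary CE on h0-decoded logits."""
    vocab = logits.size(-1)
    main = F.cross_entropy(logits.view(-1, vocab), targets.view(-1),
                           label_smoothing=smoothing)
    # Auxiliary CE on logits produced from the initial hidden state h0,
    # intended to keep the pre-refinement representation from collapsing.
    aux = F.cross_entropy(aux_logits_h0.view(-1, vocab), targets.view(-1))
    return main + aux_weight * aux
```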
Other
- optional Gaussian noise added to the initial hidden state during training.
  parameters: {"h0_noise_std": 0.05}
Quantization
- int8
  bits: 8
  scope: post-training weights
Compression
- zlib
  level: null
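
A sketch of the artifact pipeline suggested by the Quantization and Compression entries: int8 post-training weight quantization followed by zlib. Symmetric per-tensor scaling and the default zlib level are assumptions (the record leaves level: null), and a real packer would also store per-tensor scales and shapes for reconstruction.

```python
import zlib
import numpy as np

def pack_weights(state_dict) -> bytes:
    """Post-training int8 quantization (assumed symmetric, per-tensor) + zlib."""
    blobs = []
    for name, w in state_dict.items():
        w = w.detach().float().cpu().numpy()
        scale = np.abs(w).max() / 127.0 + 1e-12                # per-tensor scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blobs.append(q.tobytes())                              # scales/shapes omitted here
    return zlib.compress(b"".join(blobs))                      # default compression level
```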
Sequence Length
- train_length: 2048
- eval_length: 2048
LR Schedule
- warmdown
  parameters: {"warmdown_iters": 1200}
Novel Contributions
- Energy-Based Transformer variant with a learnable quadratic energy on the final hidden state
- Closed-form analytic refinement steps using A^T A h + b instead of autograd-based inner optimization (see the update equations after this list)
- Auxiliary CE on the initial hidden state to reduce collapse
- Diagnostic evaluation of iso-step, compute-trade, and K_eval ablations
- Identification of a DDP find_unused_parameters overhead trap that caused a false-positive baseline slowdown
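
For reference, one consistent reading of the closed-form refinement (the quadratic form of the energy is an assumption; only the gradient expression AᵀAh + b appears above):

$$E(h) = \tfrac{1}{2}\, h^\top A^\top A\, h + b^\top h, \qquad \nabla_h E(h) = A^\top A\, h + b, \qquad h_{k+1} = h_k - \eta\,\bigl(A^\top A\, h_k + b\bigr), \quad k = 0, \dots, K-1.$$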