| val_bpb | Architecture | Optimizer | Artifact Size |
| ------- | ------------ | --------- | ------------- |
| 1.1248  | Transformer  | Muon      | —             |
## Training Techniques

### Quantization
- **int6** (`bits: 6`, `scope: all`)
- **QAT** (`bits: null`, `scope: all`)
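The record names int6 quantization-aware training but gives no mechanism. A minimal sketch of symmetric per-tensor int6 fake quantization, the usual QAT forward pass (the straight-through backward is omitted; everything here is an assumption, not the submission's code):

```python
def fake_quant_int6(weights):
    """Fake-quantize a list of floats to a symmetric 6-bit grid (-32..31)."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / 31.0  # largest magnitude maps to +/-31
    quantized = [max(-32, min(31, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]  # dequantize back to float
```

During QAT the model trains against these rounded weights, so the final int6 artifact behaves like the network seen during training.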
### Architecture
- **SmearGate**: custom gating mechanism used in the model
- **BigramHash**: bigram hashing embedding/vocabulary mechanism (`vocab_size: 2048`)
- **MLP3x**: three-layer MLP blocks (`layers: 3`)
- **XSA**: XSA applied to the last layers of the model (`last_n_layers: 4`)
- **RoPE**: rotary positional embeddings with NTK scaling and partial application (`sequence_length: 2048`)
- **Partial RoPE**: applies RoPE to only a subset of dimensions (`dimensions: 16` of `total_dimensions: 64`)
- **Tied embeddings**: uses tied embeddings / value embeddings
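Partial RoPE above rotates only 16 of 64 head dimensions and leaves the rest untouched. A sketch under that reading (plain RoPE with the conventional base 10000; the NTK scaling of the base mentioned above is omitted, and the adjacent-pair scheme is an assumption):

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec` by position-dependent
    angles; the remaining dimensions pass through unchanged."""
    out = list(vec)
    for i in range(rotary_dims // 2):
        theta = position * base ** (-2.0 * i / rotary_dims)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(theta) - y * math.sin(theta)
        out[2 * i + 1] = x * math.sin(theta) + y * math.cos(theta)
    return out
```

Because each pair is a pure rotation, the vector's norm is preserved while relative positions become recoverable from dot products on the rotated slice.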
### Initialization
- **OrthoInit**: orthogonal initialization with muP
### Optimizer
- **Muon**: `weight_decay: 0.04`, `momentum: 0.99` (warmup: `momentum_warmup_start: 0.92`, `momentum_warmup_steps: 1500`)
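The Muon entry reports a momentum warmup from 0.92 to 0.99 over 1500 steps, but not the warmup's shape. A linear interpolation is one plausible reading:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Assumed linear momentum warmup; held at `end` after warmup."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```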
### Weight Averaging
- **EMA** (`decay: 0.997`)
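EMA weight averaging with decay 0.997 keeps a shadow copy of the weights that moves a fraction (1 − decay) toward the live weights after every optimizer step; evaluation uses the shadow copy. A minimal sketch (flat weight lists stand in for real parameter tensors):

```python
def ema_update(shadow, weights, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```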
### Evaluation
- **Sliding window eval** (`stride: 64`)
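Sliding-window evaluation with stride 64 scores each token exactly once while giving it up to a full window of left context. A sketch of the window bookkeeping (the window length of 2048 is borrowed from the `sequence_length` above and is an assumption here):

```python
def sliding_windows(num_tokens, window=2048, stride=64):
    """Return (start, end, first_scored) spans: the model reads tokens
    [start, end) but only positions [first_scored, end) count toward loss."""
    spans = []
    pos = 0
    while pos < num_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, num_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans
```

A small stride costs more forward passes but keeps the evaluated context close to the full window for every token.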
### Test-Time Training
- **Online logit bias** (`learning_rate: 0.1`, `momentum: 0.9`)
### LR Schedule
- **Warmdown** (`warmdown_iters: 3000`)
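The schedule names a warmdown over the final 3000 iterations but not its shape; a constant-then-linear-to-zero decay is the common form and is assumed here:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    """Assumed shape: hold base_lr, then decay linearly to 0 at the end."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```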
### Regularization
- **Weight decay** (`muon_wd: 0.04`, `adam_wd: 0.04`)
### Other
- **Online learned logit bias**: bias vector updated online during validation to correct the logits, using the exact cross-entropy gradient (`olb_lr: 0.1`, `olb_momentum: 0.9`)
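The OLB description pins down the update: the gradient of cross-entropy with respect to an additive logit bias is softmax(logits + bias) minus the one-hot target. A sketch with the reported lr 0.1 and momentum 0.9 (the standard SGD-with-momentum formulation is an assumption):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def olb_step(bias, velocity, logits, target, lr=0.1, momentum=0.9):
    """One online-logit-bias update after observing one validation token."""
    probs = softmax([l + b for l, b in zip(logits, bias)])
    # Exact cross-entropy gradient w.r.t. the bias: probs - onehot(target).
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    bias = [b - lr * v for b, v in zip(bias, velocity)]
    return bias, velocity
```

The bias vector adds no model parameters and each update is a single softmax plus vector ops, which matches the "zero-parameter, near-zero-compute" claim below.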
## Novel Contributions
- Online logit bias (OLB) learned during sliding window evaluation
- Exact cross-entropy gradient update for the bias vector
- Zero-parameter, near-zero-compute test-time correction
- Combination of QAT, TTT-style evaluation adaptation, and value/tied embeddings