val_bpb: 1.1537
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15331125 bytes
Training Techniques
Architecture
BigramHash
Adds bigram token-pair features on the input path.
parameters: {"bigram_vocab_size":4096,"bigram_dim":128}
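A minimal sketch of how such pair features can be computed. The multiplicative hash constant below is illustrative; the source specifies only the bucket count (4096) and embedding dim (128).

```python
import numpy as np

def bigram_hash_ids(token_ids, bigram_vocab_size=4096):
    """Hash each (previous, current) token pair to a bigram bucket.

    The multiplier 1000003 is an assumed mixing constant, not the
    run's actual hash.
    """
    prev = np.concatenate(([0], token_ids[:-1]))   # pad the first position
    return (prev * 1000003 + token_ids) % bigram_vocab_size

# The resulting ids index a (4096, 128) embedding table whose rows are
# added to the token embeddings on the input path.
tokens = np.array([17, 42, 42, 7])
bigram_ids = bigram_hash_ids(tokens)
```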
SmearGate
Blends each token representation with the previous token to smooth inputs.
parameters: null
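One plausible form of this smoothing, assuming a scalar sigmoid gate and a convex blend (the source records no parameters, so both are assumptions):

```python
import numpy as np

def smear_gate(x, gate_logit=0.0):
    """Blend each position with the previous one.

    Assumed form: y_t = (1 - g) * x_t + g * x_{t-1}, with a learned
    scalar gate g = sigmoid(gate_logit). Position 0 blends with zeros.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))          # scalar gate in (0, 1)
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)
    return (1.0 - g) * x + g * prev
```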
Quantization
mixed int6
bits: 6
scope: large MLP and attention matrices
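A sketch of one plausible reading of this scheme: symmetric per-tensor int6 quantization. Signed int6 spans [-32, 31]; a symmetric variant uses [-31, 31] so zero maps exactly to zero. Per-tensor (rather than per-channel) scaling is an assumption here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization to the range [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes and a scale."""
    return q.astype(np.float32) * scale
```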
Weight Averaging
SWA
parameters: {"start_frac":0.5,"every":200}
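With start_frac=0.5 and every=200, averaging begins halfway through training and the running mean is refreshed every 200 steps. A sketch:

```python
import numpy as np

class SWAAverager:
    """Running average of weights over the late, low-LR phase."""

    def __init__(self, total_steps, start_frac=0.5, every=200):
        self.start = int(start_frac * total_steps)
        self.every = every
        self.n = 0          # number of snapshots averaged so far
        self.avg = None

    def update(self, step, params):
        # Only average once past start_frac, on every-th steps.
        if step < self.start or step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [np.array(p, dtype=np.float64) for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (np.asarray(p, dtype=np.float64) - a) / self.n
```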
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"backend_steps":5}
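The momentum_warmup_* entries suggest a ramp from 0.92 to the final 0.99 over the first 1500 steps. A linear ramp shape is assumed; the source records only the endpoints and warmup length:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup: linear ramp from start to final, then constant."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)
```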
Evaluation
sliding-window eval
parameters: {"stride":64}
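With stride=64 and a 2048-token context, each window after the first scores only its new tokens; a sketch of the span planning (the helper name is hypothetical):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Plan sliding-window evaluation spans.

    Each span is (ctx_start, ctx_end, first_scored): the model sees
    tokens [ctx_start, ctx_end) but only tokens from first_scored on
    are scored, so strided and truncated tail windows never rescore
    tokens already counted.
    """
    spans = []
    scored = 0
    while scored < n_tokens:
        step = window if scored == 0 else stride
        end = min(scored + step, n_tokens)
        spans.append((max(0, end - window), end, scored))
        scored = end
    return spans
```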
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
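A common warmdown shape holds the base LR constant, then decays linearly to zero over the final warmdown_iters steps. The constant-then-linear shape is an assumption; the source records only warmdown_iters=3000:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    """Hold base_lr, then decay linearly to zero over the last
    warmdown_iters steps."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```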
Regularization
weight decay
parameters: {"adam_weight_decay":0.01,"muon_weight_decay":0.02}
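The two coefficients indicate decoupled (AdamW-style) weight decay applied per optimizer group, 0.01 for Adam-managed parameters and 0.02 for Muon-managed ones. A minimal sketch:

```python
import numpy as np

def apply_decoupled_weight_decay(params, lr, weight_decay):
    """Decoupled weight decay: shrink weights directly, separately
    from the gradient step, using the group's coefficient."""
    for p in params:
        p *= 1.0 - lr * weight_decay
```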
Initialization
OrthoInit
Orthogonal initialization is used for the model's weight matrices.
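Orthogonal initialization is commonly implemented via QR decomposition of a Gaussian matrix; a sketch of that standard recipe (the exact variant and gain used in this run are not specified):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=None):
    """Orthogonal init via QR of a random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))        # fix column signs for a uniform Q
    if rows < cols:
        q = q.T                     # orthonormal rows for wide matrices
    return gain * q[:rows, :cols]
```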
Other
other
Fixed the sliding-window evaluator to avoid rescoring overlapping tail tokens in truncated windows.
parameters: null
Novel Contributions
- Adds BigramHash token-pair features to the input path
- Introduces SmearGate input smoothing
- Uses mixed int6 export for large attention and MLP matrices
- Applies SWA over the late low-learning-rate phase
- Uses Muon with tuned momentum and weight decay
- Fixes the sliding-window evaluation bug that previously double-counted tail tokens
- Updates the canonical metric using an exact re-evaluation of the saved seed=1337 checkpoint