PR #1350

open

Record: L-BFGS Causal SLOT — val_bpb 1.0046 (3-seed mean)

by resouer
val_bpb: 1.0046
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.8 MB

Training Techniques

Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"training":"used for base training"}
L-BFGS
weight_decay: null
momentum: null
other_params: {"max_iter":25,"history":20,"line_search":"strong_wolfe","space":"logit","warm_start":true,"delta_clamp":5,"focal_loss_last_tokens":128,"causal":true}
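A rough sketch of how the `causal`, `focal_loss_last_tokens`, and `delta_clamp` settings might interact is given here. It is an illustration under stated assumptions, not the PR's code: it assumes a non-initial sliding window in which only the last `stride` target positions are new, that the logit-space delta is fitted on at most the last 128 already-scored targets, and that each delta entry is clamped to [-5, 5]. The function names `slot_fit_positions` and `clamp_delta` are hypothetical.

```python
def slot_fit_positions(window_len, stride=64, focal_last=128):
    """Return (fit_positions, score_positions) for one non-initial window.

    Assumption: the last `stride` target positions are new and are the only
    ones scored; every earlier position was scored by a previous window.
    """
    first_new = window_len - stride
    score_positions = list(range(first_new, window_len))
    # Fit only on already-scored targets, restricted to the most recent ones.
    fit_start = max(0, first_new - focal_last)
    fit_positions = list(range(fit_start, first_new))
    return fit_positions, score_positions


def clamp_delta(delta, limit=5.0):
    """Clamp each logit offset to [-limit, limit] (the delta_clamp setting)."""
    return [max(-limit, min(limit, d)) for d in delta]


fit, score = slot_fit_positions(window_len=1024)
# Causality check: nothing used for fitting overlaps the tokens being scored.
assert max(fit) < min(score)
```

Keeping the fit set strictly before the scored set is what would prevent the optimizer from seeing targets it is about to be evaluated on.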
Test-Time Training
AdamW TTT
parameters: {"epochs":6,"freeze_first_blocks":2}
Quantization
GPTQ
bits: 6
scope: all
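To give a feel for what 6-bit storage means for a weight row, here is a minimal sketch of symmetric round-to-nearest quantization. This is deliberately not GPTQ: GPTQ additionally compensates rounding error using second-order statistics, which is omitted here. The per-row scale and the function names are assumptions made for illustration.

```python
def quantize_row_6bit(row, bits=6):
    """Symmetric round-to-nearest quantization of one weight row."""
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits
    scale = max(abs(w) for w in row) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


codes, scale = quantize_row_6bit([0.5, -1.0, 0.25, 0.0])
restored = dequantize_row(codes, scale)
```

With this scheme the reconstruction error per weight is bounded by half a quantization step, that is `scale / 2`.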
Evaluation
sliding window eval
parameters: {"stride":64}
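The stride-64 sliding window evaluation can be sketched as follows. The window length is not stated in this summary, so `window_len` is a hypothetical parameter; the sketch only assumes the usual scheme in which every token is scored exactly once and later windows score just their newly added tokens.

```python
def sliding_windows(n_tokens, window_len, stride=64):
    """Return (start, end, first_scored) triples covering n_tokens targets.

    The first window scores all of its positions; each later window scores
    only the tokens in [first_scored, end), using the rest as context.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


for start, end, first_scored in sliding_windows(256, window_len=128):
    print(f"context [{start}, {first_scored}) scored [{first_scored}, {end})")
```

A smaller stride gives each scored token more left context at the cost of more forward passes.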
Architecture
BigramHash
BigramHash 2048x128 used in the base model stack
parameters: {"dimensions":"2048x128"}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
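A warmdown schedule with `warmdown_steps` of 4000 could look like the sketch below. The linear shape, the decay to zero, and the `total_steps` and `base_lr` values are assumptions for illustration; the summary only states the number of warmdown steps.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Hold base_lr, then decay linearly to zero over the final warmdown_steps."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / warmdown_steps


# Hypothetical run of 10,000 steps with a base learning rate of 1.0.
for step in (0, 6000, 8000, 10000):
    print(step, warmdown_lr(step, total_steps=10000, base_lr=1.0))
```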
Regularization
weight decay
parameters: {"damp":0.005}

Novel Contributions

  • Causal L-BFGS SLOT in logit space
  • Optimization restricted to already-scored context positions only
  • Provably causal SLOT mechanism that avoids future-token leakage
  • Pre-quant AdamW test-time training before GPTQ
  • Coprime-stride multi-shard data loader
  • Configuration tuning with QK_GAIN, warmdown, and GPTQ damping
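
The coprime-stride idea from the list above can be illustrated with a short sketch. It shows only the general number-theoretic property (stepping through `n_shards` with a stride coprime to `n_shards` visits every shard once before repeating); how the PR's loader actually picks strides and offsets is not shown here, and both function names are hypothetical.

```python
from math import gcd


def coprime_stride(n_shards, preferred):
    """Return the smallest stride >= preferred that is coprime with n_shards."""
    stride = max(1, preferred)
    while gcd(stride, n_shards) != 1:
        stride += 1
    return stride


def shard_order(n_shards, stride, offset=0):
    """Visit order (offset + i * stride) mod n_shards; a permutation when coprime."""
    return [(offset + i * stride) % n_shards for i in range(n_shards)]


stride = coprime_stride(10, preferred=4)
order = shard_order(10, stride)
# Every shard appears exactly once in the visit order.
assert sorted(order) == list(range(10))
```

Giving each worker a different `offset` would let several readers walk the same permutation without starting on the same shard.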