PR #2018 (open)

Record: Gated XSA + LQER top-1 + strict in-timer n-gram TTT (val_bpb: 1.046)

by simon-marcus
val_bpb: 1.0462
Architecture: Transformer
Optimizer:
Artifact Size: 15,996,490 bytes

Training Techniques

Architecture
  • XSA: Gated XSA with a learned per-head scalar gate multiplying the XSA subtraction coefficient via tanh(xsa_alpha). parameters: {"gated":true}
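The gating itself is simple to sketch. Only the per-head scalar and the tanh(xsa_alpha) form are from this PR; XSA's internals, the base coefficient, and all names below are assumptions:

```python
import math

def gated_subtraction_coeff(base_coeff, xsa_alpha):
    # xsa_alpha: one learned scalar per head. tanh keeps each gate in
    # (-1, 1), so a head can scale, zero out, or flip the sign of its
    # subtraction term. The per-head layout is an assumption.
    return [base_coeff * math.tanh(a) for a in xsa_alpha]

# Hypothetical usage: 4 heads, base subtraction coefficient 1.0.
coeff = gated_subtraction_coeff(1.0, [0.0, 0.5, -0.5, 3.0])
```

A zero-initialized xsa_alpha would start every head with the subtraction disabled and let training open the gates gradually.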
  • LeakyReLU: uses LeakyReLU with slope 0.3 in the base stack. parameters: {"slope":0.3}
Quantization
  • GPTQ-lite: bits: null; scope: model artifact
Test-Time Training
  • score-first TTT: parameters: {"phased":true,"phases":1,"prefix_docs":1000}
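A sketch of one plausible phased, score-first loop. Only phases=1 and prefix_docs=1000 come from the parameters; the selection rule and every name below are assumptions:

```python
def score_first_ttt(docs, score_fn, adapt_fn, prefix_docs=1000, phases=1):
    # Assumed scheme: score candidate documents first with a cheap
    # score_fn, then run test-time training only on the top-scoring
    # prefix of them, once per phase. This is what makes the phased
    # variant cheaper than adapting on the full stream.
    ranked = sorted(docs, key=score_fn, reverse=True)
    prefix = ranked[:prefix_docs]
    for _ in range(phases):
        adapt_fn(prefix)
    return prefix

# Hypothetical usage with toy "documents" and a recording adapt_fn.
seen = []
prefix = score_first_ttt(list(range(10)), score_fn=lambda d: d,
                         adapt_fn=seen.append, prefix_docs=3)
```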
  • LoRA TTT: parameters: {"rank":80,"local_lr_mult":0.75,"mask":"no_qv"}
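The mask and learning-rate parameters suggest which modules get adapters and at what rate. A sketch under the assumption that "no_qv" means skipping the query and value projections (the module-name suffixes are hypothetical):

```python
RANK = 80            # LoRA rank, from the PR parameters
LOCAL_LR_MULT = 0.75 # local learning-rate multiplier, from the PR parameters

def adapts(param_name):
    # Assumed reading of mask="no_qv": attach LoRA adapters everywhere
    # except the query and value projections.
    return not (param_name.endswith("q_proj") or param_name.endswith("v_proj"))

def local_lr(base_lr, param_name):
    # TTT updates run at a reduced local learning rate; masked-out
    # modules get no update at all.
    return base_lr * LOCAL_LR_MULT if adapts(param_name) else 0.0
```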
Evaluation
  • n-gram tilt: parameters: {"precompute_inside_timer":true}
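A sketch of what strict in-timer accounting means here, with a hypothetical count-based tilt (the actual tilt formula is not given; only precompute_inside_timer is from the parameters):

```python
import math
import time
from collections import Counter

def ngram_counts(tokens, n=3):
    # Hint table: counts of every n-gram in the (assumed) prefix stream.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tilted_logprob(base_logprob, context, token, counts, n=3, lam=0.1):
    # Hypothetical tilt: boost a candidate token by lam * log1p(count)
    # of the n-gram it would complete.
    key = tuple(context[-(n - 1):]) + (token,)
    return base_logprob + lam * math.log1p(counts[key])

# Strict in-timer accounting: the hint table is built INSIDE the timed
# region, so precompute cost is charged to evaluation time.
start = time.perf_counter()
counts = ngram_counts(list(b"abcabcab"), n=3)
lp = tilted_logprob(-2.0, list(b"abcab"), ord("c"), counts)
elapsed = time.perf_counter() - start
```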
Sequence Length
  • sequence_length: train_length: null; eval_length: 2560
Regularization
  • logit softcap: parameters: {"asym_logit_rescale":true}
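Logit softcapping conventionally means cap * tanh(logit / cap). A sketch of one plausible reading of asym_logit_rescale, with made-up cap values that differ on the positive and negative sides:

```python
import math

def softcap(logit, cap_pos=30.0, cap_neg=20.0):
    # Standard softcap bounds logits smoothly at +/- cap while staying
    # near-identity for small values. The asymmetric variant (assumed
    # meaning of asym_logit_rescale) uses a different cap per side;
    # the 30/20 values here are illustrative, not from the PR.
    cap = cap_pos if logit >= 0 else cap_neg
    return cap * math.tanh(logit / cap)
```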
Other
  • LQER top-1: keeps only the best LQER correction tensor to reduce artifact size. parameters: {"lqer_rank":4,"top_k":1,"asymmetric":true}
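The top-1 selection can be sketched as follows. LQER fits a low-rank (here rank-4) approximation of the quantization error W - W_q per layer; the ranking criterion (Frobenius norm) and the dict layout below are assumptions, with only top_k=1 taken from the parameters:

```python
def frob_norm_sq(tensor):
    # Squared Frobenius norm of a 2-D list-of-lists tensor.
    return sum(x * x for row in tensor for x in row)

def keep_top_k(corrections, k=1):
    # top_k=1: of the per-layer LQER correction tensors, keep only the
    # one with the largest (assumed) norm-based benefit and drop the
    # rest, shrinking the shipped artifact.
    ranked = sorted(corrections.items(),
                    key=lambda kv: frob_norm_sq(kv[1]), reverse=True)
    return dict(ranked[:k])

# Hypothetical usage: two layers, layer1 has the larger correction.
corrections = {"layer0": [[0.1, 0.0], [0.0, 0.1]],
               "layer1": [[1.0, 0.0], [0.0, 1.0]]}
kept = keep_top_k(corrections, k=1)
```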
  • CaseOps SP8192: lossless-caps tokenizer and byte-sidecar dataset setup. parameters: {"vocab_size":8192}
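One plausible reading of "lossless-caps" with a byte sidecar, sketched for ASCII text (the actual CaseOps scheme is not described in the summary):

```python
def split_caps(text):
    # Assumed scheme: the tokenizer sees lowercased text, while a byte
    # sidecar records which characters were uppercase, so the original
    # string is exactly recoverable. Safe for ASCII; general Unicode
    # casing would need more care.
    return text.lower(), bytes(c.isupper() for c in text)

def restore_caps(lowered, caps_sidecar):
    # Reapply capitalization from the sidecar, byte by byte.
    return "".join(c.upper() if up else c
                   for c, up in zip(lowered, caps_sidecar))

# Hypothetical round trip.
lowered, sidecar = split_caps("Hello World")
restored = restore_caps(lowered, sidecar)
```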

Novel Contributions

  • Gated XSA with a learned per-head gate on the XSA subtraction coefficient
  • LQER top-1 to reduce artifact size while preserving the best correction tensor
  • Strict in-timer n-gram tilt with hint precompute counted inside evaluation time
  • Cheaper phased score-first TTT with a 1,000-document prefix
  • CaseOps SP8192 lossless-caps tokenizer and byte-sidecar setup