PR #2018
openRecord: Gated XSA + LQER top-1 + strict in-timer n-gram TTT (val_bpb: 1.046)
by simon-marcus
val_bpb
1.0462
Architecture
Transformer
Optimizer
—
Artifact Size
15,996,490 bytes
Training Techniques
Architecture
XSA
Gated XSA: a learned per-head scalar gate, tanh(xsa_alpha), multiplies the XSA subtraction coefficient.
parameters: {"gated":true}
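The PR does not spell out the XSA internals, so the following is a minimal sketch of just the gating step: assuming the XSA block subtracts some correction term from the attention output, the per-head gate tanh(xsa_alpha) scales that subtraction (the names `attn_out`, `sub_term`, and `base_coeff` are illustrative).

```python
import numpy as np

def gated_xsa_mix(attn_out, sub_term, xsa_alpha, base_coeff=1.0):
    # attn_out, sub_term: (heads, seq, dim); xsa_alpha: (heads,) learned scalars.
    # The gate tanh(xsa_alpha) lies in (-1, 1); at alpha = 0 the subtraction
    # is disabled and the block reduces to the ungated output.
    gate = np.tanh(xsa_alpha)[:, None, None]
    return attn_out - base_coeff * gate * sub_term
```

With zero-initialized xsa_alpha this starts as a no-op, which would let the gate learn how much subtraction each head actually needs.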
LeakyReLU
Uses LeakyReLU with negative slope 0.3 in the base stack.
parameters: {"slope":0.3}
Quantization
GPTQ-lite
bits: null
scope: model artifact
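The exact "GPTQ-lite" procedure and bit width are not specified in the record, so as a stand-in, here is plain symmetric per-output-channel round-to-nearest quantization of a weight matrix; the 8-bit default is illustrative only.

```python
import numpy as np

def quantize_per_channel(w, bits=8):
    # Hypothetical GPTQ-lite stand-in: symmetric per-row quantization.
    # One scale per output channel, chosen so the row max maps to qmax.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    # Reconstruct an approximate float weight for inference.
    return q * scale
```

Real GPTQ additionally uses second-order (Hessian-weighted) error compensation; this sketch only shows the storage-side effect on the artifact.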
Test-Time Training
score-first TTT
parameters: {"phased":true,"phases":1,"prefix_docs":1000}
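Reading phases=1 and prefix_docs=1000 as "score candidate documents first, then run a single TTT phase on the best 1,000," the selection step could be sketched as follows; `score_fn` is a placeholder for whatever cheap relevance scorer the PR uses.

```python
def select_prefix_docs(docs, score_fn, prefix_docs=1000):
    # Score-first TTT, selection step: rank candidate documents by a cheap
    # relevance score and keep only the top `prefix_docs` for fine-tuning.
    ranked = sorted(docs, key=score_fn, reverse=True)
    return ranked[:prefix_docs]
```

Capping the prefix at 1,000 documents bounds the adaptation cost regardless of corpus size, which is what makes the phased variant cheaper.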
LoRA TTT
parameters: {"rank":80,"local_lr_mult":0.75,"mask":"no_qv"}
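A minimal LoRA sketch matching these parameters, with one assumption flagged: mask="no_qv" is read here as "leave the q and v projections frozen" (the record does not define the mask semantics, and the module names `q_proj`/`v_proj` are illustrative).

```python
import numpy as np

class LoRA:
    # Standard LoRA pair: effective update is B @ A, rank-limited.
    def __init__(self, d_in, d_out, rank=80, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.01, (rank, d_in))
        self.B = np.zeros((d_out, rank))  # zero init: adapter starts as a no-op

    def delta(self):
        return self.B @ self.A

def apply_lora(name, W, lora, mask="no_qv"):
    # Assumed reading of mask="no_qv": skip q and v projection weights.
    if mask == "no_qv" and ("q_proj" in name or "v_proj" in name):
        return W
    return W + lora.delta()
```

During TTT the adapter weights would be trained at 0.75x the base learning rate (local_lr_mult), keeping the frozen backbone untouched.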
Evaluation
n-gram tilt
parameters: {"precompute_inside_timer":true}
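The tilt itself is not detailed in the record; a common form, sketched below with bigrams and add-1 smoothing as illustrative choices, adds a scaled smoothed n-gram log-probability to the model's logits. The "strict in-timer" point is that the count table is built after the evaluation clock starts.

```python
import math
from collections import Counter

def build_bigram_counts(token_ids):
    # precompute_inside_timer: called once eval begins, so this cost is
    # charged to evaluation time rather than hidden in setup.
    return Counter(zip(token_ids, token_ids[1:]))

def tilt_logits(logits, prev_token, counts, lam=0.5, vocab=256):
    # Add lam * log p_ngram(t | prev_token) to each logit, with add-1 smoothing.
    total = sum(c for (a, _), c in counts.items() if a == prev_token) + vocab
    return [x + lam * math.log((counts.get((prev_token, t), 0) + 1) / total)
            for t, x in enumerate(logits)]
```

Tokens that frequently follow the current one get a boost proportional to lam; unseen continuations are only mildly penalized thanks to the smoothing.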
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Regularization
logit softcap
parameters: {"asym_logit_rescale":true}
Other
other
LQER top-1 keeps only the best LQER correction tensor to reduce artifact size.
parameters: {"lqer_rank":4,"top_k":1,"asymmetric":true}
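LQER approximates each layer's quantization error with a low-rank SVD correction; "top-1" here means shipping only the single most useful correction. A sketch under that reading, using Frobenius-norm error reduction as the (assumed) selection criterion:

```python
import numpy as np

def lqer_correction(W, Wq, rank=4):
    # Rank-4 SVD approximation of the quantization error W - Wq (lqer_rank=4).
    U, s, Vt = np.linalg.svd(W - Wq, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def keep_top1(layers):
    # layers: {name: (W, Wq)}. Keep only the correction tensor that removes
    # the most error; dropping the rest is what shrinks the artifact.
    def gain(item):
        W, Wq = item[1]
        err = W - Wq
        return np.linalg.norm(err) - np.linalg.norm(err - lqer_correction(W, Wq))
    name, (W, Wq) = max(layers.items(), key=gain)
    return {name: lqer_correction(W, Wq)}
```

The asymmetric flag in the parameters suggests the error model also absorbs a per-channel offset, which this sketch omits.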
other
CaseOps SP8192 lossless-caps tokenizer and byte-sidecar dataset setup.
parameters: {"vocab_size":8192}
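"Lossless caps" typically means lowercasing the text before tokenization while recording case so the transform is invertible; the exact CaseOps scheme is not given, so this round-trip sketch uses a per-character sentinel marker (the marker byte is a placeholder).

```python
CAP = "\u0001"  # hypothetical case-marker sentinel, not a printable character

def encode_caps(text):
    # Lowercase everything, prefixing each originally-uppercase character
    # with a marker so the original casing can be restored exactly.
    return "".join(CAP + c.lower() if c.isupper() else c for c in text)

def decode_caps(text):
    # Invert encode_caps: a marker means "uppercase the next character".
    out, i = [], 0
    while i < len(text):
        if text[i] == CAP:
            out.append(text[i + 1].upper())
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Shrinking the effective character distribution this way lets the 8,192-entry vocabulary spend no merges on capitalized variants, while the byte sidecar would cover anything outside the vocabulary.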
Novel Contributions
- Gated XSA with a learned per-head gate on the XSA subtraction coefficient
- LQER top-1 to reduce artifact size while preserving the best correction tensor
- Strict in-timer n-gram tilt with hint precompute counted inside evaluation time
- Cheaper phased score-first TTT with a 1,000-document prefix
- CaseOps SP8192 lossless-caps tokenizer and byte-sidecar setup