PR #1798

closed

Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287

by leon2k2k2k
val_bpb: 1.0629
Architecture: Transformer
Optimizer:
Artifact Size: ~15.91 MB

Training Techniques

Test-Time Training
LoRA TTT
parameters: {"phased":true,"score_first":true}
Architecture
Gated Attention
Replaces dense attention output gating with a narrow-input sparse gate; includes learned per-head gate scalars and quant-time gate scaling.
parameters: {"init_std":0.005,"quant_gate":true}
depth recurrence
Loop4-5-style depth recurrence: layers 4 and 5 are executed repeatedly in sequence, reusing the same weights on each pass.
parameters: {"layers":[4,5]}
weight tying
CaseOps tokenizer with shared vocabulary/operator tokens is mentioned, but no explicit model weight tying is described.
parameters: null
Quantization
GPTQ
bits: null
scope: model weights
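For reference, the core of a GPTQ-style pass over one weight matrix, sketched in PyTorch. The bit-width is not recorded above (`bits: null`), so `bits=4` and the symmetric per-row scale are placeholders; an actual run would use the GPTQ reference implementation with calibration data.

```python
import torch

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Quantize W [out, in] column by column, compensating each column's
    rounding error on the not-yet-quantized columns via the inverse Hessian
    (H ~ 2 X X^T accumulated from calibration activations)."""
    W = W.clone().float()
    n = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(n, device=W.device)  # dampen for stability
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
    maxq = 2 ** bits - 1
    scale = W.abs().amax(dim=1) * 2 / maxq               # symmetric per-row scale
    Q = torch.empty_like(W)
    for i in range(n):
        w, d = W[:, i], Hinv[i, i]
        q = (torch.clamp(torch.round(w / scale + maxq / 2), 0, maxq) - maxq / 2) * scale
        Q[:, i] = q
        err = (w - q) / d
        W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)  # error propagation
    return Q
```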
LR Schedule
warmdown
parameters: {"min_lr":0.1}
Regularization
logit softcap
parameters: {"fused_softcapped_ce":true}
Other
other
CaseOps reversible case preprocessing with operator tokens, plus a per-token byte sidecar so BPB is accounted against the original bytes.
parameters: {"tokenizer":"CaseOps","byte_sidecar":true}

Novel Contributions

  • Sparse attention-output gate replacing the dense gated attention path
  • Updated frozen recurrent carry with re-learned and 2-decimal-quantized alpha/beta coefficients (a minimal sketch follows this list)
  • Phased score-first LoRA TTT stack
  • CaseOps reversible tokenizer and original-byte BPB accounting
  • Quant-time gate scaling to recover artifact overhead
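A minimal sketch of the frozen-carry contribution above. How the carry mixes into the recurrent loop, and the coefficient values, are assumptions; only the re-learn, round-to-two-decimals, then-freeze recipe comes from the record.

```python
import torch
import torch.nn as nn

class FrozenCarry(nn.Module):
    """Recurrent carry across depth-recurrent passes: h <- alpha * h + beta * f(h).
    alpha/beta are re-learned, rounded to two decimals, and then frozen
    (stored as buffers so they ship at fixed values in the artifact).
    The 0.85 / 0.55 defaults are placeholders, not the PR's coefficients."""
    def __init__(self, alpha=0.85, beta=0.55):
        super().__init__()
        self.register_buffer("alpha", torch.tensor(round(alpha, 2)))
        self.register_buffer("beta", torch.tensor(round(beta, 2)))

    def forward(self, h, f_out):
        return self.alpha * h + self.beta * f_out
```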