PR #1766

open

SP8192 + CaseOps + Loop345 + Recur-Alpha + PhasedTTT

by tashapais
val_bpb
1.0655
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~16MB

Training Techniques

Architecture
depth recurrence
3-layer looped recurrence over layers 3, 4, and 5 with 2 loops, creating 17 virtual layers; activates at 35% of training.
parameters: {"layers":3,"loops":2,"virtual_layers":17,"activate_at":35}
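A minimal sketch of the looping pattern, assuming the listed parameters mean a contiguous 3-layer block is re-applied for a total of 2 passes. The layer functions, physical depth, and indices below are toy stand-ins, not the PR's actual modules, and the activation schedule (35% of training) is omitted:

```python
# Hypothetical depth-recurrence sketch: a 3-layer block is looped, so extra
# passes over it yield more "virtual" layers than physical layers.

def run_with_depth_recurrence(x, layers, loop_start=3, loop_len=3, loops=2):
    """Apply `layers` in order, repeating layers[loop_start:loop_start+loop_len]
    `loops` times in total."""
    for layer in layers[:loop_start]:
        x = layer(x)
    block = layers[loop_start:loop_start + loop_len]
    for _ in range(loops):                 # 2 passes over the 3-layer block
        for layer in block:
            x = layer(x)
    for layer in layers[loop_start + loop_len:]:
        x = layer(x)
    return x

# Toy layers: each adds 1, so the output counts layer applications.
layers = [lambda x: x + 1 for _ in range(12)]
out = run_with_depth_recurrence(0, layers)
# 12 physical applications + 3 extra from the second loop = 15 virtual layers
```

With the PR's real physical depth the same scheme yields the stated 17 virtual layers; the 12-layer toy above is only to make the counting visible.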
Gated Attention
Per-head sigmoid output gate for attention.
parameters: {"init_std":0.01}
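A sketch of one plausible form of the per-head gate: a learned per-head projection of the layer input, squashed through a sigmoid and multiplied onto that head's attention output. The small `init_std` keeps gate logits near 0, i.e. gates near 0.5, at init. The vector shapes and gate parameterization here are assumptions:

```python
import math

# Hypothetical per-head sigmoid output gate on attention head outputs.
def gated_heads(x, head_outputs, gate_weights):
    """x: layer input vector; head_outputs: one output vector per head;
    gate_weights: one learned weight vector per head (init std ~0.01)."""
    gated = []
    for h_out, w in zip(head_outputs, gate_weights):
        logit = sum(xi * wi for xi, wi in zip(x, w))   # scalar gate logit
        g = 1.0 / (1.0 + math.exp(-logit))             # sigmoid in (0, 1)
        gated.append([g * v for v in h_out])
    return gated

x = [1.0, -1.0]
heads = [[2.0, 2.0], [4.0, 4.0]]
weights = [[0.0, 0.0], [0.0, 0.0]]   # near-zero init -> gates ~ 0.5
out = gated_heads(x, heads, weights)
# each head scaled by sigmoid(0) = 0.5 -> [[1.0, 1.0], [2.0, 2.0]]
```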
CaseOps
Bijective, lossless case preprocessing using operator tokens and a byte sidecar, so BPB is computed on the original UTF-8 bytes.
parameters: null
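A toy sketch of the bijective idea: lowercase the text and emit an explicit uppercase-marker token before each character that was uppercase, so the transform is exactly invertible. The marker byte and single-character scheme are illustrative, not the PR's actual operator tokens, and the sketch assumes the marker never occurs in the input (and ignores Unicode characters whose case round trip is lossy):

```python
# Hypothetical lossless case transform with an uppercase-marker token.
UP = "\x0e"  # marker: the next character was uppercase in the original

def case_encode(text):
    out = []
    for ch in text:
        if ch.isupper():
            out.append(UP)          # emit case-operator token
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def case_decode(encoded):
    out, upper_next = [], False
    for ch in encoded:
        if ch == UP:
            upper_next = True
        else:
            out.append(ch.upper() if upper_next else ch)
            upper_next = False
    return "".join(out)

s = "Hello World"
assert case_decode(case_encode(s)) == s   # exactly invertible round trip
```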
Parallel Residuals
GPT-J-style parallel residual connections from layer 8 onward.
parameters: {"start_layer":8}
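The GPT-J parallel form can be contrasted with the sequential form in a scalar toy: attention and MLP both read the same block input and their outputs sum onto one residual stream, instead of the MLP reading the attention-updated state. The stand-in sublayers below are arbitrary linear maps, and normalization is omitted:

```python
# Sequential (standard) vs. parallel (GPT-J-style) residual blocks.
def sequential_block(x, attn, mlp):
    x = x + attn(x)          # MLP sees the attention-updated state
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    return x + attn(x) + mlp(x)   # both branches see the same input

attn = lambda x: 2 * x
mlp = lambda x: 3 * x
seq = sequential_block(1, attn, mlp)   # (1 + 2) + 3*3 = 12
par = parallel_block(1, attn, mlp)     # 1 + 2 + 3 = 6
```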
QK-Gain
Learned per-head query scalar gain.
parameters: {"gain":5}
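A sketch of one plausible reading of QK-gain: each head owns a learned scalar that multiplies its queries before the Q·K dot product, letting heads learn their own attention-logit temperature. The initial value 5 mirrors the listed parameter; the per-dimension score layout is an assumption:

```python
# Hypothetical per-head query gain applied before the Q.K dot product.
def scaled_scores(q, k, gain):
    """Per-dimension logit contributions of one head, scaled by its gain."""
    return [gain * qi * ki for qi, ki in zip(q, k)]

q, k = [1.0, 2.0], [0.5, 0.5]
scores = scaled_scores(q, k, gain=5.0)
# gain scales every logit term: [2.5, 5.0]
```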
Recur-Alpha
Learned scalar carry per looped block that adds a weighted copy of the first-visit activation to later recurrence passes.
parameters: {"init":0,"scalars":3}
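The description above can be sketched directly: each looped layer keeps one learned scalar (init 0, so 3 scalars for the 3-layer block) and, on recurrence passes after the first, adds that scalar times the activation saved from the first visit. Layer functions are toy stand-ins:

```python
# Hypothetical Recur-Alpha carry inside a looped block.
def looped_block(x, block, alphas, loops=2):
    first_visit = [None] * len(block)
    for pass_idx in range(loops):
        for i, layer in enumerate(block):
            x = layer(x)
            if pass_idx == 0:
                first_visit[i] = x                   # save first-visit activation
            else:
                x = x + alphas[i] * first_visit[i]   # learned weighted carry
    return x

block = [lambda x: x + 1, lambda x: x + 1, lambda x: x + 1]
base = looped_block(0, block, [0.0, 0.0, 0.0])    # zero init: plain looping
carried = looped_block(0, block, [1.0, 0.0, 0.0])  # nonzero carry changes output
```

With all alphas at their init of 0, the mechanism is an exact no-op, which makes it safe to enable mid-training.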
Quantization
GPTQ
bits: 6
scope: matrices; embeddings int8
int8
bits: 8
scope: attn_gate_w per-row
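The per-row int8 scope can be illustrated with plain round-to-nearest symmetric quantization: each row gets its own scale `max|w| / 127`. Note actual GPTQ additionally uses second-order error compensation when choosing the quantized values, which this sketch omits:

```python
# Per-row symmetric int8 quantization (round-to-nearest sketch, not GPTQ).
def quantize_rows_int8(rows):
    quantized = []
    for row in rows:
        scale = max(abs(v) for v in row) / 127 or 1.0  # avoid zero scale
        q = [round(v / scale) for v in row]            # ints in [-127, 127]
        quantized.append((q, scale))
    return quantized

def dequantize(quantized):
    return [[qi * scale for qi in q] for q, scale in quantized]

w = [[0.5, -1.0], [2.0, 0.25]]
w_hat = dequantize(quantize_rows_int8(w))
# w_hat matches w to within half a quantization step per row
```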
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"row_normalized":true,"combined_with":"AdamW"}
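A hedged sketch of the listed `row_normalized` behavior: the momentum-smoothed gradient is normalized per row before the weight step. Real Muon orthogonalizes the whole update matrix via Newton-Schulz iterations; plain row normalization below is a deliberate simplification, and the hyperparameters are illustrative:

```python
import math

# Simplified Muon-like step: momentum, then per-row normalization of the update.
def muon_like_step(weights, grads, momentum_buf, lr=0.02, beta=0.95):
    for r, grad_row in enumerate(grads):
        momentum_buf[r] = [beta * m + g for m, g in zip(momentum_buf[r], grad_row)]
        norm = math.sqrt(sum(m * m for m in momentum_buf[r])) or 1.0
        weights[r] = [w - lr * m / norm for w, m in zip(weights[r], momentum_buf[r])]
    return weights

w = [[1.0, 0.0]]
buf = [[0.0, 0.0]]
w = muon_like_step(w, [[3.0, 4.0]], buf, lr=0.1)
# update direction has unit row norm: step = 0.1 * [3/5, 4/5]
```

The `combined_with: AdamW` field suggests non-matrix parameters (embeddings, gains, scalars) take ordinary AdamW steps instead.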
Test-Time Training
LoRA TTT
parameters: {"mode":"score-first","reset_per_doc":true,"lr_decay":"cosine"}
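The three listed parameters describe a per-document loop, sketched below: each document is scored with fresh (zeroed) adapters first, then the LoRA parameters are fit on that document with a cosine-decayed learning rate, and the adapters are reset before the next document. The stand-in `score_fn`/`adapt_step_fn` closures and step counts are hypothetical, not the PR's model or LoRA implementation:

```python
import math

# Hypothetical phased LoRA TTT loop: score-first, cosine lr decay, per-doc reset.
def ttt_over_docs(docs, score_fn, adapt_step_fn, steps=4, lr_max=1e-3):
    scores = []
    for doc in docs:
        adapter = 0.0                             # reset_per_doc: fresh adapter
        scores.append(score_fn(doc, adapter))     # score BEFORE adapting
        for t in range(steps):
            lr = 0.5 * lr_max * (1 + math.cos(math.pi * t / steps))  # cosine decay
            adapter = adapt_step_fn(doc, adapter, lr)
    return scores

score_fn = lambda doc, adapter: len(doc) + adapter
adapt_step_fn = lambda doc, adapter, lr: adapter + lr
scores = ttt_over_docs(["ab", "c"], score_fn, adapt_step_fn)
# adapters never leak across documents, so each score reflects frozen weights
```

Scoring before adapting keeps the reported BPB honest: the model never sees a document's own bytes before being evaluated on them.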
Compression
brotli
level: null
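The byte-shuffle half of the pipeline (named under Novel Contributions) can be sketched as a lossless byte transpose: group the i-th bytes of fixed-width records together so similar-magnitude bytes sit adjacently, which entropy coders exploit. `zlib` stands in for Brotli below purely to keep the example stdlib-only; the PR compresses with Brotli, and the record width is an assumption:

```python
import zlib

# Byte-shuffle: reorder bytes of `width`-byte records as all byte-0s,
# then all byte-1s, etc. Exactly invertible.
def byte_shuffle(data, width):
    n = len(data) // width
    return bytes(data[j * width + i] for i in range(width) for j in range(n))

def byte_unshuffle(data, width):
    n = len(data) // width
    return bytes(data[i * n + j] for j in range(n) for i in range(width))

raw = bytes([1, 0, 2, 0, 3, 0, 4, 0])        # little-endian uint16 values
shuffled = byte_shuffle(raw, 2)              # low bytes first, then high bytes
assert byte_unshuffle(shuffled, 2) == raw    # lossless round trip
packed = zlib.compress(shuffled)             # stand-in for Brotli
```

After the shuffle, the run of identical high bytes compresses much better than the original interleaved layout.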
LR Schedule
cosine decay
parameters: {"phase":"TTT"}
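The cosine decay used for the TTT phase is standard; a minimal form, with `lr_min=0` and the step counts below chosen only for illustration:

```python
import math

# Cosine decay from lr_max to lr_min over `total` steps (half-cosine).
def cosine_lr(step, total, lr_max, lr_min=0.0):
    progress = step / total
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

lrs = [cosine_lr(s, 10, 1.0) for s in (0, 5, 10)]
# full lr at step 0, half at the midpoint, lr_min at the end
```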

Novel Contributions

  • Recur-Alpha: a learned scalar carry added to recurrent looped blocks
  • First composition of Recur-Alpha with the CaseOps + phased TTT stack
  • 3-scalar carry mechanism across the looped blocks in depth recurrence
  • Byte-shuffle + Brotli compression