PR #1771

open

Record: SP8192 CaseOps + V13 Curriculum + SmearGate + LoRA-TTT — val_bpb 1.06513 (3-seed mean)

val_bpb: 1.0651
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.98 MB

Training Techniques

Architecture
depth recurrence
Phased recurrence-depth curriculum with evaluation at depth 4.
parameters: {"layers":11,"depths":[1,3,4]}
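A minimal sketch of how a phased depth schedule over depths 1, 3, 4 could be expressed. The equal-length phases, the function name `recurrence_depth`, and the `total_steps` argument are assumptions for illustration; the record does not state the phase boundaries.

```python
# Hypothetical schedule: training is split into equal phases, one per depth.
EVAL_DEPTH = 4  # evaluation depth stated in the record

def recurrence_depth(step, total_steps, depths=(1, 3, 4)):
    """Return the recurrence depth used at a given training step."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    phase = step * len(depths) // total_steps  # index of the current phase
    return depths[phase]
```

With `total_steps=300`, steps 0-99 run at depth 1, steps 100-199 at depth 3, and the rest at depth 4, which matches the evaluation depth.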
SmearGate
Per-layer learned smoothing gate blending adjacent token representations.
parameters: null
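One way to read "blending adjacent token representations" is a gated mix of each token with its predecessor. The sketch below assumes a single scalar gate per layer and a convex blend; both are assumptions, and the actual gate shape and formula in this PR may differ.

```python
import math

def smear(tokens, gate_logit):
    """Blend each token vector with the previous one using a sigmoid gate.

    tokens: list of equal-length float lists, one per position.
    gate_logit: learned scalar (hypothetical: one per layer).
    """
    if not tokens:
        return []
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # gate in (0, 1)
    out = [list(tokens[0])]  # first token has no predecessor
    for prev, cur in zip(tokens, tokens[1:]):
        out.append([(1.0 - g) * c + g * p for c, p in zip(cur, prev)])
    return out
```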
Gated Attention
Full-dimension attention output gating with QuantGate passthrough.
parameters: null
LeakyReLU
Applies LeakyReLU and squares the result.
parameters: {"slope":0.5}
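As a scalar illustration, assuming the squaring is applied to the LeakyReLU output and the sign is not restored afterwards (the record does not say either way):

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y
```

With slope 0.5, an input of -2.0 becomes -1.0 after LeakyReLU and 1.0 after squaring, so negative inputs still contribute a reduced, non-negative activation.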
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
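A sketch of rotating only 16 of the 64 head dimensions. The half-split pairing (dimension i with i + 8), the choice of the first 16 dimensions, and the base of 10000 are assumptions; the PR may use a different layout.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched."""
    if rotary_dims % 2 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and at most len(vec)")
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s          # 2-D rotation of the pair (a, b)
        out[i + half] = a * s + b * c
    return out
```

The remaining 48 dimensions carry no positional signal, and the rotation preserves the vector norm.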
weight tying
Input and output embeddings share one weight matrix.
parameters: null
Quantization
GPTQ
bits: 6
scope: attention/MLP
GPTQ
bits: 7
scope: embeddings
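To make the bit widths concrete, the sketch below quantizes a weight row to signed 6-bit or 7-bit integer codes with a per-row scale. It uses plain round-to-nearest, not GPTQ's error-compensating procedure, and the per-row symmetric scheme is an assumption rather than this PR's configuration.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one weight row (not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits, 63 for 7 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```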
Test-Time Training
LoRA TTT
parameters: {"alpha":144,"rank":96,"weight_decay":1,"warm_start_A":true,"phased":true,"prefix_docs":2000,"num_phases":3}
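With alpha 144 and rank 96, the usual LoRA scaling factor alpha / rank works out to 1.5. The sketch below shows where that factor enters a LoRA forward pass; the helper names are hypothetical, and the warm-start and phase settings are not modelled.

```python
LORA_ALPHA = 144.0
LORA_RANK = 96
LORA_SCALE = LORA_ALPHA / LORA_RANK  # 1.5

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, scale=LORA_SCALE):
    """Compute W x + scale * B (A x), with frozen W and trainable A, B."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]
```

When `B` is all zeros the adapter contributes nothing and the output equals the frozen base layer.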
Optimizer
AdamW
weight_decay: 1
momentum: null
other_params: {"learning_rate":0.0001}
Regularization
logit softcap
parameters: {"value":30}
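Logit softcapping is commonly written as cap * tanh(logit / cap). Assuming this PR uses that form with cap 30:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly squash a logit into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while large ones saturate near plus or minus 30.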
layerwise LN scale
parameters: null

Novel Contributions

  • SP8192 CaseOps reversible case normalization tokenizer
  • Recurrence depth curriculum with prewarmed depths 3 and 4
  • SmearGate combined with Gated Attention
  • QuantGate-enabled GPTQ passthrough for gates
  • LoRA TTT improvements including alpha/rank scaling and warm-start A
  • Phased score-first TTT without rescoring
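The "score-first, no rescoring" idea in the last item can be sketched as a loop in which each document is scored with the current adapter state before that state is updated on it. The `score` and `adapt` callables and the `state` object are hypothetical placeholders, and the three-phase structure is not modelled here.

```python
def score_first_ttt(docs, score, adapt, state):
    """Score each document once, then adapt on it; earlier scores are never revised."""
    losses = []
    for doc in docs:
        losses.append(score(state, doc))  # evaluated before the model has trained on doc
        state = adapt(state, doc)         # update happens only after scoring
    return losses, state
```

Because every loss is recorded before the corresponding update, adaptation on a document cannot influence that document's reported score.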