PR #1771
open
Record: SP8192 CaseOps + V13 Curriculum + SmearGate + LoRA-TTT — val_bpb 1.06513 (3-seed mean)
by bigbag
val_bpb
1.0651
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.98 MB
Training Techniques
Architecture
depth recurrence
Phased recurrence-depth curriculum with evaluation at depth 4.
parameters: {"layers":11,"depths":[1,3,4]}
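A minimal sketch of how a phased depth curriculum over depths [1, 3, 4] could be scheduled. The equal-length phases, the function name `depth_for_step`, and the step counts are assumptions for illustration; the PR summary does not state the phase boundaries or how depths 3 and 4 are prewarmed.

```python
EVAL_DEPTH = 4  # the summary states that evaluation runs at depth 4


def depth_for_step(step, total_steps, depths=(1, 3, 4)):
    """Return the recurrence depth for a training step.

    Assumption: training is split into len(depths) equal-length phases,
    one per entry of `depths`, visited in order.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    phase = step * len(depths) // total_steps
    return depths[phase]
```

With 900 total steps this gives depth 1 for steps 0 to 299, depth 3 for steps 300 to 599, and depth 4 for the rest.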
SmearGate
Per-layer learned smoothing gate blending adjacent token representations.
parameters: null
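The blending described above might look like the following sketch. It assumes a causal variant (each token mixes with its predecessor only) and a per-dimension sigmoid gate; the function name `smear_gate` and these details are illustrative, not taken from the PR.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smear_gate(x, gate_logits):
    """Blend each token vector with the previous token's vector.

    x: list of token vectors (seq_len x dim).
    gate_logits: learned per-dimension logits (dim); assumed shape.
    """
    g = [sigmoid(z) for z in gate_logits]
    out = []
    for t, vec in enumerate(x):
        prev = x[t - 1] if t > 0 else vec  # first token has no predecessor
        out.append([(1.0 - gi) * cur + gi * p for gi, cur, p in zip(g, vec, prev)])
    return out
```

A strongly negative gate logit leaves the representation nearly unchanged, so the layer can learn to disable smoothing where it does not help.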
Gated Attention
Full-dimension attention output gating with QuantGate passthrough.
parameters: null
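As a rough illustration of full-dimension output gating, the sketch below scales every attention output channel by its own sigmoid gate computed from the layer input. The gate parameterization (`gate_weight`, `gate_bias`) is an assumption, and the QuantGate passthrough is not modeled here because the summary does not describe it.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_attention_output(attn_out, x, gate_weight, gate_bias):
    """Multiply each attention output channel by a gate in (0, 1).

    attn_out: attention output for one token (dim).
    x: layer input for the same token (dim_in).
    gate_weight: dim x dim_in matrix; gate_bias: dim vector (both hypothetical).
    """
    gated = []
    for i, a in enumerate(attn_out):
        logit = sum(w * xj for w, xj in zip(gate_weight[i], x)) + gate_bias[i]
        gated.append(a * sigmoid(logit))
    return gated
```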
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"slope":0.5}
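One plausible reading of "LeakyReLU squared" with slope 0.5 is to apply LeakyReLU and then square the result, as sketched below. Whether the PR keeps the sign of negative inputs after squaring is not stated, so this plain square is an assumption.

```python
def leaky_relu_squared(x, slope=0.5):
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Under this reading, an input of -2 maps to (0.5 * -2)^2 = 1, while an input of 2 maps to 4.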
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
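The sketch below shows what rotating 16 of 64 head dimensions could look like for a single query or key vector. It assumes the rotated subset is the first 16 dimensions, that adjacent dimensions form rotation pairs, and that the frequency base is 10000; none of these layout choices are given in the summary.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest unchanged."""
    if rotary_dims % 2 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and at most len(vec)")
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # Each pair (i, i + 1) is rotated by an angle that grows with position.
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

The remaining 48 dimensions carry no positional rotation, so they behave like position-independent content channels.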
weight tying
Input and output embeddings share one weight matrix.
parameters: null
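A toy illustration of weight tying: one matrix serves both as the token embedding table and as the output projection. The class name `TiedEmbedding` is hypothetical.

```python
class TiedEmbedding:
    """Share a single vocab_size x dim matrix between input and output."""

    def __init__(self, weight):
        self.weight = weight  # the only copy; used by both methods below

    def embed(self, token_id):
        """Input side: look up the row for a token."""
        return self.weight[token_id]

    def logits(self, hidden):
        """Output side: dot the hidden state with every row of the same matrix."""
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]
```

Because the matrix is stored once, tying reduces the parameter count, which matters under an artifact size budget like the ~15.98 MB reported here.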
Quantization
GPTQ
bits: 6
scope: attention/MLP
GPTQ
bits: 7
scope: embeddings
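To make the two bit widths concrete, the sketch below quantizes a weight row onto a symmetric signed integer grid. This is plain round-to-nearest, not GPTQ: GPTQ additionally compensates rounding error using second-order statistics from calibration data, which is omitted here. The per-row scale and the names `quantize_row` and `BITS_BY_SCOPE` are assumptions.

```python
# Bit widths listed in the summary, keyed by scope.
BITS_BY_SCOPE = {"attention/MLP": 6, "embeddings": 7}


def quantize_row(row, bits):
    """Round-to-nearest symmetric quantization of one row (not GPTQ itself)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits, 63 for 7 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

Each added bit doubles the number of grid levels, so the 7-bit embeddings have roughly half the rounding step of the 6-bit attention and MLP weights at the same scale range.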
Test-Time Training
LoRA TTT
parameters: {"alpha":144,"rank":96,"weight_decay":1,"warm_start_A":true,"phased":true,"prefix_docs":2000,"num_phases":3}
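With alpha 144 and rank 96, the conventional LoRA scaling factor alpha / rank works out to 1.5. The sketch below shows that arithmetic and a standard LoRA forward pass on plain lists. It assumes the PR uses the usual alpha / rank convention, and it does not model warm-start A or the phased schedule.

```python
def lora_scale(alpha=144, rank=96):
    """Conventional LoRA scaling factor; 144 / 96 = 1.5."""
    return alpha / rank


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x, W, A, B, scale):
    """Compute W x + scale * B (A x), where A is rank x d_in and B is d_out x rank."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]
```

When B is all zeros, the adapter contributes nothing and the output equals the frozen base projection, which is the usual starting point for adaptation.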
Optimizer
AdamW
weight_decay: 1
momentum: null
other_params: {"learning_rate":0.0001}
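For reference, a single scalar AdamW update with the listed learning rate (0.0001) and weight decay (1) can be sketched as follows. The beta and epsilon values are common defaults and are assumptions, since the summary does not list them.

```python
import math


def adamw_step(p, g, m, v, t, lr=1e-4, weight_decay=1.0,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    p = p * (1.0 - lr * weight_decay)        # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

With these settings the decay alone shrinks each parameter by a factor of 0.9999 per step, independent of the gradient.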
Regularization
logit softcap
parameters: {"value":30}
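A logit softcap of 30 is commonly implemented as a scaled tanh, which is approximately the identity for small logits and saturates at plus or minus the cap. The sketch below assumes that common form.

```python
import math


def softcap(logit, cap=30.0):
    """Squash a logit smoothly into the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)
```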
layerwise LN scale
parameters: null
Novel Contributions
- SP8192 CaseOps reversible case normalization tokenizer
- Recurrence depth curriculum with prewarmed depths 3 and 4
- SmearGate combined with Gated Attention
- QuantGate-enabled GPTQ passthrough for gates
- LoRA TTT improvements including alpha/rank scaling and warm-start A
- Phased score-first TTT without rescoring
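The "score-first, no rescoring" idea in the last item can be sketched as a loop in which every document is scored with the current weights before the model adapts on it, and recorded scores are never revised. The equal split into phases and the callables `score_fn` and `adapt_fn` are hypothetical; the summary does not say how the 3 phases or the 2000 prefix documents are arranged.

```python
def score_first_ttt(docs, score_fn, adapt_fn, num_phases=3):
    """Score each phase's documents, then adapt on them; never rescore.

    score_fn(doc) returns a score under the current weights.
    adapt_fn(phase_docs) updates the model in place (e.g. LoRA weights).
    """
    if not docs:
        return []
    scores = [None] * len(docs)
    phase_len = -(-len(docs) // num_phases)  # ceiling division
    for start in range(0, len(docs), phase_len):
        phase = range(start, min(start + phase_len, len(docs)))
        for i in phase:
            scores[i] = score_fn(docs[i])  # weights reflect earlier phases only
        adapt_fn([docs[i] for i in phase])
    return scores
```

Because each score is fixed before the model sees that document during adaptation, later updates cannot leak into already-reported results.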