PR #1714
openSP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.5 + SGD-TTT [LoRA-TTT Future Work]
by Anakintano
val_bpb
1.0857
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.97 MB
Training Techniques
Architecture
depth recurrence
Layers 3, 4, and 5 are each run two extra times (three passes per layer), yielding 17 effective layers from 11 physical layers.
parameters: {"layers":[3,4,5],"loops":2}
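A minimal sketch of the execution schedule this implies, assuming "loops: 2" means two extra passes over the looped block (which is what makes 11 physical layers come out to 17 effective ones); the function name is illustrative, not the PR's code:

```python
def build_schedule(n_layers=11, looped=(3, 4, 5), loops=2):
    """Return the order in which physical layer indices are executed."""
    schedule = []
    for i in range(n_layers):
        schedule.append(i)
        if i == looped[-1]:
            # after finishing the looped block, repeat it `loops` more times
            for _ in range(loops):
                schedule.extend(looped)
    return schedule
```

With the defaults this produces 17 entries, with layers 3, 4, and 5 each appearing three times.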
U-Net skip connections
Learned skip gates add earlier-layer activations back into mirrored later layers, U-Net style.
parameters: null
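A toy sketch of gated U-Net skips, assuming the common scheme where each "decoder" layer receives the activation of its mirrored "encoder" layer scaled by a learned gate; layer and gate values here are placeholders:

```python
def unet_forward(x, encoder_layers, decoder_layers, gates):
    """Run encoder layers, stash activations, feed them back gated into decoders."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    # deepest skip pairs with the first decoder layer (U-Net mirroring)
    for layer, gate in zip(decoder_layers, gates):
        x = layer(x + gate * skips.pop())
    return x
```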
GQA
Grouped-query attention: 8 query heads share 4 KV heads (2 query heads per KV head).
parameters: {"heads":8,"kv_heads":4}
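An illustrative (bidirectional, unmasked) sketch of the head grouping, showing only how each query head is mapped onto its shared KV head:

```python
import numpy as np

def gqa(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: q is (heads, T, d); k and v are (kv_heads, T, d)."""
    group = heads // kv_heads
    out = np.empty_like(q)
    for h in range(heads):
        kv = h // group  # map query head -> its shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)  # row-wise softmax
        out[h] = w @ v[kv]
    return out
```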
Partial RoPE
Rotary position embeddings applied to only a 16-dimensional slice of each head; the remaining dimensions receive no positional rotation.
parameters: {"dimensions":16}
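A sketch of partial RoPE in the common half-split (rotate-half) convention; only the dimension count comes from the entry, and the base frequency and split convention are assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate dims [0, rot_dims) of x (shape (T, d)); pass the rest through."""
    T, d = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```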
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
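One plausible sign-preserving form of a LeakyReLU-squared activation with the entry's slope of 0.5; whether the PR squares with or without preserving the negative branch's sign is an assumption:

```python
def leaky_relu_squared(x, slope=0.5):
    """Apply LeakyReLU, then square while keeping the output's sign."""
    y = x if x >= 0.0 else slope * x
    return y * abs(y)  # sign-preserving square
```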
weight tying
Input and output embeddings are tied.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: weight matrices
GPTQ
bits: 8
scope: embeddings
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"row_normalized":true}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"used_for":"scalars/embeddings"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
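The EMA update this implies, as a one-liner sketch over flat weight lists (the decay is from the entry; applying it after every optimizer step is an assumption):

```python
def ema_update(shadow, weights, decay=0.9965):
    """Move the shadow (EMA) copy a small step toward the live weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```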
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
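A sketch of how stride-64 sliding-window evaluation partitions a token stream: each 64-token chunk is scored once, with up to 2048 tokens of window behind it. The exact windowing convention in the PR is an assumption:

```python
def window_spans(n_tokens, context=2048, stride=64):
    """Return (window_start, scored_from, window_end) triples covering all tokens."""
    spans = []
    for scored_from in range(0, n_tokens, stride):
        window_end = min(scored_from + stride, n_tokens)
        window_start = max(0, window_end - context)  # clip window to context
        spans.append((window_start, scored_from, window_end))
    return spans
```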
Test-Time Training
SGD TTT
parameters: {"epochs_per_chunk":3,"momentum":0.9,"score_before_update":true}
LoRA TTT
parameters: {"rank_qv":4,"rank_mlp_gate":2,"epochs_per_chunk":12,"frozen_base":true}
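A minimal sketch of a frozen-base LoRA layer of the kind this future-work item describes: only the low-rank factors would be trained at test time, and resetting them between documents restores the base model exactly. Class and attribute names are illustrative:

```python
import numpy as np

class LoRALinear:
    def __init__(self, weight, rank=4, scale=1.0):
        self.weight = weight                   # frozen base, shape (out, in)
        out_dim, in_dim = weight.shape
        self.A = np.zeros((out_dim, rank))               # trainable, zero-init
        self.B = np.random.randn(rank, in_dim) * 0.01    # trainable
        self.scale = scale

    def __call__(self, x):
        # base path plus low-rank correction; only A and B would get gradients
        return x @ self.weight.T + self.scale * (x @ self.B.T @ self.A.T)
```

Zero-initializing `A` means the adapter starts as an exact no-op, a standard LoRA convention.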
LR Schedule
cosine decay
parameters: {"warmup_steps":20,"warmdown_fraction":0.72}
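A sketch of the schedule these parameters suggest: linear warmup for 20 steps, then a flat plateau, then cosine decay over the final 72% of training. The peak LR and the existence of a plateau between warmup and warmdown are assumptions:

```python
import math

def lr_at(step, total_steps, peak_lr=1.0, warmup_steps=20,
          warmdown_fraction=0.72):
    """Linear warmup -> constant plateau -> cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = int(total_steps * (1.0 - warmdown_fraction))
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```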
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Novel Contributions
- LoRA-TTT with frozen base model and low-rank adapters updated during test-time training
- Recur-Alpha learned carry scalar for recurrent blocks
- QK-Gain 5.5 per-head query scaling
- 3-layer depth recurrence with parallel residuals
- SGD-TTT fallback with score-before-update compliance
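Of the contributions above, QK-Gain is the most self-contained to sketch: a per-head scalar gain (5.5 in the title) multiplies the queries before the attention dot product. Whether the gain is applied before or after the 1/sqrt(d) factor, and whether it is learned or fixed, is an assumption here:

```python
import numpy as np

def apply_qk_gain(q, gains):
    """Scale queries per head: q is (heads, T, d), gains is (heads,)."""
    return q * gains[:, None, None]
```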