PR #1812
openRecords: SP8192 + LegalTTT 4ep — 1.0729 (Δ -0.0081 vs 04-09, p<1e-7)
by EthanNing
val_bpb: 1.0729
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.00 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
depth recurrence
Layer recurrence over a subset of layers during training.
parameters: {"layers":[3,5],"num_loops":2}
parallel residuals
Parallel residual pathway introduced in later layers.
parameters: {"start_layer":7}
XSA
Exclusive self-attention: subtracts the normalized value projection from the attention output.
parameters: null
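The description is terse, so the sketch below is one reading only: standard causal attention whose pre-projection output has each token's normalized value vector subtracted, excluding the token's own contribution:

```python
import torch.nn.functional as F

def xsa(q, k, v, w_o):
    # q, k, v: (B, H, T, head_dim); w_o: output projection Linear
    attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = attn - F.normalize(v, dim=-1)   # subtract each token's normalized V
    B, H, T, D = out.shape
    return w_o(out.transpose(1, 2).reshape(B, T, H * D))
```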
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"negative_slope":0.5}
Gated Attention
Per-head attention-output sigmoid gate.
parameters: {"gate_width":12}
Regularization
weight decay
parameters: {"mlp":0.115,"attn":0.095}
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
Compression
lzma
level: null
brotli
level: 11
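The record lists lzma (level unspecified) and brotli at quality 11; a sketch of compressing the GPTQ-packed bytes, with keep-the-smaller as an assumed selection rule:

```python
import lzma
import brotli  # pip install brotli

def compress_artifact(packed: bytes) -> bytes:
    candidates = [
        lzma.compress(packed),                 # level left at the default
        brotli.compress(packed, quality=11),
    ]
    return min(candidates, key=len)            # keep whichever codec wins
```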
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"row_normalized":true,"newton_schulz_steps":5,"nesterov":true}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Evaluation
sliding window eval
parameters: {"causal":true}
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.005,"chunk_size":32000,"momentum":0.9,"nesterov":true}
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
cosine decay
parameters: {"applied_to":"TTT per-chunk LR"}
Sequence Length
sequence_length
train_length: 32000
eval_length: 32000
Novel Contributions
- Score-first legal test-time training with 4 epochs per chunk
- Split weight decay with stronger regularization on MLP matrices than on attention
- Per-head attention-output gating
- Continuation of the SP8192 + depth recurrence + parallel residuals + QK-Gain stack
- GPTQ SDClip quantization with byte-shuffle and Brotli compression