PR #1539
openRecord: SP8192 + Pre-Quant AdamW TTT + Compiled TTT — val_bpb 1.0587 (3-seed mean)
by translatingthename
val_bpb: 1.0587
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.5 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
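The mixed int6/int8 split can be illustrated with plain round-to-nearest quantization. This is only the grid arithmetic: GPTQ proper additionally compensates rounding error column-by-column using second-order (Hessian) information, which is omitted here, so treat this as a sketch of the bit-width budget, not of the GPTQ algorithm.

```python
# Minimal symmetric round-to-nearest quantizer illustrating the int6 grid
# (attention/MLP matrices) vs. the int8 grid (embeddings). GPTQ's
# error-compensation step is deliberately omitted.
def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6, 127 for int8
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.9, -0.31, 0.05, -1.2]
q6, s6 = quantize(w, 6)    # 6-bit: coarser grid, smaller artifact
q8, s8 = quantize(w, 8)    # 8-bit: finer grid for the embeddings
```

The finer 8-bit grid gives a strictly smaller worst-case rounding error on the same weights, which is why the embeddings get the extra bits.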
Architecture
depth recurrence
3-layer recurrence with repeated layers to create virtual depth
parameters: {"layers":3,"virtual_layers":14,"physical_layers":11}
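The 11-physical / 14-virtual split means a 3-layer block is run twice per forward pass. Which three layers repeat is not stated in the record; the sketch below assumes the block starting at layer 4, purely for illustration.

```python
# Depth-recurrence schedule sketch: 11 physical layers unrolled into 14
# virtual layers by running one 3-layer block twice. The repeated block's
# position (layers 4-6 here) is an assumption, not stated in the record.
def recurrence_schedule(physical=11, repeat_start=4, repeat_len=3):
    layers = list(range(physical))
    block = layers[repeat_start:repeat_start + repeat_len]
    # e.g. 0,1,2,3,[4,5,6],[4,5,6],7,8,9,10 -> virtual depth 14
    return layers[:repeat_start + repeat_len] + block + layers[repeat_start + repeat_len:]

schedule = recurrence_schedule()
```

Only 11 layers' worth of parameters are stored, so the artifact size reflects physical depth while the compute graph gets the extra virtual depth.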
Parallel Residuals
GPT-J-style two-lane residual in which attention and MLP both read the same normalized input in parallel and their outputs are summed back into the residual stream
parameters: {"start_layer":7}
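The two-lane layout can be contrasted with a standard sequential block in a few lines. Toy scalar stand-ins replace the real attention/MLP/norm modules; per the parameters, the parallel layout kicks in at layer 7.

```python
# GPT-J-style parallel residual: attention and MLP read the same normalized
# input and both add into the residual stream, vs. the usual attn-then-MLP.
START_LAYER = 7

def parallel_block(x, attn, mlp, norm):
    h = norm(x)
    return x + attn(h) + mlp(h)          # two lanes, merged by addition

def sequential_block(x, attn, mlp, norm):
    x = x + attn(norm(x))
    return x + mlp(norm(x))

def block(layer_idx, x, attn, mlp, norm):
    if layer_idx >= START_LAYER:
        return parallel_block(x, attn, mlp, norm)
    return sequential_block(x, attn, mlp, norm)
```

In the parallel form the MLP no longer sees the attention output, which trades a little expressivity for the two lanes being computable concurrently.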
XSA
XSA applied across all layers for efficient GQA-aware attention
parameters: {"layers":11}
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions
parameters: {"dimensions":16,"total_dimensions":64}
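With 16 of 64 head dimensions rotated, a sketch of partial RoPE looks like the following; the frequency convention (base 10000, pairwise rotation over the first 16 dims) is the common one and assumed here.

```python
import math

# Partial RoPE sketch: rotary embeddings applied to only the first 16 of the
# 64 head dimensions; the remaining 48 pass through untouched.
ROT, HEAD = 16, 64

def partial_rope(vec, pos, base=10000.0):
    out = list(vec)
    for i in range(0, ROT, 2):                 # rotate (i, i+1) pairs
        theta = pos * base ** (-i / ROT)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

Because rotation is norm-preserving, only the relative phase of the first 16 dims carries position; the unrotated 48 dims stay position-agnostic.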
weight tying
Tied input and output embeddings
parameters: null
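Weight tying amounts to reusing the input embedding matrix as the output head, which is also part of why the artifact stays small. A minimal sketch:

```python
# Weight-tying sketch: logits come from dotting the hidden state with each
# row of the input embedding matrix, so no separate unembedding is stored.
def tied_logits(hidden, embedding):            # embedding: vocab x d
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
```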
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"negative_slope":0.5}
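Read literally, "LeakyReLU squared" with negative_slope 0.5 is the square of a leaky ReLU. Note that squaring makes the negative branch positive (0.25·x²); whether the record uses this or a sign-preserving variant is not specified, so treat this as one plausible reading.

```python
# One reading of the "LeakyReLU squared" MLP activation: square the output
# of a leaky ReLU with negative_slope 0.5. The squaring flips the sign of
# the negative branch; a sign-preserving variant is also conceivable.
NEGATIVE_SLOPE = 0.5

def leaky_relu(x, slope=NEGATIVE_SLOPE):
    return x if x >= 0 else slope * x

def leaky_relu_squared(x):
    return leaky_relu(x) ** 2
```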
SmearGate
SmearGate mechanism included in the architecture
parameters: null
U-Net skip connections
Sigmoid-gated U-Net style skip connections
parameters: null
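A sigmoid-gated skip adds an earlier layer's activations into a later layer's stream, scaled by a learned gate. The scalar gate and its zero initialization (giving a gate value of 0.5) are assumptions for illustration.

```python
import math

# Sigmoid-gated U-Net skip sketch: the late stream receives the early
# stream scaled by sigmoid(g), where g is a learned scalar (zero init is
# an assumption; at g=0 the gate passes half of the skip through).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_skip(late, early, g=0.0):
    gate = sigmoid(g)
    return [l + gate * e for l, e in zip(late, early)]
```

Driving g strongly negative lets the model learn to shut a skip off entirely.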
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
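With 8 query heads over 4 KV heads, each KV head serves a group of two consecutive query heads, halving KV-cache size relative to full multi-head attention:

```python
# GQA head mapping: 8 query heads share 4 KV heads, so each KV head serves
# a group of 8 // 4 = 2 consecutive query heads.
HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS      # 2 query heads per KV head

def kv_head_for(query_head):
    return query_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```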
Value Embeddings
44-dimensional value embeddings added at layers 9 and 10
parameters: {"dimension":44,"layers":[9,10]}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":4}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"used_for":"embeddings and scalars","embedding_lr":0.03,"scalar_lr":0.02}
Weight Averaging
EMA
parameters: {"decay":0.9965}
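The EMA update with decay 0.9965 moves a shadow copy of the weights a fraction (1 − decay) toward the current parameters each step; the averaged weights, not the raw ones, go into the final artifact.

```python
# EMA of weights with decay 0.9965. Effective averaging horizon is roughly
# 1 / (1 - decay) ~ 286 steps.
DECAY = 0.9965

def ema_update(shadow, params, decay=DECAY):
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```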
Compression
Brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":64}
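Sliding-window evaluation with stride 64 hops the context window over the sequence in 64-token steps and scores only the final 64 tokens of each window, so every token is scored exactly once with close to full left context. The window length (2048 in the sketch's default) is an assumption for illustration; only the stride is given in the record.

```python
# Sliding-window eval index sketch: each span scores tokens
# [score_start, score_end) using context from context_start onward.
def eval_windows(seq_len, window=2048, stride=64):
    spans = []
    for start in range(0, seq_len, stride):
        lo = max(0, start + stride - window)
        spans.append((lo, start, start + stride))  # (context_start, score_start, score_end)
    return spans
```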
Test-Time Training
full TTT
parameters: {"epochs":6,"learning_rate":0.0005,"freeze_blocks":2,"compiled":true}
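The TTT recipe fine-tunes the model on the evaluation data itself before scoring: 6 epochs at lr 5e-4 with the first 2 blocks frozen (and the loop wrapped in torch.compile in the actual record). Below is a pure-Python stand-in with per-"block" scalar weights and a squared loss, just to show where the three hyperparameters act:

```python
# Toy test-time-training loop: 6 epochs, lr 5e-4, first 2 "blocks" frozen.
# Scalar weights and squared loss stand in for the real model; the record's
# loop is additionally compiled with torch.compile.
EPOCHS, LR, FREEZE_BLOCKS = 6, 5e-4, 2

def ttt(weights, data, targets):
    w = list(weights)
    for _ in range(EPOCHS):
        for x, y in zip(data, targets):
            pred = sum(wi * x for wi in w)
            grad = 2.0 * (pred - y) * x
            for i in range(FREEZE_BLOCKS, len(w)):   # frozen blocks skipped
                w[i] -= LR * grad
    return w
```

"Pre-quant" in the title indicates this adaptation happens before GPTQ quantization, so the adapted weights are what get baked into the artifact.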
LR Schedule
cosine decay
parameters: {"final_lr_factor":0.1}
warmdown
parameters: {"warmdown_fraction":0.72}
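One way to compose the two schedule entries: hold the base LR for the first 28% of steps, then cosine-decay over the final 72% (warmdown_fraction 0.72) down to 0.1× the base LR (final_lr_factor 0.1). How the record actually combines them is not spelled out, so this composition is an assumption.

```python
import math

# LR schedule sketch: flat, then a cosine warmdown over the last 72% of
# training decaying to 0.1x the base LR. The composition of the two listed
# entries (cosine decay + warmdown) is an assumption.
WARMDOWN_FRAC, FINAL_FACTOR = 0.72, 0.1

def lr_at(step, total_steps, base_lr):
    warmdown_start = (1.0 - WARMDOWN_FRAC) * total_steps
    if step < warmdown_start:
        return base_lr
    t = (step - warmdown_start) / (total_steps - warmdown_start)  # 0 -> 1
    cos = 0.5 * (1.0 + math.cos(math.pi * t))                     # 1 -> 0
    return base_lr * (FINAL_FACTOR + (1.0 - FINAL_FACTOR) * cos)
```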
Regularization
weight decay
parameters: {"value":0.095}
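The 0.095 value matches the weight_decay listed under both optimizers, i.e. decoupled weight decay in the AdamW style: weights shrink multiplicatively toward zero each step, separately from the gradient update.

```python
# Decoupled weight decay at 0.095 (AdamW-style): each step shrinks weights
# by lr * wd, independent of the gradient term.
WD = 0.095

def apply_weight_decay(weights, lr):
    return [w * (1.0 - lr * WD) for w in weights]
```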
logit softcap
parameters: {"value":30}
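Logit softcapping at 30 squashes logits through cap · tanh(x / cap), which is near-identity for small logits but bounds every logit to (−30, 30):

```python
import math

# Logit softcap: ~identity near zero, saturates at +/-30 for large logits.
CAP = 30.0

def softcap(x, cap=CAP):
    return cap * math.tanh(x / cap)
```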
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(layer+1)"}
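The stated rule scales each layer's normalization output by 1/sqrt(layer + 1), progressively damping deeper layers' contributions to the residual stream (indices 0..10 for the 11 physical layers):

```python
import math

# Layerwise LN scale per the stated rule: 1/sqrt(layer + 1), monotonically
# decreasing with depth.
def ln_scale(layer):
    return 1.0 / math.sqrt(layer + 1)

scales = [ln_scale(l) for l in range(11)]
```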
Novel Contributions
- Pre-quant AdamW test-time training baked into the final artifact
- Compiled TTT with torch.compile for faster validation fine-tuning
- SP8192 with GPTQ SDClip quantization using mixed int6/int8 precision
- 3-layer depth recurrence producing 14 virtual layers from 11 physical layers
- Parallel residual architecture with GPT-J style two-lane merging
- Combined MuonEq-R training with EMA, warmdown, and tuned QK gain