PR #562
Non-record: 1.1354 BPB — 10L TTT 22ep AdamW Cosine + LeakyReLU(0.5)² + TrigramHash
by bigbag
val_bpb: 1.1354
Architecture: Transformer
Optimizer: Muon (matrices) + AdamW (embeddings/scalars)
Artifact Size: 15.35 MB
Training Techniques
Architecture
- Value Residual: ResFormer-style layer-0 V mixing
- Gated Attention: per-head sigmoid gates
- XSA: cross self-attention on the last 4 layers
- LeakyReLU(0.5)²: squared LeakyReLU with negative_slope=0.5; preserves gradient flow for negative inputs and improves BPB by 0.003 over ReLU²
- TrigramHash: extends BigramHash to a 3-token context via XOR hashing into the shared embedding table
- SmearGate: additional gating mechanism
- LN Scale: depth-scaled residuals
- U-Net skip connections: skip connections inspired by the U-Net architecture
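The LeakyReLU(0.5)² activation above can be sketched as a plain elementwise function (the elementwise application is the standard convention; the negative slope and squaring are as listed):

```python
def leaky_relu_05_squared(x: float) -> float:
    """LeakyReLU with negative_slope=0.5, then squared.

    Unlike ReLU², which is identically zero (with zero gradient) for
    negative inputs, the 0.5 slope keeps gradient flowing on the
    negative side before squaring.
    """
    y = x if x >= 0 else 0.5 * x
    return y * y
```

For x = -2 this gives (0.5 · -2)² = 1, so negative inputs still produce a nonzero output and gradient.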
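A minimal sketch of the TrigramHash idea: the PR states only that a 3-token context is XOR-hashed into the shared embedding table with zero extra parameters; the mixing constants and table size below are assumptions for illustration.

```python
TABLE_SIZE = 50304  # assumed size of the shared embedding table; not given in the PR

def trigram_hash(t0: int, t1: int, t2: int, table_size: int = TABLE_SIZE) -> int:
    """XOR-mix a 3-token context into one index of the shared embedding table.

    Multiplying each token id by a distinct odd constant before XORing
    (so that permuted contexts hash differently) is an assumption; the
    PR only specifies XOR hashing into a shared table.
    """
    h = (t0 * 0x9E3779B1) ^ (t1 * 0x85EBCA77) ^ (t2 * 0xC2B2AE3D)
    return (h & 0xFFFFFFFF) % table_size
```

Because the index lands in the existing table, the 3-token context costs no additional parameters.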
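The U-Net-style skips can be sketched as long connections pairing early layers with late ones; the exact pairing and merge (plain addition below) are assumptions, since the PR only cites U-Net-inspired skip connections.

```python
def forward_with_unet_skips(x, blocks):
    """Run a stack of blocks with U-Net style long skip connections.

    The first half of the stack pushes its outputs onto a stack; the
    second half pops and adds them, pairing layer i with layer n-1-i.
    Additive merging is an assumption for illustration.
    """
    n = len(blocks)
    saved = []
    for i, block in enumerate(blocks):
        if i >= n - n // 2 and saved:   # second half: consume a saved activation
            x = x + saved.pop()
        x = block(x)
        if i < n // 2:                  # first half: store for the long skip
            saved.append(x)
    return x
```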
Optimizer
- Muon: used for matrices, with Newton-Schulz iteration (weight decay and momentum not reported)
- AdamW: used for embeddings/scalars, weight_decay=0, TTT learning rate 0.0005
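The matrices-vs-embeddings/scalars split above amounts to a routing rule over parameters. A sketch of one common convention (the name patterns and the ndim heuristic are assumptions, not taken from the PR):

```python
def split_param_groups(named_params):
    """Route parameters to Muon vs. AdamW by shape and role.

    Heuristic sketch: 2-D hidden-layer matrices go to Muon (which
    orthogonalizes their updates via Newton-Schulz iteration);
    embeddings, the output head, and scalar/vector parameters go to
    AdamW with weight_decay=0, matching the settings listed above.
    Each entry is (name, {"ndim": ...}) to keep the sketch framework-free.
    """
    muon, adamw = [], []
    for name, p in named_params:
        if p["ndim"] >= 2 and "embed" not in name and "head" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```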
Weight Averaging
- SWA: 27 checkpoints averaged
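SWA over the 27 checkpoints is a uniform per-parameter mean. A minimal sketch using a running mean so all checkpoints need not be held in memory at once:

```python
def swa_average(checkpoints):
    """Uniformly average parameter dicts from several checkpoints (SWA).

    Incremental running mean: after the k-th checkpoint, avg holds the
    arithmetic mean of the first k values for every parameter name.
    """
    avg = {}
    for k, ckpt in enumerate(checkpoints, start=1):
        for name, value in ckpt.items():
            if name not in avg:
                avg[name] = value
            else:
                avg[name] += (value - avg[name]) / k
    return avg
```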
Quantization
- mixed int5 (MLP) / int6 (attention) + GPTQ-lite per-row clip search + 3% magnitude pruning + FP16 passthrough for embeddings + zstd-22 compression
- scope: MLP, attention, embeddings
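The per-row clip search can be sketched as symmetric quantization over a small grid of clipping fractions, keeping the clip with the lowest reconstruction error. The grid and squared-error metric are assumptions; the PR states only a GPTQ-lite per-row clip search with int5 (MLP) / int6 (attention).

```python
def quantize_row(row, bits, clip_grid=(1.0, 0.9, 0.8, 0.7)):
    """Symmetric per-row quantization with a small clip search.

    For each candidate clip (a fraction of the row's max |w|), quantize
    to signed `bits`-bit integers and keep the clip minimizing the
    squared reconstruction error for that row.
    """
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in row) or 1.0
    best = None
    for frac in clip_grid:
        scale = (frac * amax) / qmax
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
        err = sum((w - qi * scale) ** 2 for w, qi in zip(row, q))
        if best is None or err < best[0]:
            best = (err, q, scale)
    _, q, scale = best
    return q, scale
```

Clipping trades a little error on the largest weights for finer resolution on the rest; the search picks that trade-off per row.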
Compression
- zstd, level 22
Test-Time Training
- full TTT: 22 epochs, AdamW, learning rate 0.0005, weight_decay 0, per-step cosine decay to 0
- per-layer LR groups: output projections ×3, input projections ×0.5
- batch size 32 per GPU, all_reduce gradient sync per step, gradient clipping at 1.0
- TTT time: 406 s, eval time: 197 s
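The per-layer LR groups can be expressed as a multiplier rule over parameter names; the matching strings below are assumptions about how the layers are identified, while the base LR and the ×3 / ×0.5 factors are the values listed above.

```python
def ttt_lr_for(name: str, base_lr: float = 5e-4) -> float:
    """Per-layer learning rate used during test-time training.

    Output projections get 3x the base LR and input projections 0.5x,
    compensating for uneven quantization damage across layer types.
    """
    if "out_proj" in name:   # assumed identifier for output projections
        return 3.0 * base_lr
    if "in_proj" in name:    # assumed identifier for input projections
        return 0.5 * base_lr
    return base_lr
```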
LR Schedule
- warmdown: 3500 steps
- cosine decay: per-step, decaying to 0
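The per-step cosine decay to 0 is the standard half-cosine schedule, evaluated every optimizer step rather than per epoch:

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 to exactly 0 at the final step."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

Reaching exactly 0 by the last TTT step is what lets the 22-epoch run avoid overfitting late in adaptation.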
Evaluation
- sliding window eval + Test-Time Training (TTT): 22 TTT epochs, batch size 32, all_reduce gradient sync per step
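Sliding-window evaluation scores each token exactly once while giving it as much left context as the window allows. A sketch of the indexing scheme (the window and stride values are not given in the PR; this only shows the span bookkeeping):

```python
def sliding_windows(seq_len: int, window: int, stride: int):
    """Return (start, end, score_start) spans for sliding-window eval.

    The model sees context [start, end) but the loss is computed only on
    tokens [score_start, end), so consecutive spans tile the sequence
    without double-counting any token.
    """
    spans = [(0, min(window, seq_len), 0)]
    end = spans[0][1]
    while end < seq_len:
        new_end = min(end + stride, seq_len)
        spans.append((max(0, new_end - window), new_end, end))
        end = new_end
    return spans
```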
Novel Contributions
- Batched TTT with 32 sequences per GPU is ~500x faster than chunk-based TTT
- Per-step cosine learning rate decay prevents overfitting at high epoch counts during TTT
- Gradient synchronization per step (all_reduce on gradients) is critical for stable multi-GPU TTT
- Per-layer learning rate groups compensate for uneven quantization damage, especially on output projections
- LeakyReLU(0.5)² activation improves BPB by 0.003 compared to ReLU²
- TrigramHash extends BigramHash context from 2 to 3 tokens using a shared embedding table with zero extra parameters