PR #462 (closed)
Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)
by JoeProAI
val_bpb: 1.0672
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
SwiGLU FFN
Feed-forward network uses SwiGLU with Star-ReLU activation.
parameters: null
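The PR does not include code; a minimal NumPy sketch of a SwiGLU-style FFN with StarReLU on the gate branch (StarReLU constants taken from the MetaFormer paper; all weight shapes here are illustrative) might look like:

```python
import numpy as np

def star_relu(x, s=0.8944, b=-0.4472):
    # StarReLU: s * relu(x)**2 + b (constants from the MetaFormer paper)
    return s * np.maximum(x, 0.0) ** 2 + b

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU-style FFN: the activated gate branch elementwise-multiplies
    # the up-projection before the down-projection.
    return (star_relu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, h = 128, 512  # illustrative model and hidden sizes
x = rng.standard_normal((4, d))
w_gate = rng.standard_normal((d, h)) / np.sqrt(d)
w_up = rng.standard_normal((d, h)) / np.sqrt(d)
w_down = rng.standard_normal((h, d)) / np.sqrt(h)
y = swiglu_ffn(x, w_gate, w_up, w_down)  # shape (4, 128)
```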
U-Net
U-Net-style skip connections with learned gating.
parameters: null
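One plausible form of a gated U-Net-style long skip (the PR does not specify the gate's shape; a scalar sigmoid gate is assumed here):

```python
import numpy as np

def gated_skip(x_decoder, x_encoder, gate_logit):
    # U-Net-style long skip: add the matching encoder-side activation,
    # scaled by a learned gate (sigmoid keeps it in (0, 1)).
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return x_decoder + gate * x_encoder

x_dec = np.ones((4, 8))
x_enc = 2 * np.ones((4, 8))
out = gated_skip(x_dec, x_enc, 0.0)  # gate_logit 0 -> gate = 0.5
```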
BigramHash
BigramHash embeddings for token representation.
parameters: {"buckets":8192,"dimension":128}
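A sketch of hashed bigram embeddings with the record's bucket count and dimension; the hash multiplier and zero-padding of the first position are illustrative assumptions, not the record's actual choices:

```python
import numpy as np

BUCKETS, DIM = 8192, 128  # from the record's parameters

def bigram_hash_embed(tokens, table):
    # Hash each (previous token, current token) pair into one of BUCKETS
    # buckets and look up its embedding, which would be added to the usual
    # token embedding. The multiplier below is an illustrative choice.
    prev = np.concatenate([[0], tokens[:-1]])
    buckets = (prev * 1000003 + tokens) % BUCKETS
    return table[buckets]

table = np.random.default_rng(0).standard_normal((BUCKETS, DIM)).astype(np.float32) * 0.02
emb = bigram_hash_embed(np.array([5, 17, 17, 42]), table)  # shape (4, 128)
```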
SmearGate
SmearGate applied on embeddings.
parameters: null
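The PR gives no definition of SmearGate; one plausible reading is that each token's embedding is "smeared" with the previous position's through a learned sigmoid gate, sketched here under that assumption:

```python
import numpy as np

def smear_gate(emb, gate_logit):
    # Assumed interpretation: mix in the previous position's embedding
    # through a learned sigmoid gate (position 0 reuses itself).
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    prev = np.concatenate([emb[:1], emb[:-1]], axis=0)
    return emb + gate * prev

emb = np.arange(6.0).reshape(3, 2)
smeared = smear_gate(emb, 0.0)  # gate_logit 0 -> gate = 0.5
```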
Partial RoPE
Rotary positional embeddings applied only partially.
parameters: {"dimensions":16}
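Partial RoPE rotates only the first 16 dimensions of each head and leaves the rest untouched; a NumPy sketch for a single head (head size 64 is an illustrative assumption):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # Apply rotary embeddings to only the first rot_dims dimensions of each
    # position's head vector; the remaining dimensions pass through unchanged.
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), inv_freq)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

q = np.random.default_rng(0).standard_normal((8, 64))  # (seq, head_dim)
q_rot = partial_rope(q)
```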
KV head count
Uses 8 attention heads and 8 KV heads (one KV head per query head, i.e. standard multi-head attention rather than GQA/MQA).
parameters: {"heads":8,"kv_heads":8}
weight tying
Input embedding and output (LM head) matrices are tied.
parameters: null
XSA
Cross-sequence attention on the last 4 layers.
parameters: {"layers":4}
Weight Averaging
EMA
Exponential moving average of model weights, evaluated in place of the raw weights.
parameters: {"decay":0.9985}
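The EMA update with the record's decay, sketched over a parameter dict:

```python
def ema_update(avg, current, decay=0.9985):
    # Exponential moving average of weights; the averaged copy is the one
    # that gets evaluated. decay=0.9985 is the record's parameter.
    return {k: decay * avg[k] + (1.0 - decay) * v for k, v in current.items()}

avg = {"w": 1.0}
avg = ema_update(avg, {"w": 0.0})  # avg["w"] -> 0.9985
```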
Test-Time Training
AdamW TTT
Test-time training: the model is briefly fine-tuned with AdamW at evaluation time.
parameters: {"learning_rate":0.0005,"epochs":10,"weight_decay":0}
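A sketch of the optimizer side of this: one AdamW step (with weight_decay=0, as in the record's parameters, it reduces to plain Adam), driven here by a stand-in quadratic loss since the actual TTT objective is not given in the PR:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.0):
    # One AdamW update with bias correction; wd=0 matches the record.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

# Test-time training loop sketch: 10 "epochs" of updates, as in the record.
p = np.array([1.0])
m = v = np.zeros(1)
for t in range(1, 11):
    g = 2.0 * p  # gradient of a stand-in loss p**2
    p, m, v = adamw_step(p, g, m, v, t)
```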
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
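The stated scale rule is directly computable; deeper layers get a smaller LayerNorm gain, which damps residual-stream growth with depth:

```python
import math

def ln_scale(layer_idx):
    # Depth-dependent LayerNorm gain 1/sqrt(layer_idx + 1) from the record:
    # layer 0 keeps scale 1.0, layer 3 gets 0.5, and so on.
    return 1.0 / math.sqrt(layer_idx + 1)
```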
Quantization
int6
bits: 6
scope: all
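A minimal sketch of symmetric per-tensor int6 quantization (the record does not state its quantization scheme; per-tensor symmetric round-to-nearest is assumed here):

```python
import numpy as np

def quantize_int6(w):
    # Symmetric 6-bit quantization: integer levels in [-31, 31],
    # one scale per tensor (an assumed granularity).
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)  # max error bounded by scale / 2
```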
Late QAT
bits: null
scope: all
LR Schedule
warmdown
parameters: {"steps":6000}
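A common shape for a warmdown schedule is constant LR followed by a linear decay to zero; only the 6000-step warmdown length comes from the record, the rest is an assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=6000):
    # Constant LR, then linear decay ("warmdown") to zero over the final
    # warmdown_steps. total_steps and base_lr are caller-supplied assumptions.
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```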
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
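Muon's core step approximately orthogonalizes each matrix gradient with a Newton-Schulz iteration before applying it; a NumPy sketch using the quintic coefficients from the public Muon implementation (momentum and the record's matrix_lr would wrap around this):

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    # Quintic Newton-Schulz iteration that maps the gradient toward U V^T
    # from its SVD; coefficients follow the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # keep the Gram matrix small
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x

g = np.random.default_rng(0).standard_normal((64, 128))
o = newton_schulz_orth(g)
sv = np.linalg.svd(o, compute_uv=False)  # singular values pushed toward 1
```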
Novel Contributions
- SwiGLU FFN with Star-ReLU activation
- U-Net skip connections with learned gating
- BigramHash embeddings
- SmearGate on embeddings
- GEPA-discovered architecture search result
- Combination of XSA4, EMA, AdamW TTT, Partial RoPE, LN Scale, and Late QAT
- Int6 quantization with zstd-22 compression