PR #201
LAWA-EMA frontier fork (pr198 base, SWA -> LAWA, val_bpb=1.1551)
by machdragon
val_bpb
1.1551
Architecture
Transformer
Optimizer
Muon
Artifact Size
12.7 MB
Training Techniques
Quantization
int6
bits: 6
scope: MLP and attention weights; int8 embeddings
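A minimal sketch of symmetric per-tensor int6 quantization as described above (values clipped to the signed 6-bit range [-31, 31]); the function names and per-tensor scaling granularity are assumptions, not the PR's actual code:

```python
import numpy as np

def quantize_int6(w):
    # Symmetric 6-bit quantization: map floats onto integers in [-31, 31]
    # with a single per-tensor scale (per-channel scales are also common).
    max_abs = np.abs(w).max()
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # stored in int8 containers
    return q, scale

def dequantize_int6(q, scale):
    # Reconstruction error is bounded by scale / 2 per element.
    return q.astype(np.float32) * scale
```

Packing four int6 values into three bytes (rather than one int8 container each) is what drives the small artifact size.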
Architecture
MLP3x
Expanded MLP hidden size to 3x
parameters: {"hidden":1536}
SmearGate
Added SmearGate module
parameters: null
BigramHash
Added BigramHash embedding component
parameters: {"vocab_size":2048,"dim":128}
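A hedged sketch of what a bigram-hash embedding with these parameters could look like: each (previous, current) token pair is hashed into a 2048-entry table of 128-dim vectors, then projected to the model width. The class name matches the `BigramHashEmbedding` identifier mentioned under "Other"; the hash function, model width, and init scales are assumptions:

```python
import numpy as np

class BigramHashEmbedding:
    def __init__(self, vocab_size=2048, dim=128, model_dim=768, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((vocab_size, dim)).astype(np.float32)
        # Projection to model width. Per the bug fix noted below under
        # "Other", this must not be silently zero-initialized, or the
        # whole component contributes nothing and never gets gradient signal
        # through a zeroed output path.
        self.proj = (0.02 * rng.standard_normal((dim, model_dim))).astype(np.float32)
        self.vocab_size = vocab_size

    def __call__(self, tokens):
        # Pair each token with its predecessor (BOS treated as id 0),
        # hash the pair into the table, look up, and project.
        prev = np.concatenate([[0], tokens[:-1]])
        idx = (prev * 1000003 + tokens) % self.vocab_size  # cheap multiplicative pair hash
        return self.table[idx] @ self.proj  # (seq, model_dim)
```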
GQA
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
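With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A minimal single-sequence sketch (no masking or batching; shapes and naming are illustrative, not the PR's code):

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    # q: (seq, heads, d); k, v: (seq, kv_heads, d).
    # Each KV head is shared by heads // kv_heads query heads,
    # halving KV-cache size at kv_heads = heads / 2.
    group = heads // kv_heads
    k = np.repeat(k, group, axis=1)  # (seq, heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)  # (seq, heads, d)
```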
U-Net skips
Added U-Net style skip connections across layers
parameters: null
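One common shape for U-Net style skips in a transformer stack: the first half of the layers pushes activations onto a stack, the second half pops them and adds them back. This pairing scheme is an assumption (the PR does not specify which layers are connected):

```python
def unet_forward(x, layers):
    # First half: save each layer's input. Second half: add the saved
    # activations back in reverse order, mirroring U-Net's encoder/decoder
    # skip structure across depth.
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            saved.append(x)
        elif saved:
            x = x + saved.pop()
        x = layer(x)
    return x
```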
RoPE
Used NTK-aware rotary positional embeddings
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.995,"dtype":"float32","update_frequency":"every_step"}
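The every-step EMA update with the listed decay is simple to state; a minimal sketch over a parameter dict (real implementations keep the float32 shadow copy alongside possibly lower-precision training weights):

```python
def ema_update(ema, params, decay=0.995):
    # Exponential moving average of weights, applied every optimizer step:
    # ema <- decay * ema + (1 - decay) * params.
    # The EMA weights, not the raw weights, are what get evaluated/exported.
    for name, p in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * p
    return ema
```

With decay 0.995, the average has an effective horizon of roughly 1 / (1 - 0.995) = 200 steps.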
Initialization
Overtone init
SVD-based power-law embedding spectrum initialization for smoother int6 quantization
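A hedged sketch of one way to realize an SVD-based power-law spectrum init: draw a random embedding, take its SVD, and replace the singular values with a power-law decay. The exponent and the exact spectrum shape are assumptions; the idea is that a smooth, predetermined spectrum leaves fewer extreme values for int6 quantization to clip:

```python
import numpy as np

def overtone_init(vocab, dim, alpha=1.0, seed=0):
    # Random orthogonal factors from the SVD of a Gaussian matrix,
    # recombined with an imposed power-law spectrum sigma_i = (i + 1)^(-alpha).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((vocab, dim))
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    s = (np.arange(dim) + 1.0) ** (-alpha)  # descending power law
    return (u * s) @ vt
```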
OrthoInit
Orthogonal initialization applied to large weight matrices
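The standard recipe for orthogonal init is QR decomposition of a Gaussian matrix with a sign correction (the gain parameter is an assumption; the PR does not state one):

```python
import numpy as np

def ortho_init(rows, cols, gain=1.0, seed=0):
    # Orthogonal init via reduced QR of a Gaussian matrix. The sign fix on
    # R's diagonal makes the result uniformly distributed over orthogonal
    # matrices rather than biased by QR's sign convention.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)  # reduced: q is (rows, cols) for rows >= cols
    q *= np.sign(np.diag(r))
    return gain * q
```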
Evaluation
sliding window eval
parameters: {"stride":64}
partial-window fix
parameters: {"only_full_windows":true}
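A sketch of how the window placement with the boundary fix might look: slide a fixed context window by the stride, and with `only_full_windows` drop the trailing placements that cannot fill a complete window (previously those partial windows skewed the average). The window length is an assumed value:

```python
def sliding_window_positions(n_tokens, window=1024, stride=64, only_full_windows=True):
    # Enumerate (start, end) spans for sliding-window evaluation.
    # With only_full_windows=True, a placement is emitted only when a
    # full `window` tokens of context are available.
    if only_full_windows:
        starts = range(0, n_tokens - window + 1, stride)
    else:
        starts = range(0, n_tokens, stride)
    return [(s, min(s + window, n_tokens)) for s in starts]
```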
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"grad_clip":0.3,"warmdown_iters":1200}
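Muon's distinctive step is orthogonalizing the momentum-averaged gradient of each 2D weight matrix before applying it. A sketch of the Newton-Schulz iteration commonly used for this, with the published quintic coefficients; step count and epsilon are conventional choices, not taken from the PR:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    # Quintic Newton-Schulz iteration: drives the singular values of the
    # (Frobenius-normalized) update toward 1, approximating UV^T from the
    # SVD g = U S V^T without computing an SVD.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # keep the Gram matrix A = X X^T on the small side
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x
```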
Regularization
weight decay
parameters: {"value":0.04}
Compression
zstd
level: 22
Other
other
Fixed BigramHashEmbedding.proj zero-init override bug
parameters: null
Novel Contributions
- LAWA-EMA replacing SWA with every-step exponential moving average
- Overtone initialization using SVD power-law embedding spectrum
- BigramHashEmbedding projection zero-init fix
- Sliding window evaluation boundary fix
- Int6 quantized submission with reduced artifact size