PR #156
feat(record): Int6 STE + NorMuon + SWA + Sliding Window (val_bpb=1.16019)
by dexhunter
val_bpb
1.1602
Architecture
Transformer
Optimizer
NorMuon
Artifact Size
15,045,740 bytes
Training Techniques
Quantization
int6
bits: 6
scope: per-row weights; embeddings kept fp16
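The per-row int6 scheme can be sketched as follows. This is a minimal illustrative implementation, not the PR's actual code: each row gets its own symmetric scale so that its largest-magnitude entry maps to ±31, and weights are stored as small integers plus one scale per row.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Per-row symmetric int6 quantization (illustrative sketch).

    Each row gets its own scale so the largest-magnitude entry in that
    row maps to +/-31, the symmetric int6 range used in this PR."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid div-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([[0.5, -1.0, 0.25],
              [2.0,  0.1, -2.0]], dtype=np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)  # reconstruction error is at most half a scale step per row
```

Per-row (rather than per-tensor) scales keep rows with small weights from being crushed by a single large outlier elsewhere in the matrix.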
Architecture
MLP3x
3x wider MLP hidden layer to increase capacity within the artifact budget
parameters: {"dimensions":1536}
tied embeddings
The embedding tensor is tied with the output head and stored in fp16; it is never quantized
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"layers":9,"model_dim":512,"attention_heads":8,"kv_heads":4}
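A hedged sketch of the attention-score computation under this config (8 query heads sharing 4 KV heads, so two query heads per KV head; head_dim = 512 / 8 = 64). The function name and shapes are assumptions for illustration:

```python
import numpy as np

model_dim, n_heads, n_kv_heads = 512, 8, 4   # from the listed config
head_dim = model_dim // n_heads              # 64
group = n_heads // n_kv_heads                # 2 query heads per KV head

def gqa_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """q: (n_heads, T, head_dim); k: (n_kv_heads, T, head_dim).
    Each KV head is repeated so that `group` query heads share it,
    halving the KV cache relative to full multi-head attention."""
    k_rep = np.repeat(k, group, axis=0)      # (n_heads, T, head_dim)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)

rng = np.random.default_rng(0)
T = 4
q = rng.standard_normal((n_heads, T, head_dim))
k = rng.standard_normal((n_kv_heads, T, head_dim))
scores = gqa_scores(q, k)                    # (8, T, T)
```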
RoPE
Uses RoPE positional encoding with learnable Q gain
parameters: {"q_gain_init":1.5}
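A minimal RoPE sketch with the learnable query gain applied after rotation. The placement of the gain (a scalar multiplier on the rotated queries, initialized to 1.5) is an assumption; the PR only specifies `q_gain_init`:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (T, head_dim).
    Pairs dimension i with dimension i + head_dim/2 and rotates each
    pair by a position- and frequency-dependent angle."""
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)         # (half,)
    angles = np.arange(T)[:, None] * freqs[None, :]   # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q_gain = 1.5                                  # learnable scalar, init per the config
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))              # (T, head_dim)
q_rot = rope(q) * q_gain                      # hypothetical gain placement
```

Rotation preserves vector norms, so the gain is the only source of magnitude change on the query side.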
Optimizer
NorMuon
weight_decay: null
momentum: 0.99
other_params: {"beta2":0.95,"matrix_lr":0.02,"warmdown_iters":3000,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
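The update can be sketched as Muon-style Newton-Schulz orthogonalization followed by a neuron-wise (per-row) second-moment normalization, which is the distinguishing feature of NorMuon. This is a simplified sketch, not the PR's implementation; the Newton-Schulz coefficients are the commonly used Muon constants, and the exact normalization details are assumptions:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration (as in Muon) that approximately
    orthogonalizes G, pushing its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def normuon_step(w, grad, m, v, lr=0.02, momentum=0.99, beta2=0.95, eps=1e-8):
    """One sketched NorMuon step: momentum accumulation, orthogonalized
    update, then per-row second-moment normalization (buffer `v`)."""
    m = momentum * m + grad
    u = newton_schulz(m)
    v = beta2 * v + (1 - beta2) * (u ** 2).mean(axis=1)
    u = u / (np.sqrt(v)[:, None] + eps)
    return w - lr * u, m, v
```

Per the listed hyperparameters, momentum itself is warmed up from 0.92 to 0.99 over the first 1500 steps; that schedule is omitted here for brevity.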
Weight Averaging
SWA
parameters: {"checkpoints":7,"interval_steps":200}
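Stochastic weight averaging here is an equal-weight average of 7 checkpoints taken every 200 steps (1400 steps, which fits inside the 3000-step warmdown). A minimal sketch, assuming checkpoints are dicts of parameter arrays:

```python
import numpy as np

def swa_average(checkpoints: list[dict]) -> dict:
    """Equal-weight average of checkpoint dicts (name -> array)."""
    n = len(checkpoints)
    return {k: sum(ck[k] for ck in checkpoints) / n for k in checkpoints[0]}

# Toy usage: 7 checkpoints, saved every 200 steps during warmdown.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)
```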
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":960}
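The window plan can be sketched in the style of standard stride-based perplexity evaluation: each window covers up to 960 tokens, windows advance by 64 tokens, and only tokens not scored by a previous window are scored, so most tokens see up to 896 tokens of preceding context. The helper name is an assumption:

```python
def sliding_windows(n_tokens: int, context_length: int = 960, stride: int = 64):
    """Return (start, end, n_scored) spans. Each window advances by `stride`;
    only tokens beyond the previous window's end are scored, so every token
    is evaluated exactly once."""
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + context_length, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

plan = sliding_windows(2000)   # e.g. a 2000-token eval document
```

The trade-off is compute: a small stride means many overlapping forward passes, but each scored token gets close to the full context window behind it.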
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
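A minimal sketch of the warmdown schedule: constant LR, then a linear decay to zero over the final 3000 steps. The function name and the linear shape of the decay are assumptions:

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
    """LR multiplier: 1.0 until the warmdown begins, then linear decay
    to 0.0 over the last `warmdown_steps` steps."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```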
Other
other
Straight-through estimator with fake int6 per-row quantization applied on every forward pass during training
parameters: {"range":[-31,31]}
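The straight-through estimator can be sketched as: the forward pass sees quantize-then-dequantize values, while the backward pass treats the rounding as the identity so gradients reach the full-precision master weights unchanged. A minimal numpy sketch (in an autodiff framework the same effect is commonly obtained with `w + (fake_quant(w) - w).detach()`):

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Forward-pass fake quantization: per-row quantize to [-31, 31],
    then dequantize; the network computes with these snapped values."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(w / scale), -31, 31) * scale

def ste_backward(grad_out: np.ndarray) -> np.ndarray:
    """Straight-through estimator: the non-differentiable round/clip is
    treated as identity, so the upstream gradient passes through as-is."""
    return grad_out
```

This is what makes training with int6 weights possible: rounding has zero gradient almost everywhere, so without the straight-through bypass the master weights would never receive a learning signal.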
other
U-Net style skip connections with learnable per-layer per-dimension skip weights
parameters: null
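The skip pattern can be sketched for the 9-layer stack above: the first four blocks push their outputs onto a stack, the last four pop them and mix them back in through a learnable elementwise gate, and the middle block has no skip. The gate initialization and exact pairing are assumptions:

```python
import numpy as np

n_layers, d = 9, 512   # matches the 9-layer, 512-dim config listed above

# Hypothetical learnable per-layer, per-dimension skip gates (zero-init here).
skip_gates = [np.zeros(d) for _ in range(n_layers // 2)]

def forward(x: np.ndarray, blocks: list) -> np.ndarray:
    """U-Net pattern: encoder half pushes activations, decoder half pops
    them and adds them back through an elementwise learnable gate."""
    stack = []
    for i, block in enumerate(blocks):
        if i < n_layers // 2:          # first 4 blocks: push
            x = block(x)
            stack.append(x)
        elif i == n_layers // 2:       # middle block: no skip
            x = block(x)
        else:                          # last 4 blocks: pop and gate
            gate = skip_gates[i - n_layers // 2 - 1]
            x = block(x + gate * stack.pop())
    return x

blocks = [lambda t: t + 1.0 for _ in range(n_layers)]
out = forward(np.zeros(d), blocks)    # with zero gates: a plain 9-block stack
```

With zero-initialized gates the network starts as a plain residual stack and learns, per layer and per dimension, how much of the early activations to reinject.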
Novel Contributions
- Int6 STE fake quantization during training with straight-through gradient bypass
- NorMuon optimizer with row-normalized Newton-Schulz updates
- 3x wider MLP enabled by int6 compression savings
- FP16 tied embedding passthrough to protect quantization-sensitive weights
- Sliding window evaluation with stride 64 for longer effective context
- SWA over 7 checkpoints during warmdown
- Zstd-22 artifact compression
- U-Net skip connections with learnable skip weights