PR #637
Non-record submission: BigramDim160 + 10% Prune + SWA (1.14767 bpb, 2 seeds)
by bryjudy
val_bpb
1.1477
Architecture
Transformer
Optimizer
Muon + Adam
Artifact Size
~15.8–15.9 MB (varies across seeds)
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: MLP (int5/int6), attention (int6)
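The PR states the bit widths and scope but not the exact scheme; a minimal sketch, assuming symmetric per-tensor round-to-nearest quantization (packing the 5/6-bit values for storage is omitted):

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Round a weight tensor onto a signed `bits`-wide integer grid with one per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for int5, 31 for int6
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), min=-qmax - 1, max=qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```

Per the stated scope, MLP weights would pass through `quantize_symmetric(w, bits=5)` or `bits=6`, and attention weights through `bits=6`.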
Architecture
BigramHash
Bigram embedding with reduced dimension to control artifact size
parameters: {"dim":160,"vocab_buckets":10240}
SmearGate
Replaces standard LayerNorm gating
parameters: null
OrthoInit
Orthogonal initialization for better initial weight structure
parameters: null
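A minimal sketch using `torch.nn.init.orthogonal_`; exactly which tensors it covers (here: all 2-D non-embedding weights) is an assumption:

```python
import torch.nn as nn

def ortho_init_(model: nn.Module) -> None:
    """Orthogonally initialize 2-D weight matrices; leave biases, norms, and embeddings alone."""
    for name, p in model.named_parameters():
        if p.dim() == 2 and "embed" not in name:
            nn.init.orthogonal_(p)
```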
GQA
Grouped Query Attention reduces parameter count while maintaining quality
parameters: {"heads":8,"kv_heads":4}
Weight Averaging
SWA
parameters: {"start_frac":0.5,"checkpoints_averaged":23}
Compression
zstd
level: 22
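A minimal sketch of the final compression step using the `zstandard` package at level 22; the file paths are placeholders and whether the artifact is compressed in one shot or streamed is not stated:

```python
import zstandard as zstd

def compress_artifact(path_in: str, path_out: str) -> None:
    """Compress the serialized (pruned, quantized) model bytes with zstd level 22."""
    cctx = zstd.ZstdCompressor(level=22)
    with open(path_in, "rb") as f_in, open(path_out, "wb") as f_out:
        f_out.write(cctx.compress(f_in.read()))
```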
Evaluation
sliding window eval
parameters: {"stride":64}
Optimizer
Muon + Adam
weight_decay: 0.04
momentum: null
other_params: {"embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02}
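The separate embedding/matrix/scalar learning rates imply per-group optimizers; a minimal sketch of the grouping, where the name-based routing and the use of AdamW for the non-matrix groups are assumptions (the PR only says Muon + Adam):

```python
import torch

def build_param_groups(model: torch.nn.Module):
    """Split parameters into the three LR groups listed above: embeddings (lr=0.03),
    2-D matrices (lr=0.02, handled by Muon), scalars/1-D tensors (lr=0.02),
    all with weight decay 0.04."""
    embeds, matrices, scalars = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name:
            embeds.append(p)
        elif p.dim() >= 2:
            matrices.append(p)
        else:
            scalars.append(p)
    adam = torch.optim.AdamW(
        [{"params": embeds, "lr": 0.03}, {"params": scalars, "lr": 0.02}],
        weight_decay=0.04,
    )
    # The matrix group is handed to Muon (from the speedrun codebase, not a torch
    # built-in); its exact constructor signature is assumed and not shown here.
    return adam, matrices
```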
Regularization
weight pruning
parameters: {"amount":"10%","scope":"non-embedding linear weights","timing":"post-SWA, pre-quantization"}
Novel Contributions
- Reduced the BigramHash embedding dimension from 192 to 160 to keep the artifact reliably under 16MB across seeds
- Applied 10% weight pruning to non-embedding linear weights post-SWA to improve compressibility without hurting quality
- Highlighted artifact-size variance between seeds as a key challenge, emphasizing reliability over raw quality
- Used SWA starting halfway through training (start_frac=0.5), averaging 23 checkpoints
- Retained existing SOTA techniques: SmearGate, OrthoInit, GQA, and mixed int5/int6 quantization with zstd compression