PR #205
openMetaStack v3: 1.1792 sliding bpb, 10L BigramHash SmearGate OrthoInit SWA
by xinpw8 · View on GitHub
val_bpb: 1.1792
Architecture: GPT
Optimizer: Muon
Artifact Size: 12.1 MB
Training Techniques
Architecture
BigramHash
BigramHash embeddings used in the GPT model.
parameters: null
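The PR doesn't show the embedding code; below is a minimal sketch of hashed bigram embeddings, assuming each (previous, current) token pair is hashed into a fixed-size auxiliary table whose entry is added to the ordinary token embedding. The class name, table size `n_hash`, and hash constant are all illustrative.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Sketch of hashed bigram embeddings: each (prev, cur) token pair is
    hashed into a fixed-size auxiliary table added to the token embedding."""

    def __init__(self, vocab_size: int, n_embd: int, n_hash: int = 1 << 16):
        super().__init__()
        self.unigram = nn.Embedding(vocab_size, n_embd)
        self.bigram = nn.Embedding(n_hash, n_embd)
        self.n_hash = n_hash

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        # cheap multiplicative mixing hash of the (prev, cur) pair
        h = (prev * 1000003 + idx) % self.n_hash
        return self.unigram(idx) + self.bigram(h)
```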
SmearGate
SmearGate mechanism added to the model.
parameters: null
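No definition of SmearGate is given here. One plausible reading, by analogy with the "smeared key" trick, is a learned per-channel gate that mixes each position's activation with its predecessor's; the sketch below is that guess, not the PR's actual code.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Guessed sketch of a smear gate: a learned per-channel sigmoid gate
    blends each position with the previous one."""

    def __init__(self, n_embd: int):
        super().__init__()
        # init near -4 so sigmoid(gate) ~ 0.02: almost identity at first
        self.gate = nn.Parameter(torch.full((n_embd,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C); shift right by one position, zero-pad the first
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```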
Initialization
OrthoInit
Orthogonal initialization for model weights.
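OrthoInit presumably applies `torch.nn.init.orthogonal_`; a sketch that orthogonally initializes every 2D weight matrix (which tensors the PR actually targets is not stated):

```python
import torch.nn as nn

def ortho_init_(model: nn.Module, gain: float = 1.0) -> None:
    """Orthogonally initialize every 2D weight matrix in the model."""
    for name, p in model.named_parameters():
        if p.ndim == 2 and name.endswith("weight"):
            nn.init.orthogonal_(p, gain=gain)
```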
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"checkpoints":30}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
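Muon's reference implementation is public; below is a compressed sketch of its core update (momentum on the raw gradient, then Newton-Schulz orthogonalization of the 2D update) with the decoupled weight decay of 0.04 reported above. The learning-rate and momentum values are placeholders, since the card reports `momentum: null`.

```python
import torch

@torch.no_grad()
def newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G with a quintic Newton-Schulz
    iteration; coefficients follow the public Muon reference code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(p, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update for a 2D weight: momentum buffer, orthogonalized
    update, decoupled weight decay (wd = 0.04 per the card)."""
    buf.mul_(momentum).add_(p.grad)
    p.mul_(1 - lr * weight_decay)          # decoupled weight decay
    p.add_(newtonschulz5(buf), alpha=-lr)
```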
Quantization
mixed int5/int6
bits: 5 (MLP), 6 (attention)
scope: MLP weights quantized to int5, attention weights to int6
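A sketch of mixed-width weight quantization matching the reported scope (int5 for MLP weights, int6 for attention). Symmetric, per-output-channel scaling is an assumption; the PR does not state the scheme.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-output-channel quantization of a 2D weight to
    signed `bits`-bit integers; returns int tensor plus scales."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# per the reported scope: MLP weights -> int5, attention weights -> int6
# q_mlp, s_mlp = quantize_symmetric(mlp.weight, bits=5)
# q_att, s_att = quantize_symmetric(attn.weight, bits=6)
```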
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"mixed_precision_pruning":"2% magnitude pruning"}
Other
magnitude pruning
2% magnitude pruning applied to the model.
parameters: {"pruning_rate":0.02}
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: null
eval_length: 1024
Compression
zstd
level: 22
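Compressing the artifact with zstd at level 22 via the `zstandard` package (file paths are illustrative):

```python
import zstandard as zstd

def compress_artifact(in_path: str, out_path: str, level: int = 22) -> None:
    """Compress the serialized model artifact with zstd at level 22."""
    cctx = zstd.ZstdCompressor(level=level)
    with open(in_path, "rb") as f_in, open(out_path, "wb") as f_out:
        cctx.copy_stream(f_in, f_out)
```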
Novel Contributions
- 10-layer GPT with BigramHash embeddings
- SmearGate architecture component
- OrthoInit initialization
- SWA over 30 checkpoints
- Muon optimizer with decoupled weight decay
- Mixed int5/int6 quantization
- 2% magnitude pruning
- Sliding-window evaluation with stride 64
- Search harness and deployment/monitoring pipeline