PR #205 (open)

MetaStack v3: 1.1792 sliding bpb, 10L BigramHash SmearGate OrthoInit SWA

val_bpb: 1.1792
Architecture: GPT
Optimizer: Muon
Artifact Size: 12.1 MB

Training Techniques

Architecture
BigramHash: BigramHash embeddings used in the GPT model.
SmearGate: SmearGate mechanism added to the model.
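BigramHash is not specified further in this PR; one plausible reading is hashing each (previous, current) token pair into a fixed-size bucket table whose embeddings are added to the ordinary token embeddings. A minimal numpy sketch under that assumption (the hash function and all names here are hypothetical):

```python
import numpy as np

def bigram_hash_embed(tokens, tok_emb, bigram_table, n_buckets):
    """Add hashed (prev, cur) bigram embeddings to the token embeddings.

    tokens: (T,) int array; tok_emb: (V, D); bigram_table: (n_buckets, D).
    The bucket for position t hashes the pair (tokens[t-1], tokens[t]);
    position 0 uses a sentinel previous token of -1.
    """
    prev = np.concatenate(([-1], tokens[:-1]))
    # Simple multiplicative pair hash into the bucket range (assumption:
    # the actual hash used in the PR is not documented).
    buckets = ((prev.astype(np.int64) * 1000003) ^ tokens) % n_buckets
    return tok_emb[tokens] + bigram_table[buckets]

# Tiny usage example with random tables.
rng = np.random.default_rng(0)
V, D, B = 50, 8, 64
x = bigram_hash_embed(rng.integers(0, V, size=16), rng.normal(size=(V, D)),
                      rng.normal(size=(B, D)), B)
```

The hash table adds only `n_buckets * D` parameters regardless of vocabulary size, which fits the small-artifact setting.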
Initialization
OrthoInit: orthogonal initialization for model weights.
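OrthoInit presumably refers to the standard orthogonal initializer: QR-decompose a Gaussian matrix and keep the orthonormal factor. A numpy sketch of that construction (the PR's exact gain and shape handling are not documented):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal initializer: QR-decompose a Gaussian matrix, keep Q.

    The rows (or columns, whichever dimension is smaller) of the result
    are orthonormal, scaled by `gain`.
    """
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    # Sign correction so the result is uniform over orthogonal matrices.
    q *= np.sign(np.diag(r))
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]

W = orthogonal_init((4, 6), rng=np.random.default_rng(0))
# Rows are orthonormal: W @ W.T is (approximately) the identity.
```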
Weight Averaging
SWA: stochastic weight averaging over 30 checkpoints (parameters: {"checkpoints":30}).
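SWA with checkpoints=30 suggests a uniform average of 30 saved checkpoints. A minimal sketch, assuming each checkpoint is a plain name-to-array dict:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform average of a list of parameter dicts (name -> ndarray)."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n
            for k in checkpoints[0].keys()}

# Usage: average 30 checkpoints (faked here with random parameters).
rng = np.random.default_rng(0)
ckpts = [{"w": rng.normal(size=(3, 3)), "b": rng.normal(size=3)}
         for _ in range(30)]
swa_params = average_checkpoints(ckpts)
```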
Optimizer
Muon: weight_decay 0.04 (momentum and other parameters not recorded).
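The optimizer code itself is not part of this PR; Muon's published form applies a Newton-Schulz orthogonalization to the momentum buffer, and "decoupled" weight decay shrinks the weights directly rather than through the gradient. A simplified numpy sketch (only weight_decay=0.04 comes from this PR; the learning rate and momentum values are illustrative):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration (coefficients from the reference Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(p, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update with decoupled weight decay (sketch)."""
    buf = momentum * buf + grad
    update = newton_schulz_orth(buf)
    p = p * (1 - lr * weight_decay)  # decay applied outside the gradient step
    return p - lr * update, buf

# Usage on a random 4x4 weight matrix.
rng = np.random.default_rng(0)
p, buf = rng.normal(size=(4, 4)), np.zeros((4, 4))
p, buf = muon_step(p, rng.normal(size=(4, 4)), buf)
```

Decoupling means the shrinkage `p *= (1 - lr * wd)` is independent of the orthogonalized update, matching the "decoupled weight decay" phrasing in the contributions list.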
Quantization
mixed int5/int6: MLP weights in int5, attention weights in int6.
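A plausible reading of the int5/int6 scope is symmetric per-tensor quantization with 5 bits for MLP weights and 6 for attention weights. A sketch under that assumption (the actual scheme, e.g. per-channel scales, is not documented here):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.

    Returns (q, scale) with q in [-(2**(bits-1)-1), 2**(bits-1)-1];
    dequantize with q * scale.
    """
    qmax = 2 ** (bits - 1) - 1  # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w_mlp, s_mlp = quantize_symmetric(rng.normal(size=(8, 8)), bits=5)  # MLP -> int5
w_att, s_att = quantize_symmetric(rng.normal(size=(8, 8)), bits=6)  # attn -> int6
```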
Regularization
weight decay: {"muon_weight_decay":0.04,"mixed_precision_pruning":"2% magnitude pruning"}
Other
2% magnitude pruning applied to the model (parameters: {"pruning_rate":0.02}).
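Magnitude pruning at rate 0.02 zeroes the 2% of weights with the smallest absolute value. A minimal numpy sketch:

```python
import numpy as np

def magnitude_prune(w, rate=0.02):
    """Zero out the `rate` fraction of weights with smallest magnitude."""
    k = int(round(rate * w.size))
    if k == 0:
        return w.copy()
    # k-th smallest absolute value is the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= threshold] = 0.0
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))
wp = magnitude_prune(w, rate=0.02)
```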
Evaluation
sliding window eval: stride 64.
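A common form of stride-64 sliding-window evaluation scores each 64-token chunk with up to eval_length tokens of left context, so every position after the first window is predicted with near-full context. A sketch with a dummy model interface (`nll_fn` and all names here are hypothetical):

```python
import numpy as np

def sliding_window_bpb(nll_fn, tokens, window=1024, stride=64):
    """Bits per byte over `tokens`, scoring `stride` tokens at a time.

    nll_fn(ctx) returns per-token negative log-likelihood in nats.
    Each scored chunk sees up to `window` tokens of left context.
    Assumes byte-level tokens, so bits/token equals bits/byte.
    """
    total_nats, n_scored = 0.0, 0
    for chunk_start in range(0, len(tokens), stride):
        chunk_end = min(chunk_start + stride, len(tokens))
        ctx = tokens[max(0, chunk_end - window):chunk_end]
        keep = chunk_end - chunk_start
        total_nats += nll_fn(ctx)[-keep:].sum()
        n_scored += keep
    return total_nats / n_scored / np.log(2)  # nats -> bits

# Usage with a dummy uniform model over 256 byte values: bpb is about 8.
uniform_nll = lambda ctx: np.full(len(ctx), np.log(256.0))
bpb = sliding_window_bpb(uniform_nll, np.zeros(5000, dtype=np.uint8))
```

Each token is scored exactly once, so the reported 1.1792 val_bpb under this scheme is directly comparable across stride choices, at the cost of roughly window/stride forward passes per token of new text.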
Sequence Length
sequence_length: eval_length 1024 (train_length not recorded).
Compression
zstd: level 22.

Novel Contributions

  • 10-layer GPT with BigramHash embeddings
  • SmearGate architecture component
  • OrthoInit initialization
  • SWA over 30 checkpoints
  • Muon optimizer with decoupled weight decay
  • Mixed int5/int6 quantization
  • 2% magnitude pruning
  • Sliding-window evaluation with stride 64
  • Search harness and deployment/monitoring pipeline