PR #2040

open

Non-record 10min/16MB: PMI-Spine Local-Global Muon (val_bpb 1.25425)

by FF-GardenFnView on GitHub
val_bpb: 1.2543
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15,895,414 bytes

Training Techniques

Architecture
weight tying
Shared/tied embeddings in the bifurcated local-global language model.
parameters: null
GQA
Global recurrent summary path uses grouped query attention.
parameters: null
sliding window attention
Local path uses full-resolution sliding-window attention with chunked lookback.
parameters: null
BigramHash
Uses bigram prior components loaded at startup for the PMI-anchored spine path.
parameters: null
TrigramHash
Uses trigram CP prior components loaded at startup for the PMI-anchored spine path.
parameters: null
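
The BigramHash entry above describes prior components loaded at startup. As a rough illustration of how a hashed bigram lookup can work in general, the sketch below maps each ordered (previous, current) token pair to a bucket of a precomputed table. The names `bigram_bucket` and `bigram_features`, the mixing constants, and the padding id 0 are all hypothetical; the PR does not state its hash function or table layout.

```python
def bigram_bucket(prev_id: int, cur_id: int, num_buckets: int) -> int:
    """Map an ordered token pair to a bucket index.

    The multiply-and-xor mixing and its constants are illustrative
    assumptions, not the hash used in the PR.
    """
    h = (prev_id * 1_000_003) ^ (cur_id * 998_244_353)
    return h % num_buckets


def bigram_features(token_ids, table, num_buckets):
    """Look up one prior value per position from a precomputed table.

    Position t only reads tokens t-1 and t, so the lookup is causal.
    """
    out = []
    prev = 0  # assumed padding id for the first position
    for tok in token_ids:
        out.append(table[bigram_bucket(prev, tok, num_buckets)])
        prev = tok
    return out
```

A trigram variant would hash three ids the same way. Because the buckets are shared, distinct pairs can collide; the table size trades memory against collision rate.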
Quantization
int4
bits: 4
scope: model export
Compression
zlib
level: null
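
For readers unfamiliar with the int4-plus-zlib export path, here is a minimal sketch of the general idea: quantize to 4-bit integers, pack two per byte, compress, and check that the roundtrip restores the same integers. It assumes symmetric per-tensor scaling and zlib level 9; the PR lists the level as null and does not describe its scaling scheme, and the quantization-aware training it mentions is not modeled here. All function names are hypothetical.

```python
import zlib


def quantize_int4(weights):
    """Symmetric per-tensor quantization to integers in [-8, 7] (assumed scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_nibbles(q):
    """Pack two signed 4-bit values per byte, low nibble first."""
    vals = [v & 0xF for v in q]  # two's-complement nibble
    if len(vals) % 2:
        vals.append(0)  # pad to an even count
    return bytes(vals[i] | (vals[i + 1] << 4) for i in range(0, len(vals), 2))


def unpack_nibbles(data, count):
    """Inverse of pack_nibbles; `count` drops the optional padding nibble."""
    out = []
    for byte in data:
        for nib in (byte & 0xF, byte >> 4):
            out.append(nib - 16 if nib >= 8 else nib)
    return out[:count]


def export_roundtrip(weights, level=9):
    """Quantize, pack, compress, then decompress and dequantize."""
    q, scale = quantize_int4(weights)
    blob = zlib.compress(pack_nibbles(q), level)
    restored = unpack_nibbles(zlib.decompress(blob), len(q))
    if restored != q:
        raise ValueError("int4/zlib roundtrip mismatch")
    return blob, [v * scale for v in restored]
```

In a size-capped submission, `len(blob)` plus the code and any scale metadata is what has to fit under the limit.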
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"row_sharded_local_muon":true,"stabilized_gram_ns_backend":true,"fifth_order_polar_express_coefficients":true,"safety_factor":1.05,"same_shape_batching":true,"adamw_for_embeddings_heads_scalars_aux":true}
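
As background on the Muon entry: Muon-style optimizers replace a matrix-shaped momentum update with an approximately orthogonalized version of it, computed by a Newton-Schulz iteration. The sketch below uses the classic cubic iteration rather than the fifth-order Polar Express coefficients named in `other_params`, and it ignores row sharding, same-shape batching, and the safety factor. The learning rate and momentum values are placeholders, since the PR reports both as null, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(grad, steps=20):
    """Approximate the polar factor U V^T of `grad`.

    Uses the cubic Newton-Schulz step X <- 1.5 X - 0.5 (X X^T) X, which
    converges for a full-rank input once it is scaled so that all singular
    values are at most 1 (Frobenius normalization guarantees this).
    """
    fro = math.sqrt(sum(v * v for row in grad for v in row))
    if fro == 0.0:
        return [row[:] for row in grad]
    x = [[v / fro for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, ry)] for rx, ry in zip(x, xxt_x)]
    return x


def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative update: momentum, orthogonalize, then descend."""
    momentum_buf = [[beta * m + g for m, g in zip(rm, rg)]
                    for rm, rg in zip(momentum_buf, grad)]
    update = orthogonalize(momentum_buf)
    weight = [[w - lr * u for w, u in zip(rw, ru)] for rw, ru in zip(weight, update)]
    return weight, momentum_buf
```

Production implementations typically run only a handful of tuned higher-order steps in low precision; the 20 cubic steps here favor clarity over speed.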
Other
other
PMI/SPINE basis projection adds a learned per-token channel anchored to a directional shifted-PMI low-rank basis.
parameters: null
other
Lag-interferometer residuals provide a causal pairwise past-lag mixer before local QKV.
parameters: null
other
Pointer-generator copy head merges into the logit projection.
parameters: null
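
To make the pointer-copy entry concrete, the sketch below shows the standard pointer-generator mixture: a vocabulary distribution is blended with a copy distribution built from attention over earlier context tokens, and the result is returned as log-probabilities. How the PR actually merges the copy head into its logit projection is not stated, so the gate, the score inputs, and the function names are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_copy_logits(vocab_logits, context_ids, copy_scores, gate):
    """Return log((1 - gate) * p_vocab + gate * p_copy) per vocabulary entry.

    context_ids: token ids of earlier positions (causal context).
    copy_scores: one unnormalized attention score per context position.
    gate: copy probability in [0, 1); assumed to come from a learned head.
    """
    if not 0.0 <= gate < 1.0:
        raise ValueError("gate must be in [0, 1)")
    p_vocab = softmax(vocab_logits)
    attn = softmax(copy_scores)
    p_copy = [0.0] * len(vocab_logits)
    for tok, a in zip(context_ids, attn):
        p_copy[tok] += a  # scatter attention mass onto the copied token id
    return [math.log((1.0 - gate) * pv + gate * pc)
            for pv, pc in zip(p_vocab, p_copy)]
```

Tokens that already occurred in the context receive extra probability mass in proportion to the attention they get, which is the behavior a copy head is meant to provide.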

Novel Contributions

  • Bifurcated shared-stem local-global language model
  • PMI/SPINE basis projection with directional shifted-PMI low-rank basis
  • Lag-interferometer residuals for causal past-lag mixing
  • Pointer-copy logit merge
  • Int4 QAT export with zlib roundtrip under 16MB
  • Muon-based optimizer taxonomy with row-sharded local Muon and stabilized Gram-NS backend
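
As background on the PMI/SPINE item in the list above, the following sketch computes a directional shifted PMI score for each observed ordered token pair. It assumes "directional" means that (a, b) and (b, a) are counted separately and that "shifted" means subtracting log(shift), as in shifted-PMI word embeddings; the PR does not define either term, and the function name is hypothetical. The sketch stops at the PMI scores: a truncated SVD of the resulting matrix would be one way to obtain a low-rank basis, though the PR does not say which factorization it uses.

```python
import math
from collections import Counter


def directional_shifted_pmi(token_ids, shift=1.0):
    """Shifted PMI for each observed ordered pair (a followed by b).

    PMI(a -> b) = log( c(a, b) * N / (c_left(a) * c_right(b)) ) - log(shift)
    where N is the number of adjacent pairs in the sequence.
    """
    pairs = Counter(zip(token_ids, token_ids[1:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c   # how often a appears as the first element
        right[b] += c  # how often b appears as the second element
    log_shift = math.log(shift)
    return {
        (a, b): math.log(c * total / (left[a] * right[b])) - log_shift
        for (a, b), c in pairs.items()
    }
```

Unobserved pairs are simply absent from the returned dictionary; a dense implementation would need a convention for them, such as clipping at zero.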