PR #2040

open

Non-record 10min/16MB: PMI-Spine Local-Global Muon (val_bpb 1.25425)

by FF-GardenFnView on GitHub

val_bpb

1.2543

Architecture

Hybrid

Optimizer

Muon

Artifact Size

15,895,414 bytes

Training Techniques

Architecture

weight tying

Shared/tied embeddings in the bifurcated local-global language model.

parameters: null
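A minimal sketch of what weight tying usually means, assuming the conventional form in which one table serves both the input embedding lookup and the output logit projection. The class name `TiedEmbedding` and its methods are hypothetical and are not taken from the PR.

```python
class TiedEmbedding:
    """One [vocab][dim] table shared by the input lookup and the output projection."""

    def __init__(self, table):
        # ASSUMPTION: a plain list-of-lists stands in for the shared parameter.
        self.table = table

    def embed(self, token_id):
        # Input side: row lookup.
        return self.table[token_id]

    def logits(self, hidden):
        # Output side: dot product of the hidden state with every row of the same table.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]


model = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = model.logits(model.embed(2))
```

Because both directions read the same table, the vocabulary matrix is stored once, which matters under a fixed artifact-size budget.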

GQA

Global recurrent summary path uses grouped query attention.

parameters: null
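For orientation, grouped query attention lets several query heads share one key/value head. The sketch below is a generic pure-Python illustration, not the PR's global-path code; the function name, the tensor layout, and the causal masking are assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(q, k, v):
    """q: [n_q_heads][T][d]; k, v: [n_kv_heads][T][d]. Causal. Returns [n_q_heads][T][d]."""
    n_q, n_kv = len(q), len(k)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group = n_q // n_kv
    out = []
    for h in range(n_q):
        kv = h // group  # consecutive query heads share one KV head
        d = len(q[h][0])
        head_out = []
        for t in range(len(q[h])):
            # Scaled dot-product scores against positions 0..t only.
            scores = [
                sum(a * b for a, b in zip(q[h][t], k[kv][s])) / math.sqrt(d)
                for s in range(t + 1)
            ]
            w = softmax(scores)
            head_out.append(
                [sum(w[s] * v[kv][s][j] for s in range(t + 1)) for j in range(d)]
            )
        out.append(head_out)
    return out
```

With 4 query heads and 2 KV heads, heads 0 and 1 read the first KV head and heads 2 and 3 read the second, so K/V projection parameters and cache are halved relative to full multi-head attention.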

sliding window attention

Local path uses full-resolution sliding-window attention with chunked lookback.

parameters: null
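One way to picture sliding-window attention with chunked lookback is through its attention masks. The sketch below assumes that "chunked lookback" means each query sees its own chunk plus a fixed number of preceding chunks; that reading, along with the function names and parameters, is an assumption and not confirmed by the PR.

```python
def sliding_window_mask(seq_len, window):
    """mask[t][s] is True when query t may attend to key s (causal, at most `window` keys)."""
    return [[0 <= t - s < window for s in range(seq_len)] for t in range(seq_len)]


def chunked_lookback_mask(seq_len, chunk, lookback_chunks=1):
    """Causal mask where query t sees its own chunk and `lookback_chunks` earlier chunks."""
    mask = []
    for t in range(seq_len):
        first = max(0, (t // chunk - lookback_chunks) * chunk)  # first visible key
        mask.append([first <= s <= t for s in range(seq_len)])
    return mask


window_mask = sliding_window_mask(6, 3)
chunk_mask = chunked_lookback_mask(6, chunk=2, lookback_chunks=1)
```

Under this reading, a chunked mask with one chunk of lookback always covers a sliding window of up to `chunk + 1` keys, so the chunked form can be computed blockwise and then trimmed to the exact window.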

BigramHash

Uses bigram prior components loaded at startup for the PMI-anchored spine path.

parameters: null

TrigramHash

Uses trigram CP prior components loaded at startup for the PMI-anchored spine path.

parameters: null
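The names BigramHash and TrigramHash suggest hashing the last two or three token ids into a fixed number of buckets that index a prior table. The sketch below shows only that hashing step; the multiplier, the modulus, and the reserved bucket 0 are illustrative assumptions, and how the PR builds or loads its prior components is not shown here.

```python
def ngram_buckets(tokens, n, num_buckets, mult=1000003):
    """Map the n tokens ending at each position to a bucket in [0, num_buckets).

    ASSUMPTION: bucket 0 is reserved for positions with fewer than n tokens of context.
    """
    assert num_buckets >= 2
    out = []
    for t in range(len(tokens)):
        if t + 1 < n:
            out.append(0)
            continue
        h = 0
        for tok in tokens[t - n + 1 : t + 1]:
            # Polynomial rolling hash over the ordered n-gram.
            h = (h * mult + tok + 1) % (2**61 - 1)
        out.append(1 + h % (num_buckets - 1))
    return out


bigram_ids = ngram_buckets([5, 7, 5, 7], n=2, num_buckets=4096)
trigram_ids = ngram_buckets([5, 7, 5, 7], n=3, num_buckets=4096)
```

Each bucket id depends only on tokens at or before its position, so a lookup keyed on it stays causal.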

Quantization

int4

bits: 4

scope: model export

Compression

zlib

level: null
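As a rough illustration of an int4 export followed by a zlib roundtrip, the sketch below quantizes one row of weights symmetrically to 4 bits, packs two values per byte, compresses, and restores. The per-row absmax scale, the nibble layout, and the compression level are assumptions; the PR does not state them (level is null above).

```python
import zlib


def quantize_int4(row):
    """Symmetric per-row quantization to integers in [-8, 7]. Returns (codes, scale)."""
    scale = max(abs(x) for x in row) / 7.0 or 1.0  # fall back to 1.0 for an all-zero row
    codes = [max(-8, min(7, round(x / scale))) for x in row]
    return codes, scale


def pack_nibbles(codes):
    """Pack two 4-bit two's-complement codes per byte, low nibble first."""
    vals = [c & 0xF for c in codes]
    if len(vals) % 2:
        vals.append(0)  # pad to an even count
    return bytes(vals[i] | (vals[i + 1] << 4) for i in range(0, len(vals), 2))


def unpack_nibbles(data, n):
    """Inverse of pack_nibbles; returns the first n signed codes."""
    out = []
    for byte in data:
        for nib in (byte & 0xF, byte >> 4):
            out.append(nib - 16 if nib >= 8 else nib)
    return out[:n]


weights = [0.7, -0.35, 0.0, 0.1, -0.7]
codes, scale = quantize_int4(weights)
blob = zlib.compress(pack_nibbles(codes), 9)
restored = unpack_nibbles(zlib.decompress(blob), len(codes))
dequantized = [c * scale for c in restored]
```

The reported artifact size of 15,895,414 bytes is below both 16,000,000 bytes and 16 × 2^20 = 16,777,216 bytes, so it fits the 16MB limit under either definition.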

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"row_sharded_local_muon":true,"stabilized_gram_ns_backend":true,"fifth_order_polar_express_coefficients":true,"safety_factor":1.05,"same_shape_batching":true,"adamw_for_embeddings_heads_scalars_aux":true}
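Muon's core step replaces a gradient (or momentum) matrix with an approximately orthogonal one via a Newton-Schulz iteration on the Gram matrix. The sketch below uses the textbook fifth-order coefficients (15/8, -10/8, 3/8), not the tuned Polar Express coefficients the PR names, and applies `safety_factor` as extra shrinkage on the initial normalization, which is an assumption about its role. Row sharding and same-shape batching are not shown.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def newton_schulz_orthogonalize(g, steps=12, safety_factor=1.05):
    """Approximate the orthogonal polar factor of g (rows >= cols, full column rank)."""
    fro = sum(v * v for row in g for v in row) ** 0.5
    # Normalize so every singular value is below 1; the safety factor adds headroom.
    x = [[v / (fro * safety_factor + 1e-12) for v in row] for row in g]
    a, b, c = 15 / 8, -10 / 8, 3 / 8  # textbook quintic, NOT the PR's coefficients
    for _ in range(steps):
        gram = matmul(transpose(x), x)  # X^T X
        gram2 = matmul(gram, gram)
        n = len(gram)
        poly = [
            [a * (i == j) + b * gram[i][j] + c * gram2[i][j] for j in range(n)]
            for i in range(n)
        ]
        x = matmul(x, poly)  # X <- X (a I + b A + c A^2), with A = X^T X
    return x


update = newton_schulz_orthogonalize([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
```

Each step maps every singular value s to (15s - 10s^3 + 3s^5)/8, which has a fixed point at 1 with zero first and second derivative there, so the singular values converge quickly toward 1.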

Other

other

PMI/SPINE basis projection adds a learned per-token channel anchored to a directional shifted-PMI low-rank basis.

parameters: null
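To make "directional shifted PMI" concrete, the sketch below computes PMI over ordered adjacent pairs (previous token, next token) and subtracts log(shift). Treating "directional" as ordered pairs and using adjacent-token co-occurrence are assumptions, as is the function name; the low-rank factorization and the learned per-token channel are not shown.

```python
import math
from collections import Counter


def shifted_pmi(tokens, shift=1.0):
    """Return {(prev, nxt): PMI(prev, nxt) - log(shift)} for observed ordered pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    left = Counter(tokens[:-1])   # counts of each token in the "prev" slot
    right = Counter(tokens[1:])   # counts of each token in the "nxt" slot
    total = len(tokens) - 1       # number of ordered pairs
    out = {}
    for (a, b), n_ab in pairs.items():
        # log[ p(a, b) / (p(a) p(b)) ] with probabilities taken as counts / total
        pmi = math.log(n_ab * total / (left[a] * right[b]))
        out[(a, b)] = pmi - math.log(shift)
    return out


table = shifted_pmi([0, 1, 0, 1, 2], shift=1.0)
```

Because pairs are ordered, the resulting matrix is generally asymmetric; a low-rank basis could then be obtained from it with, for example, a truncated SVD.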

other

Lag-interferometer residuals provide a causal pairwise past-lag mixer before local QKV.

parameters: null

other

A pointer-generator copy head is merged into the logit projection.

parameters: null
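A pointer-generator head typically mixes the vocabulary distribution with a copy distribution built from attention over earlier tokens. The sketch below folds that mixture back into logits by taking its log; the gate, the argument names, and the log-space merge are assumptions about what "merges into the logit projection" means and are not taken from the PR.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merge_copy_into_logits(vocab_logits, attn_scores, context_tokens, gate):
    """Return logits whose softmax is (1 - gate) * p_vocab + gate * p_copy."""
    p_vocab = softmax(vocab_logits)
    attn = softmax(attn_scores)  # attention weights over past positions
    p_copy = [0.0] * len(vocab_logits)
    for w, tok in zip(attn, context_tokens):
        p_copy[tok] += w  # scatter attention mass onto the token at each position
    mixed = [(1.0 - gate) * pv + gate * pc for pv, pc in zip(p_vocab, p_copy)]
    return [math.log(max(p, 1e-30)) for p in mixed]


merged = merge_copy_into_logits(
    vocab_logits=[0.0, 0.0, 0.0, 0.0],
    attn_scores=[1.0, 1.0],
    context_tokens=[2, 2],
    gate=0.5,
)
```

In this example both context positions hold token 2, so half of the probability mass is copied onto it and its merged probability is 0.5 + 0.5 × 0.25 = 0.625.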

Novel Contributions

Bifurcated shared-stem local-global language model
PMI/SPINE basis projection with directional shifted-PMI low-rank basis
Lag-interferometer residuals for causal past-lag mixing
Pointer-copy logit merge
Int4 QAT export with zlib roundtrip under 16MB
Muon-based optimizer taxonomy with row-sharded local Muon and stabilized Gram-NS backend