PR #230
Record: Int6 + MLP 3x + NorMuon + SmearGate + BigramHash + OrthoInit + Sliding Window, val_bpb=1.1541
by MatthewHRockwell
val_bpb: 1.1541
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,992,610 bytes
Training Techniques
Quantization
int6
bits: 6
scope: per-row weights; tied embeddings kept in fp16
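A minimal sketch of per-row int6 quantization with fp16 scales, assuming symmetric rounding to the range [-31, 31]; the PR's actual bit-packing and rounding mode are not stated here:

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization with one fp16 scale per row."""
    qmax = 31  # signed int6 covers [-32, 31]; use the symmetric range [-31, 31]
    scale = (np.abs(w).max(axis=1, keepdims=True) / qmax).astype(np.float16)
    scale = np.maximum(scale, np.float16(1e-4))  # avoid divide-by-zero rows
    q = np.clip(np.rint(w / scale.astype(np.float32)), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 512)).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize_int6(q, s)
```

Per the record, tied embeddings are exempt and stay in fp16; only weight matrices get this treatment.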
Architecture
MLP3x
Expanded MLP hidden size to 3x model dimension to increase capacity.
parameters: {"hidden_dim":1536,"multiplier":3}
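The recorded config (hidden_dim 1536, multiplier 3) implies a model dimension of 512. A sketch of the expanded block; the activation function is an assumption, since the record does not name one:

```python
import numpy as np

d_model = 512           # implied by hidden_dim 1536 with multiplier 3
d_hidden = 3 * d_model  # 1536, as recorded

rng = np.random.default_rng(0)
w_in = (rng.standard_normal((d_model, d_hidden)) * 0.02).astype(np.float32)
w_out = (rng.standard_normal((d_hidden, d_model)) * 0.02).astype(np.float32)

def mlp(x):
    # ReLU is an assumption; the record does not specify the activation
    h = np.maximum(x @ w_in, 0.0)
    return h @ w_out

y = mlp(rng.standard_normal((2, d_model)).astype(np.float32))
```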
SmearGate
Learned gate blending each token embedding with the previous token embedding.
parameters: {"params":512}
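One way to realize a 512-parameter smear gate, assuming one sigmoid gate per channel and a convex blend (the exact blend form and initialization are assumptions):

```python
import numpy as np

d_model = 512
gate_logits = np.full(d_model, -2.0, dtype=np.float32)  # the 512 learned params

def smear_gate(x):
    """Blend each token's embedding with its predecessor's, one learned
    gate per channel. x: (seq, d_model); position 0 keeps its own embedding."""
    alpha = 1.0 / (1.0 + np.exp(-gate_logits))  # per-channel gate in (0, 1)
    prev = np.empty_like(x)
    prev[0] = x[0]
    prev[1:] = x[:-1]
    return (1.0 - alpha) * x + alpha * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((6, d_model)).astype(np.float32)
y = smear_gate(x)
```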
BigramHash
Hash-based embedding for token pairs to inject explicit bigram context.
parameters: {"buckets":4096,"dimension":64,"projected_dim":512}
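A sketch matching the recorded shapes (4096 buckets, 64-d table, projected to 512): hash each (previous token, token) pair into a bucket, look up a small embedding, and project it to model width. The hash function itself is an assumption:

```python
import numpy as np

BUCKETS, DIM, PROJ_DIM = 4096, 64, 512  # from the recorded parameters

rng = np.random.default_rng(0)
table = (rng.standard_normal((BUCKETS, DIM)) * 0.02).astype(np.float32)
proj = (rng.standard_normal((DIM, PROJ_DIM)) * 0.02).astype(np.float32)

def bigram_bucket(prev_tok, tok):
    # multiplicative-xor hash of the pair; the actual hash is an assumption
    return ((prev_tok * 1000003) ^ tok) % BUCKETS

def bigram_features(tokens, pad_token=0):
    """(seq, PROJ_DIM) features: one 64-d lookup per hashed bigram,
    projected to model width for addition to the token embeddings."""
    out = np.empty((len(tokens), PROJ_DIM), dtype=np.float32)
    prev = pad_token
    for i, tok in enumerate(tokens):
        out[i] = table[bigram_bucket(prev, tok)] @ proj
        prev = tok
    return out

feats = bigram_features([5, 17, 17, 42])
```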
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: {"decoupled_weight_decay":true,"normalized_newton_schulz":true}
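The core of Muon is orthogonalizing each update matrix with a Newton-Schulz iteration. A sketch of the normalized variant implied by `normalized_newton_schulz`, using the standard quintic coefficients; any further details of "NorMuon" beyond this are assumptions:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize an update matrix (the core of Muon).
    Dividing by the Frobenius norm bounds the singular values by 1, so the
    quintic iteration below is stable and drives them toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.astype(np.float32)
    X = X / (np.linalg.norm(X) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep the Gram matrix small: work with the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 16))
O = newton_schulz_orthogonalize(G)
sv = np.linalg.svd(O, compute_uv=False)  # singular values pulled toward 1
```

Decoupled weight decay (the `decoupled_weight_decay` flag) is then applied directly to the weights, outside this orthogonalized update.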
Initialization
OrthoInit
Orthogonal initialization with muP-style output projection scaling by 1/sqrt(2L).
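A sketch of orthogonal initialization via QR of a Gaussian matrix, with the muP-style 1/sqrt(2L) gain on output projections; the depth L used here is illustrative, since the record does not state it:

```python
import numpy as np

def orthogonal_init(rows, cols, rng, gain=1.0):
    """Orthogonal init via QR of a Gaussian matrix (the smaller dimension's
    vectors come out orthonormal), scaled by `gain`."""
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # sign fix: uniform over the orthogonal group
    if rows < cols:
        q = q.T
    return (gain * q).astype(np.float32)

n_layers = 12  # illustrative; the record does not state the depth
rng = np.random.default_rng(0)
# muP-style scaling: output projections get gain 1/sqrt(2L)
w_out = orthogonal_init(512, 512, rng, gain=1.0 / np.sqrt(2 * n_layers))
w_qkv = orthogonal_init(512, 512, rng)
```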
Evaluation
sliding window eval
parameters: {"stride":256,"context_length":2048}
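With stride 256 over 2048-token windows, each token is scored exactly once but with long left context. A sketch of the span bookkeeping, assuming the usual strided-perplexity scheme:

```python
def sliding_window_spans(n_tokens, context_length=2048, stride=256):
    """Spans for sliding-window evaluation: windows of up to `context_length`
    tokens advance by `stride`, and only tokens not scored by an earlier
    window contribute to the loss."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        spans.append((begin, end, prev_end))  # score tokens in [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(3000)
```

After the first window, every scored token sees at least `context_length - stride` = 1792 tokens of context, at the cost of roughly `context_length / stride` = 8x more forward passes than a non-overlapping split.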
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
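A sketch of the warmdown schedule: hold the learning rate flat, then decay linearly to zero over the final 3000 iterations. The total iteration count here is illustrative, since the record only gives `warmdown_iters`:

```python
def warmdown_lr(step, total_iters, warmdown_iters=3000, base_lr=1.0):
    """Constant LR, then linear decay to zero over the last warmdown_iters."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * max(0.0, (total_iters - step) / warmdown_iters)
```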
Regularization
weight decay
parameters: {"value":0.02,"decoupled":true}
Novel Contributions
- Int6 per-row quantization with fp16 scales and fp16 tied embeddings
- MLP hidden expansion to 3x model dimension enabled by quantization savings
- NorMuon / normalized Newton-Schulz optimization with decoupled weight decay
- SmearGate token blending with previous-token context
- BigramHash embedding for token pairs
- Orthogonal initialization with muP-scaled output projections
- Sliding window evaluation with stride 256 over 2048-token windows