PR #162
RECORD (closed)
Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)
by raahilshah
val_bpb
1.1458
Architecture
GPT
Optimizer
Muon
Artifact Size
15.86MB
Training Techniques
Quantization
int6
bits: 6
scope: MLP and attention weights; fp16 passthrough for tied embeddings and last-layer key projection
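A minimal sketch of per-row int6 quantization as described above: one fp16 scale per weight row, codes clipped to the signed 6-bit range. The symmetric absmax scheme and the [-31, 31] code range are assumptions; the record only specifies "per-row int6". Sensitive tensors (tied embeddings, last-layer key projection) would simply skip this path and stay in fp16.

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization: one fp16 scale per row,
    codes clipped to the signed 6-bit range [-31, 31] (assumed scheme)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    """Reconstruct approximate fp32 weights from codes and per-row scales."""
    return q.astype(np.float32) * scale.astype(np.float32)
```

In storage the int8 codes would be bit-packed 6 bits per value to realize the byte savings; the packing step is omitted here.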
Architecture
MLP3x
Increased the MLP hidden dimension from a 2x to a 3x expansion of the model width to add capacity.
parameters: {"hidden":1536}
SmearGate
Learned gate blending each token embedding with the previous token embedding to add lightweight bigram context.
parameters: {"params":512}
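One way to read the SmearGate description is a per-channel learned gate (512 parameters matching a 512-dim model) that convexly blends each position with its predecessor. The convex-blend form and the pass-through at position 0 are assumptions about the exact mechanism:

```python
import numpy as np

def smear_gate(x, g):
    """Per-channel gate mixing each embedding with its predecessor:
    out_t = (1 - sigmoid(g)) * x_t + sigmoid(g) * x_{t-1}.
    x: (seq, dim) embeddings; g: (dim,) learned gate logits.
    The convex-blend form is an assumption about the exact mechanism."""
    gate = 1.0 / (1.0 + np.exp(-g))     # sigmoid -> (0, 1) per channel
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                      # position 0 has no predecessor
    return (1.0 - gate) * x + gate * prev
```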
BigramHash
Hash-based bigram embedding table for adjacent token-pair context.
parameters: {"vocab_size":4096,"dim":128}
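The BigramHash lookup can be sketched as hashing each (previous, current) token pair into the 4096-entry table of 128-dim vectors; the specific multiplicative hash and the position-0 padding below are illustrative, not taken from the record:

```python
import numpy as np

def bigram_hash_embed(tokens, table, mult=1000003):
    """Hashed (previous, current) token-pair embedding per position.
    table: (4096, 128) learned vectors; the multiplicative hash is
    illustrative -- the record does not specify the hash function."""
    tokens = np.asarray(tokens, dtype=np.int64)
    prev = np.roll(tokens, 1)
    prev[0] = 0                                   # pad the pair at position 0
    idx = (prev * mult + tokens) % table.shape[0]
    return table[idx]                             # (seq, 128)
```

The resulting vectors would be added to (or concatenated with) the token embeddings to supply adjacent-pair context.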
Initialization
OrthoInit
Orthogonal initialization for large weight matrices with muP-style output scaling.
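A minimal sketch of orthogonal initialization with a muP-style multiplier: project a Gaussian matrix onto the nearest (semi-)orthogonal matrix via SVD, then scale. The exact muP rule used here (e.g. ~1/sqrt(fan_in) on output projections) is an assumption:

```python
import numpy as np

def ortho_init(fan_out, fan_in, out_scale=1.0, seed=0):
    """Orthogonal init: SVD-project a Gaussian matrix to a semi-orthogonal
    one, then apply a muP-style multiplier (exact scaling rule assumed)."""
    a = np.random.default_rng(seed).standard_normal((fan_out, fan_in))
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return (out_scale * (u @ vt)).astype(np.float32)
```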
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"adamw_weight_decay":0.01}
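The momentum warmup parameters above suggest ramping Muon's momentum from 0.92 to the final 0.99 over the first 1500 steps; linear interpolation is an assumption about the schedule shape:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm the Muon momentum from `start` to `end` over `warmup_steps`
    optimizer steps; the linear shape is an assumption."""
    t = min(step / warmup_steps, 1.0)
    return start + (end - start) * t
```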
Weight Averaging
SWA
parameters: {"start_frac":0.5,"every_steps":50}
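With start_frac=0.5 and every_steps=50, SWA keeps a running mean of the weights, starting halfway through training and refreshing every 50 optimizer steps. A minimal sketch (the incremental-mean formulation is a standard choice, not stated in the record):

```python
import numpy as np

class SWA:
    """Running mean of model weights, started at start_frac of training
    and refreshed every `every_steps` optimizer steps."""
    def __init__(self, total_steps, start_frac=0.5, every_steps=50):
        self.start = int(total_steps * start_frac)
        self.every = every_steps
        self.avg, self.n = None, 0

    def maybe_update(self, step, weights):
        if step < self.start or (step - self.start) % self.every:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [np.array(w, dtype=np.float64) for w in weights]
        else:
            for a, w in zip(self.avg, weights):
                a += (w - a) / self.n          # incremental running mean
```

Averaging also tends to land weights in flatter regions, which is consistent with the claim below that SWA improves quantization robustness.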
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":64}
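Stride-based eval with stride 64 presumably means the standard sliding-window scheme: advance the window by 64 tokens and score only the newest 64, so every token is scored exactly once with long left context. A sketch of the window plan under that assumption:

```python
def stride_eval_windows(n_tokens, ctx_len, stride=64):
    """Sliding-window evaluation plan: advance by `stride`, score only
    the newest `stride` tokens of each window, and keep up to ctx_len
    tokens of left context (standard scheme; assumed here)."""
    windows, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - ctx_len)
        windows.append((start, end, end - pos))  # (ctx start, end, #scored)
        pos = end
    return windows
```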
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
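A warmdown schedule holds the base learning rate and then decays it over the final warmdown_iters=3000 steps; the linear decay-to-zero shape below is an assumption:

```python
def lr_warmdown(step, total_iters, base_lr, warmdown_iters=3000):
    """Hold base_lr, then decay linearly to zero over the final
    warmdown_iters steps (the linear shape is an assumption)."""
    steps_left = total_iters - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * max(steps_left, 0) / warmdown_iters
```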
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}
Novel Contributions
- Per-row int6 quantization of MLP and attention weights with fp16 passthrough for sensitive components
- 3x MLP expansion enabled by int6 byte savings
- SmearGate for blending current and previous token embeddings
- BigramHash embedding for token-pair context
- Orthogonal initialization with muP-style scaling
- Muon optimizer with momentum warmup and weight decay
- Stochastic Weight Averaging to smooth weights and improve quantization