PR #297
openLate STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22)
by davidpuertolas
val_bpb
1.1629
Architecture
GPT-style Transformer
Optimizer
Muon + AdamW
Artifact Size
15,948,643 bytes
Training Techniques
Quantization
STE QAT
bits: 6
scope: MLP and attention weight matrices; the full model is quantized in the final artifact
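A minimal sketch of int6 per-row fake quantization as used in STE QAT. The forward pass rounds weights to a 6-bit grid; in training, the backward pass would use the straight-through estimator (gradients pass through as if this op were the identity). The exact rounding and scale convention in the PR is not shown here, so this is illustrative only.

```python
import numpy as np

def fake_quant_int6_per_row(w):
    """Per-row symmetric fake quantization to a signed 6-bit grid.

    Forward: round to the int6 grid and dequantize. Under STE QAT the
    backward pass treats this op as the identity, so gradients are not
    blocked by the rounding. (Illustrative; not the PR's exact scheme.)
    """
    qmax = 31  # symmetric 6-bit grid: codes in [-31, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer codes
    return q * scale                                # dequantized weights

w = np.array([[0.5, -1.0, 0.25],
              [2.0,  0.0, -2.0]])
wq = fake_quant_int6_per_row(w)
```

Per-row scales keep the quantization error proportional to each row's magnitude, which is why rows whose extremes land exactly on the grid (like the second row above) round-trip losslessly.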
Architecture
MLP3x
Expanded feed-forward network width to 3x the model dimension.
parameters: {"hidden":1536,"model_dim":512,"layers":9}
SmearGate
Learned gate blending current token embedding with previous token embedding for cheap bigram-like signal.
parameters: null
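A sketch of the SmearGate idea under simple assumptions: a single learned scalar gate (sigmoid-squashed) blends each token's embedding with its predecessor's. The PR may use a per-channel or per-head gate; the names and shapes here are illustrative.

```python
import numpy as np

def smear_gate(x, alpha):
    """Blend each token embedding with the previous token's embedding.

    x: (seq, dim) embeddings; alpha: raw learned gate parameter, squashed
    through a sigmoid. Position 0 has no predecessor and is left unchanged.
    (Illustrative sketch, not the PR's actual module.)
    """
    g = 1.0 / (1.0 + np.exp(-alpha))  # sigmoid gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                    # no previous token at position 0
    return (1 - g) * x + g * prev

x = np.arange(8.0).reshape(4, 2)
y = smear_gate(x, alpha=0.0)  # sigmoid(0) = 0.5, an equal blend
```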
BigramHash
Hashed bigram embedding path keyed by adjacent token pairs.
parameters: {"buckets":4096,"dim":128}
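The bigram path can be sketched as a hash of each adjacent token pair into one of 4096 buckets, each bucket owning a 128-dim embedding. The mixing constant below is an assumption; the PR's exact hash is not specified.

```python
import numpy as np

def bigram_hash_embed(tokens, table, buckets=4096):
    """Look up one embedding per adjacent token pair via a hash bucket.

    tokens: (seq,) int token ids; table: (buckets, dim) embedding matrix.
    The multiplier 1000003 is an illustrative mixing constant.
    """
    t = np.asarray(tokens, dtype=np.int64)
    prev = np.roll(t, 1)
    prev[0] = 0                                # pad the first position
    h = (prev * 1000003 + t) % buckets         # cheap bigram hash
    return table[h]                            # (seq, dim)

rng = np.random.default_rng(0)
table = rng.standard_normal((4096, 128))
emb = bigram_hash_embed([5, 7, 5, 7], table)
```

Identical bigrams hash to the same bucket, so repeated pairs (here `(5, 7)` at positions 1 and 3) receive identical embeddings; collisions between distinct pairs are accepted as noise.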
Initialization
OrthoInit
Orthogonal initialization with Overtone-style / muP-style scaling.
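A standard orthogonal initializer, sketched below: QR-decompose a Gaussian matrix, fix column signs for uniqueness, and apply a gain (which a muP-style scheme would derive from fan-in). The gain rule the PR actually uses is not shown here.

```python
import numpy as np

def ortho_init(out_dim, in_dim, gain=1.0, rng=None):
    """Orthogonal initialization via QR decomposition of a Gaussian matrix.

    The sign fix makes the decomposition unique; `gain` stands in for an
    Overtone/muP-style scale factor (assumed, not the PR's exact rule).
    """
    rng = rng or np.random.default_rng()
    a = rng.standard_normal((out_dim, in_dim))
    q, r = np.linalg.qr(a if out_dim >= in_dim else a.T)
    q *= np.sign(np.diag(r))      # canonicalize column signs
    if out_dim < in_dim:
        q = q.T                   # restore (out_dim, in_dim) shape
    return gain * q

w = ortho_init(64, 32, rng=np.random.default_rng(1))
```

For a tall matrix the columns come out orthonormal (`w.T @ w` is the identity); for a wide matrix the rows do.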
Weight Averaging
SWA
parameters: {"start_frac":0.5,"every":200}
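With `start_frac: 0.5` and `every: 200`, SWA takes a snapshot every 200 steps starting halfway through training and keeps a running equal-weight average. A minimal sketch on flat float lists (a real run would average torch tensors):

```python
def swa_schedule(total_steps, start_frac=0.5, every=200):
    """Steps at which a weight snapshot joins the running SWA average."""
    start = int(total_steps * start_frac)
    return [s for s in range(start, total_steps + 1) if (s - start) % every == 0]

class SWA:
    """Running equal-weight average of model snapshots (lists of floats)."""
    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # incremental mean: avg += (p - avg) / n
            self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, params)]

swa = SWA()
for snapshot in ([1.0], [3.0], [5.0]):
    swa.update(snapshot)
```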
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.02,"tied_embed_lr":0.03}
AdamW
weight_decay: 0.01
momentum: null
other_params: null
Compression
zstd
level: 22
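To fit the artifact cap, int6 codes must be bit-packed (four 6-bit codes per three bytes) before entropy coding. The packing layout below is illustrative, not the PR's exact format; the packed bytes would then be compressed with zstd at level 22 (e.g. via the `zstandard` package's `ZstdCompressor(level=22)`).

```python
def pack_int6(codes):
    """Pack signed int6 codes (-32..31) into a compact bitstream."""
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        bits = (bits << 6) | ((c + 32) & 0x3F)  # offset-binary 6-bit code
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)

def unpack_int6(data, n):
    """Recover the first n signed int6 codes from a packed bitstream."""
    bits, nbits, codes = 0, 0, []
    for b in data:
        bits = (bits << 8) | b
        nbits += 8
        while nbits >= 6 and len(codes) < n:
            nbits -= 6
            codes.append(((bits >> nbits) & 0x3F) - 32)
    return codes

codes = [-32, -1, 0, 31, 7]
packed = pack_int6(codes)
```

Packing alone gives a fixed 25% saving over int8; zstd then exploits any remaining statistical redundancy in the code stream.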
Evaluation
sliding window eval
parameters: {"stride":64}
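Sliding-window evaluation with stride 64 can be sketched as follows: each window advances by the stride, and only the newly uncovered positions are scored, so every token past the first window is evaluated with near-full left context. Window size 2048 is assumed from the training length; the PR's eval length is not stated.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans for sliding-window eval.

    Each window spans [start, end); only tokens in [score_from, end) are
    scored, so no token is counted twice and each (after the first window)
    sees close to `window` tokens of left context.
    """
    spans, scored_to, start = [], 0, 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans

spans = sliding_windows(2176)
```

The cost is one forward pass per stride-sized chunk rather than per window-sized chunk, trading compute for a tighter bpb estimate.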
Test-Time Training
full TTT
parameters: {"learning_rate":0.0003,"momentum":0.95}
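"Full TTT" here means adapting every parameter at test time with plain SGD plus momentum (lr 3e-4, momentum 0.95), rather than restricting updates to a low-rank adapter. A pure-Python sketch of the update rule on flat float lists (a real run would update torch tensors in place):

```python
def sgd_momentum_step(params, grads, bufs, lr=3e-4, momentum=0.95):
    """One SGD-with-momentum update applied to every parameter."""
    new_params, new_bufs = [], []
    for p, g, b in zip(params, grads, bufs):
        b = momentum * b + g          # update momentum buffer
        new_bufs.append(b)
        new_params.append(p - lr * b)
    return new_params, new_bufs

params = [1.0, -2.0]
bufs = [0.0, 0.0]
params, bufs = sgd_momentum_step(params, [0.5, -0.5], bufs)
```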
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"qat_start_frac":0.85,"qat_lr_factor":0.5,"warmdown_iters":3000}
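The schedule parameters combine two things: a linear warmdown to zero over the last 3000 iterations, and a late-QAT switch at 85% of training that also halves the learning rate. A sketch under the assumption that the base LR is otherwise constant (the PR's warmup, if any, is not shown):

```python
def lr_and_qat(step, total_steps, base_lr, warmdown_iters=3000,
               qat_start_frac=0.85, qat_lr_factor=0.5):
    """LR and QAT flag for a given step.

    Constant LR, then linear warmdown to 0 over the final `warmdown_iters`
    steps; once past `qat_start_frac` of training, QAT turns on and the LR
    is additionally scaled by `qat_lr_factor`. (Illustrative combination.)
    """
    lr = base_lr
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        lr = base_lr * (total_steps - step) / warmdown_iters
    qat_on = step >= qat_start_frac * total_steps
    if qat_on:
        lr *= qat_lr_factor
    return lr, qat_on

lr0, qat0 = lr_and_qat(0, 10000, 0.025)
lr1, qat1 = lr_and_qat(8500, 10000, 0.025)
```

Halving the LR once fake-quant noise enters the gradients is a common stabilizer: the rounding in the forward pass adds variance that a full-size step would amplify.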
Regularization
weight decay
parameters: {"muon_weight_decay":0.038,"adamw_weight_decay":0.01}
Novel Contributions
- Late STE QAT, activated only in the last ~15% of wall-clock time, so that most of training proceeds free of quantization noise.
- Int6 per-row quantization with zstd level 22 compression to fit under the 16MB artifact cap.
- 3x MLP expansion (hidden size 1536) combined with SmearGate and BigramHash architectural additions.
- Orthogonal / Overtone-style initialization for large matrices.
- SWA over the second half of warmdown before quantization.
- Full-model SGD test-time training instead of LoRA TTT.