PR #170
Record: Int6 QAT + SmearGate + Muon WD (val_bpb=1.1669)
by baudrillardsgh0st
val_bpb
1.1669
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.77 MB
Training Techniques
Quantization
STE QAT
bits: 6
scope: all weights
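The quantization entry above (6-bit STE fake quantization over all weights, with per-row symmetric scaling per the contributions list) can be sketched as follows. This is a minimal forward-pass illustration, not the PR's implementation; in training, the straight-through estimator would pass gradients through the round/clip unchanged.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """STE-style fake quantization with per-row symmetric scaling.

    Each row is scaled so its largest magnitude maps to the int6 extreme
    (qmax = 31), rounded to integers, then immediately dequantized
    ("fake" quantization: the forward pass sees quantized values, the
    weights stay in floating point).
    """
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.random.randn(4, 8).astype(np.float64)
wq = fake_quant_int6(w)
print(np.abs(w - wq).max())  # worst-case error is half a quantization step
```

Per-row (rather than per-tensor) scaling keeps a single outlier row from coarsening the grid for every other row.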
Compression
zstd
level: 22
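The compression entry works because int6 values stored in int8 containers use only 64 of the 256 possible byte values, leaving ~2 redundant bits per weight for an entropy coder to reclaim. A sketch of the idea, using stdlib `zlib` as a stand-in for zstd level 22 (the `zstandard` bindings are third-party):

```python
import zlib
import numpy as np

# int6 values in int8 containers: only 64 distinct byte values appear,
# so even uniform-random data compresses toward 6/8 = 0.75 of raw size.
rng = np.random.default_rng(0)
q = rng.integers(-32, 32, size=100_000, dtype=np.int8)

raw = q.tobytes()
packed = zlib.compress(raw, level=9)  # stand-in for zstd level 22
print(len(packed) / len(raw))         # ratio well below 1.0
```

Real quantized weights are far from uniform (they concentrate near zero), so the actual artifact compresses better than this worst case; zstd at level 22 also out-compresses zlib in practice.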
Architecture
SmearGate
Learned gate blending current and previous token embeddings to add cheap bigram context.
parameters: {"params":513}
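One plausible reading of the 513-parameter count is a scalar gate computed from a 512-dim projection plus bias (i.e. d_model = 512). The exact blend form is an assumption; a minimal sketch of a learned gate mixing each token's embedding with its predecessor's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x, w, b):
    """Blend each token embedding with the previous token's embedding.

    x: (seq, d_model) embeddings; w: (d_model,) and scalar b give
    513 parameters at d_model = 512. Position 0 has no predecessor,
    so its previous embedding is taken as zero. The additive blend
    (x + g * prev) is an assumed form.
    """
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])  # shift right one token
    g = sigmoid(x @ w + b)[:, None]                   # per-token scalar gate
    return x + g * prev                               # cheap bigram context

d = 512
x = np.random.randn(16, d).astype(np.float32)
w = np.zeros(d, dtype=np.float32)
b = np.float32(-10.0)        # gate ~0 at init: module starts near identity
y = smear_gate(x, w, b)
print(y.shape)
```

Initializing the gate closed (large negative bias) lets the model opt into bigram context only where it helps.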
tied embeddings
Input/output embeddings are tied, with fp16 passthrough to avoid compounding quantization errors.
parameters: null
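The fp16 passthrough can be pictured as an exclusion list in the quantization pass: the shared embedding matrix stays in fp16 while every other weight goes through the int6 fake-quantizer, so the tied matrix does not accumulate quantization error both at the input lookup and at the output logit projection. A hypothetical sketch (parameter names invented for illustration):

```python
import numpy as np

# Hypothetical parameter dict; "embed.weight" is the tied input/output matrix.
params = {
    "embed.weight": np.random.randn(1000, 64).astype(np.float16),
    "block0.mlp.weight": np.random.randn(64, 256).astype(np.float32),
}

PASSTHROUGH = {"embed.weight"}  # kept in fp16, skipped by the quantizer

def quantize_all(params, fake_quant):
    return {name: (w if name in PASSTHROUGH else fake_quant(w))
            for name, w in params.items()}

quantized = quantize_all(params, lambda w: np.round(w * 4) / 4)  # toy quantizer
print(quantized["embed.weight"].dtype)  # float16, untouched
```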
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: {"decoupled_weight_decay":true}
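"Decoupled" weight decay means the decay multiplies the weights directly, scaled by the learning rate, rather than being folded into the gradient before momentum and Muon's Newton-Schulz orthogonalization. A sketch of one step, with the Muon update direction stubbed out (computing it is beyond this fragment):

```python
import numpy as np

def step_with_decoupled_wd(w, update, lr=0.02, weight_decay=0.01):
    """One optimizer step with decoupled (AdamW-style) weight decay.

    `update` stands in for Muon's orthogonalized momentum direction;
    the decay term never passes through the momentum buffer or the
    orthogonalization, so its strength is independent of the update rule.
    """
    w = w * (1.0 - lr * weight_decay)  # decay applied directly to weights
    return w - lr * update

w = np.ones((2, 2))
w2 = step_with_decoupled_wd(w, update=np.zeros((2, 2)))
print(w2[0, 0])  # pure decay: 1 - 0.02 * 0.01
```

Coupled decay would instead add `weight_decay * w` to the gradient, where Muon's orthogonalization would largely normalize it away; decoupling keeps the shrinkage effective, which also nudges weights toward the quantization grid's well-covered region near zero.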
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":32}
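With stride 64, each evaluation window advances 64 tokens at a time and only the freshest 64 targets are scored, so every scored token conditions on up to a full training-length context; `batch_seqs: 32` suggests 32 such windows are batched per forward pass. A sketch of the window/scoring bookkeeping (the model call itself is omitted):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, n_scored) triples for sliding-window eval.

    Window [start:end) is fed to the model; only the last n_scored
    targets in it contribute to the loss, so each scored token (after
    the first window) sees close to `window` tokens of left context.
    Cost: roughly window/stride = 32 forward passes per window of text.
    """
    spans = []
    prev_end = 0
    for end in range(min(window, n_tokens), n_tokens + 1, stride):
        start = max(0, end - window)
        spans.append((start, end, end - prev_end))
        prev_end = end
    if prev_end < n_tokens:  # leftover tail shorter than one stride
        spans.append((max(0, n_tokens - window), n_tokens, n_tokens - prev_end))
    return spans

spans = sliding_windows(10, window=4, stride=2)
print(spans)  # every token scored exactly once
```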
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
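A "warmdown" schedule holds the learning rate and then decays it linearly to zero over the final steps; the PR specifies only `warmdown_steps: 3000`, so the constant phase before it (and the linear shape) are assumptions in this sketch:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Constant LR, then a linear 'warmdown' to zero over the last
    `warmdown_steps` steps. Shape of the pre-warmdown phase is assumed."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

print(warmdown_lr(0, 10_000, 0.02))       # full LR before the warmdown
print(warmdown_lr(10_000, 10_000, 0.02))  # zero at the final step
```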
Regularization
weight decay
parameters: {"value":0.01,"decoupled":true}
Novel Contributions
- Int6 QAT with STE fake quantization and per-row symmetric scaling
- Int6 values stored in int8 containers with zstd-22 compression
- SmearGate learned embedding-level bigram context
- Decoupled Muon weight decay for improved generalization and quantization robustness
- Sliding-window full-context evaluation
- FP16 tied embedding passthrough