val_bpb: 1.1216
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.15 MB

Training Techniques

  Quantization
    late QAT (bits: 6, scope: all)
    QAT (bits: 6, scope: all)
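As a sketch of the 6-bit quantize-dequantize step that QAT inserts into the forward pass (symmetric per-tensor scaling is an assumption, and `fake_quant` is an illustrative name, not the run's code):

```python
def fake_quant(weights, bits=6):
    """Quantize-dequantize ("fake quant"): round each weight onto one of
    2**bits uniform levels, then map back to float, so training sees
    int6-like rounding error while gradients still flow in float."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid zero scale
    return [round(w / scale) * scale for w in weights]

quantized = fake_quant([0.8, -0.31, 0.02, -1.0])
```

"late QAT" would enable this only for the final portion of training, after the float model has mostly converged.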

  Weight Averaging
    EMA (decay: 0.997)
    SWA (start_step: null, every_steps: 50)
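A minimal sketch of the two averaging schemes, assuming their standard definitions (EMA with decay 0.997 per the card; SWA as a plain mean over snapshots taken every 50 steps); class and function names are illustrative:

```python
class EMA:
    """Exponential moving average of model weights."""
    def __init__(self, weights, decay=0.997):
        self.decay = decay
        self.shadow = list(weights)  # averaged copy, used for eval

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * w
                       for s, w in zip(self.shadow, weights)]

def swa_average(snapshots):
    """SWA: arithmetic mean over periodic weight snapshots."""
    n = len(snapshots)
    return [sum(ws) / n for ws in zip(*snapshots)]
```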

  Compression
    lzma (level: null)
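The 16.15 MB artifact size suggests the checkpoint is measured after lzma compression. A sketch using Python's stdlib `lzma` (the `preset` value is an assumption, since the card leaves the level null, and `compress_artifact` is an illustrative name):

```python
import lzma
import pickle

def compress_artifact(obj, preset=9):
    """Serialize a checkpoint-like object and LZMA-compress it."""
    return lzma.compress(pickle.dumps(obj), preset=preset)

def load_artifact(blob):
    """Inverse: decompress and deserialize."""
    return pickle.loads(lzma.decompress(blob))
```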

  Evaluation
    sliding window eval (stride: 64)
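A sketch of sliding-window evaluation with stride 64, assuming the common convention that the window advances by the stride and only the newly covered tokens are scored, so each token is scored exactly once with up to window - stride tokens of left context (window size 2048 would match eval_length):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (window_start, score_from): the window covers
    [window_start, window_start + window), and only tokens from
    score_from onward are scored, so every token is scored exactly once."""
    start = 0
    while True:
        score_from = start if start == 0 else start + window - stride
        yield start, score_from
        if start + window >= n_tokens:
            break
        start += stride
```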

Architecture

  GQA: grouped query attention with fewer KV heads than query heads (query_heads: 8, kv_heads: 4)
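With 8 query heads sharing 4 KV heads, query heads pair off in groups of two, halving the KV cache relative to full multi-head attention. A sketch of the head mapping (function name illustrative):

```python
def gqa_kv_map(query_heads=8, kv_heads=4):
    """Map each query head to the KV head it shares. With 8 query heads
    over 4 KV heads, heads pair up in groups of 2, so the KV cache holds
    kv_heads / query_heads = half the usual keys and values."""
    assert query_heads % kv_heads == 0
    group_size = query_heads // kv_heads
    return [q // group_size for q in range(query_heads)]
```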
  ReLU²: squared ReLU activation
  LeakyReLU: leaky ReLU activation (slope: 0.5)
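Both activations have simple closed forms; slope 0.5 for LeakyReLU is unusually large but is taken directly from the card. A scalar sketch:

```python
def relu_squared(x):
    """ReLU²: max(x, 0) squared; zero for negatives, quadratic for positives."""
    return max(x, 0.0) ** 2

def leaky_relu(x, slope=0.5):
    """Leaky ReLU: identity for x >= 0, slope * x otherwise."""
    return x if x >= 0.0 else slope * x
```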
  XSA: XSA applied to the last layers (layers: 4)
  Partial RoPE: rotary position embeddings applied to a subset of dimensions (dimensions: 16, base_dimensions: 64)
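A sketch of partial RoPE: only the first 16 of 64 head dimensions are rotated, and the rest pass through unchanged. The per-pair frequency formula (base 10000, normalized by the rotated width) is an assumption; the run may use different frequencies:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate pairs (2i, 2i+1) in the first rot_dims dimensions by angle
    pos * base**(-2i / rot_dims); leave dimensions rot_dims..end untouched."""
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```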
  LN Scale: LayerNorm scale modification
  BigramHash: bigram hash embedding feature (vocab_size: 2048, dim: 128)
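A sketch of the bigram-hash feature: each (previous, current) token pair is hashed into a 2048-entry auxiliary vocabulary, whose 128-dim embedding would be added to the regular token embedding. The mixing constant and the BOS placeholder id are illustrative assumptions:

```python
def bigram_hash_ids(token_ids, vocab_size=2048):
    """Hash consecutive token pairs into ids for an auxiliary
    [vocab_size, 128] embedding table."""
    ids = []
    prev = 0  # assumed BOS placeholder id
    for cur in token_ids:
        ids.append((prev * 1000003 + cur) % vocab_size)  # illustrative hash
        prev = cur
    return ids
```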
  SmearGate: SmearGate gating mechanism

Sequence Length
  train_length: 2048
  eval_length: 2048

Optimizer
  Muon (weight_decay: 0.04, momentum: null)
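Muon's defining step is orthogonalizing the momentum matrix with a Newton-Schulz iteration before applying it as the update. A sketch of that step only, assuming the coefficients of the public Muon reference implementation; momentum accumulation and the weight_decay: 0.04 term would be applied outside this function:

```python
import numpy as np

def orthogonalize(G, steps=5):
    """Quintic Newton-Schulz iteration that pushes the singular values
    of G toward 1, yielding a near-orthogonal update direction."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```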

LR Schedule
  warmdown (warmdown_steps: null)
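A sketch of a warmdown schedule, assuming the common form: constant LR, then linear decay to zero over the final warmdown_steps (left unspecified in the card, so it must be passed explicitly):

```python
def warmdown_lr(step, total_steps, warmdown_steps, base_lr=1.0):
    """Constant base_lr until total_steps - warmdown_steps, then
    linear decay to 0 at total_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```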

Regularization
  LN scale

Novel Contributions
- Quantization Noise Annealing (QNA): injects int6-like quantization noise into the weights during training
- Stochastic Quantized Weight Averaging (SQWA): averages quantize-dequantize EMA snapshots of the weights
- Controlled 3-run ablation showing a reduced quantization gap but no improvement in final val_bpb
- Analysis indicating that float-model quality, not quantization error, is the main bottleneck
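The contribution list is all the card states about QNA; as a loose sketch of what its noise injection could look like, where the uniform noise form, per-tensor scale, and linear anneal are all assumptions rather than the authors' code:

```python
import random

def qna_noise(weights, step, total_steps, bits=6, seed=None):
    """Add uniform noise with magnitude up to half an int6 quantization
    step, scaled down linearly ("annealed") as training progresses."""
    rng = random.Random(seed)
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax  # per-tensor step size
    half = 0.5 * scale * (1.0 - step / total_steps)  # annealed noise bound
    return [w + rng.uniform(-half, half) for w in weights]
```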