PR #929 (open)
Add record: 9L MLP3x LeakyReLU(0.5)² QAT Int6+zstd (val_bpb=1.1653)
by andreanjos
val_bpb: 1.1653
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.03 MB
Training Techniques

Architecture
- MLP3x: 3x MLP with 1536 hidden units
  parameters: {"layers":9,"hidden_size":1536,"mlp_mult":2}
- LeakyReLU: LeakyReLU(0.5)² activation in the MLP
  parameters: {"negative_slope":0.5}
- weight tying: tied input and output embeddings
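The LeakyReLU(0.5)² activation above squares a LeakyReLU with negative slope 0.5, giving x² on the positive branch and 0.25x² on the negative branch. A minimal sketch (function name hypothetical, not from the PR):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(negative_slope) followed by squaring, elementwise.

    Squaring makes the output non-negative, so the sign information
    must be carried by the surrounding linear layers.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```

For example, an input of -2.0 maps to (-1.0)² = 1.0, while +2.0 maps to 4.0.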
Quantization
- STE QAT (bits: 6, scope: all)
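In STE QAT, the forward pass sees weights snapped to the int6 grid while the backward pass treats the rounding op as the identity (the straight-through estimator); in PyTorch this is commonly written as `w + (fq(w) - w).detach()`. A hedged numpy sketch of the fake-quantize step, assuming symmetric per-tensor scaling (the PR's exact scheme is not shown):

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization: quantize then dequantize.

    Returns floats that lie on the int6 grid; during QAT the gradient
    is passed straight through this op as if it were the identity.
    """
    qmax = 2 ** (bits - 1) - 1              # 31 for int6
    scale = np.abs(w).max() / qmax          # per-tensor scale
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # snap to int6 grid
    return q * scale                        # back to float
```

Every output value is within half a quantization step of its input, and at most 2⁶ = 64 distinct levels survive per tensor.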
Compression
- zstd (level: 22)
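Int6 values need only 6 bits each, so four of them fit in three bytes before the artifact is handed to zstd at level 22 (on the command line, levels above 19 require `zstd --ultra -22`). The PR does not show its serialization, but a hypothetical bit-packing sketch looks like:

```python
def pack_int6(values):
    """Pack signed int6 values (-32..31) into bytes, MSB-first, 4 per 3 bytes."""
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        assert -32 <= v <= 31
        bits = (bits << 6) | (v & 0x3F)   # two's-complement 6-bit field
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)

def unpack_int6(data, count):
    """Inverse of pack_int6 for the first `count` values."""
    bits, nbits, vals = 0, 0, []
    for b in data:
        bits = (bits << 8) | b
        nbits += 8
        while nbits >= 6 and len(vals) < count:
            nbits -= 6
            u = (bits >> nbits) & 0x3F
            vals.append(u - 64 if u >= 32 else u)  # sign-extend 6 -> int
    return vals
```

The packed stream is then what a general-purpose compressor like zstd would operate on.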
Evaluation
- sliding window eval (stride: 64)
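Sliding-window evaluation slides a full-length context over the document in steps of 64 tokens and scores only the tokens new to each window, so nearly every token is predicted with close to the full context. A sketch, with `nll_per_token` standing in for the model (it returns one nat-valued negative log-likelihood per position; converting per-token bits to bits-per-byte would further divide by bytes per token):

```python
import math

def sliding_window_eval(tokens, nll_per_token, window=2048, stride=64):
    """Average bits per token using overlapping context windows.

    Each step scores only the `stride` newest tokens, but conditions
    them on up to `window` tokens of preceding context.
    """
    total_nats, total_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        chunk = tokens[max(0, end - window):end]   # context + new tokens
        nlls = nll_per_token(chunk)
        n_new = end - start                        # tokens not yet scored
        total_nats += sum(nlls[-n_new:])
        total_scored += n_new
    return total_nats / total_scored / math.log(2)  # nats -> bits
```

Every token is scored exactly once, so the result is comparable to a single-pass evaluation but with much longer effective context per token.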
Sequence Length
- train_length: 2048, eval_length: 2048
LR Schedule
- warmdown (warmdown_steps: 10000)
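"Warmdown" here is a final decay of the learning rate over the last 10,000 steps. The exact decay shape is not specified in the record; a sketch assuming the common linear-to-zero form:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=10000):
    """Constant LR, then linear decay to zero over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps   # 1 -> 0 across the warmdown
    return base_lr * max(frac, 0.0)
```

For a 20,000-step run at base LR 0.02, the LR stays at 0.02 through step 10,000 and reaches 0.01 at step 15,000.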
Optimizer
- Muon (weight_decay: null, momentum: null)
  other_params: {"backend_steps":10,"beta2":0.99,"grad_clip_norm":1,"scalar_lr":0.02}
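Muon's core step orthogonalizes each 2D momentum-averaged gradient with a few Newton-Schulz iterations before applying the update; `backend_steps: 10` above plausibly sets that iteration count (an assumption, not stated in the record). A numpy sketch using the quintic coefficients from the reference Muon implementation:

```python
import numpy as np

def newton_schulz_orth(G, steps=10, eps=1e-7):
    """Approximately orthogonalize G (drive its singular values toward 1).

    Frobenius normalization bounds all singular values by 1, after which
    the quintic iteration pushes them into a band around 1 without ever
    computing an SVD. Coefficients follow the reference Muon code.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)     # Frobenius norm >= spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The appeal of this backend is that it uses only matmuls, so it runs efficiently in low precision on accelerators.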
Novel Contributions
- Int6 quantization-aware training with STE fake-quantization
- zstd-22 compression of the final artifact
- Sliding window evaluation with stride 64
- Longer training sequence length of 2048
- Extended warmdown schedule and Muon optimizer tuning
- LeakyReLU(0.5)² MLP activation