PR #546 (closed)
Int5/Int6+Zstd+MLP3x: mean val_bpb=1.1752 (10L, seq4096, sliding window)
by shajalahamedcse
val_bpb: 1.1752
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,708,798 B
Training Techniques

Quantization: mixed int5/int6
- bits: 5, scope: MLP matrices
- bits: 6, scope: attention matrices
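A minimal sketch of the kind of symmetric integer quantization described above; per-tensor scaling, the rounding rule, and int8 storage are assumptions here, not the PR's exact code.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Quantize to signed `bits`-bit integers with one per-tensor scale.

    Sketch only: the PR may use per-channel scales or a different rounding rule.
    """
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = float(np.abs(w).max()) / qmax or 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# int5 for MLP matrices, int6 for attention matrices, per the PR description
# (matrix shapes below are illustrative).
q_mlp, s_mlp = quantize_symmetric(np.random.randn(1536, 512).astype(np.float32), bits=5)
q_attn, s_attn = quantize_symmetric(np.random.randn(512, 512).astype(np.float32), bits=6)
```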
Architecture: MLP3x
Expanded the MLP hidden size from 1024 to 1536 using the artifact-size savings from quantization.
parameters: {"hidden":1536,"baseline_hidden":1024}
Compression: zstd
level: null
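A sketch of the compression step using the python-zstandard bindings; the level here is a placeholder, since the PR leaves it unspecified (null).

```python
import numpy as np
import zstandard  # pip install zstandard

def compress_weights(q: np.ndarray, level: int = 19) -> bytes:
    """Zstd-compress a quantized integer array (level=19 is a placeholder)."""
    return zstandard.ZstdCompressor(level=level).compress(
        np.ascontiguousarray(q).tobytes()
    )

def decompress_weights(blob: bytes, dtype, shape) -> np.ndarray:
    raw = zstandard.ZstdDecompressor().decompress(blob)
    return np.frombuffer(raw, dtype=dtype).reshape(shape)
```

Storing int5/int6 values in int8 leaves the high bits unused; zstd's entropy coding can recover much of that headroom, so explicit bit-packing may not be required.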
Evaluation: sliding window eval
parameters: {"stride":64}
Sequence Length
train_length: 4096
eval_length: 4096
LR Schedule: warmdown
parameters: {"warmdown_iters":3600}
Optimizer: Muon
weight_decay: null
momentum: 0.95
other_params: {"matrix_lr":0.04}
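Muon takes an SGD-momentum step and orthogonalizes each 2-D update with a Newton-Schulz iteration before applying the matrix learning rate. A condensed sketch: the quintic coefficients follow the public Muon reference implementation; other details (Nesterov option, shape-dependent scaling) are omitted.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    tall = X.size(0) > X.size(1)
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

@torch.no_grad()
def muon_step(w, grad, momentum_buf, lr=0.04, momentum=0.95):
    """One Muon update for a 2-D weight; lr and momentum match the PR's
    matrix_lr=0.04 and momentum=0.95, everything else is a sketch."""
    momentum_buf.mul_(momentum).add_(grad)
    w.add_(newton_schulz5(momentum_buf), alpha=-lr)
```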
Initialization: Overtone init
Regularization: weight decay
parameters: null
Novel Contributions
- Int5 quantization for MLP matrices to free artifact space
- Int6 quantization for attention matrices
- Zstd compression of quantized integer arrays
- MLP3x expansion (hidden size 1024 → 1536) enabled by quantization savings
- Training on 4096-token sequences
- Sliding window evaluation with stride 64