PR #85
Open · Record (pending): 92-experiment autoresearch + sliding window eval, pre-quant val_bpb=1.2156
by hydeh3r3
val_bpb: 1.2156
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
RoPE
RoPE extrapolation: trained at context length 1024 and evaluated at 2048 by using a larger RoPE base.
parameters: {"rope_base":50000}
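A minimal sketch of the idea: a larger RoPE base stretches the longest rotary wavelengths, so positions past the 1024-token training length still map to angles the model has seen. The `rope_base` of 50000 comes from the parameters above; the head dimension (64), the comparison base of 10000, and the helper name are illustrative assumptions, not the PR's code.

```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float) -> np.ndarray:
    """Per-pair inverse frequencies for rotary position embeddings."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

default = rope_inv_freq(64, 10_000.0)       # common default base
extrapolated = rope_inv_freq(64, 50_000.0)  # this PR's larger base

# The lowest frequency (longest wavelength) shrinks with a larger base,
# so a 2048-token eval context stays inside the trained angle range.
assert extrapolated[-1] < default[-1]
```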
tied embeddings
Input and output embeddings are tied.
parameters: null
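Weight tying reuses one matrix as both the input embedding table and the output projection, halving the embedding parameter count. A minimal NumPy sketch (vocab size, model width, and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 100, 16

# One shared matrix serves as both embedding table and output head.
W_embed = rng.standard_normal((vocab, d_model))

token_ids = np.array([3, 17, 42])
h = W_embed[token_ids]   # input: embedding lookup
logits = h @ W_embed.T   # output: same weights, transposed

assert logits.shape == (3, vocab)
```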
MLP2x
Uses a 2x MLP width configuration.
parameters: {"mlp_multiplier":2}
relu²
Uses squared ReLU activation for cleaner int8 quantization.
parameters: null
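Squared ReLU is simply `max(x, 0) ** 2`; the hard zero on the negative half means a large fraction of activations quantize exactly to int8 zero, which is the "cleaner quantization" motivation stated above. A one-line sketch:

```python
import numpy as np

def relu_squared(x: np.ndarray) -> np.ndarray:
    """Squared ReLU: max(x, 0) ** 2, exactly zero on the negative half."""
    return np.square(np.maximum(x, 0.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
assert relu_squared(x).tolist() == [0.0, 0.0, 0.0, 0.25, 4.0]
```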
RMSNorm
Uses plain RMSNorm instead of WeightedRMSNorm for better quantization behavior.
parameters: null
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"grad_clip_norm":1,"momentum_warmup_start":0.92,"momentum_warmup_end":0.99,"momentum_warmup_steps":1500}
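The momentum warmup parameters above (0.92 → 0.99 over 1500 steps) can be sketched as a simple schedule. Linear interpolation is an assumption here; the PR's exact interpolation curve is not shown:

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Linearly warm momentum from `start` to `end`, then hold (assumed linear)."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)

assert muon_momentum(0) == 0.92
assert muon_momentum(1500) == 0.99
```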
AdamW
weight_decay: 0.02
momentum: null
other_params: {"scope":"embeddings"}
Compression
zlib
level: 9
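To illustrate how int8 quantization and zlib fit together: round-trip a weight tensor through symmetric per-tensor int8 (a common scheme; the PR's exact quantization recipe is an assumption here) and compress the bytes at the level listed above.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)

# Symmetric per-tensor int8: one float scale, values in [-127, 127].
scale = float(np.abs(weights).max()) / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# zlib level 9: maximum compression, as configured above.
compressed = zlib.compress(q.tobytes(), level=9)

# Decode path: decompress, then dequantize with the stored scale.
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8) * scale
assert np.max(np.abs(restored - weights)) <= scale  # bounded rounding error
```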
Evaluation
sliding window eval
parameters: {"stride":64}
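Sliding window evaluation slides the context window across a long sequence in small steps and scores only the trailing tokens of each window, so every token is predicted with close to a full window of left context. A sketch of the window bookkeeping, assuming the last `stride` tokens of each window are scored (the PR's exact accounting may differ); the stride of 64 comes from the parameters above:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (start, end, score_from) spans for sliding-window eval.

    The first window scores all its tokens; each later window scores
    only its final `stride` tokens, so scored regions tile the sequence.
    """
    spans, start = [], 0
    while True:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(4096, window=2048, stride=64)
# Scored regions cover every token exactly once:
assert sum(end - sf for _, end, sf in spans) == 4096
```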
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":5}
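The schedule parameters above (5 warmup steps, 3000 warmdown iterations) suggest a trapezoidal LR multiplier: short linear warmup, flat plateau, linear warmdown to zero. The trapezoid shape is an assumption; only the two step counts come from the PR:

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 5, warmdown_iters: int = 3000) -> float:
    """Warmup -> plateau -> linear warmdown LR multiplier (trapezoid assumed)."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # short linear warmup
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)  # warmdown
    return 1.0                                    # plateau

assert lr_multiplier(0, 10_000) == 1 / 5
assert lr_multiplier(5_000, 10_000) == 1.0
assert lr_multiplier(10_000, 10_000) == 0.0
```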
Regularization
weight decay
parameters: {"weight_decay":0.02}
Other
other
An autoresearch loop ran 92+ automated local experiments and selected the best-performing configuration.
parameters: {"experiments":92}
other
Training on validation shard was enabled.
parameters: {"train_on_val":1}
Novel Contributions
- 92+ automated local experiments via an autoresearch loop
- Sliding window evaluation with stride 64
- RoPE extrapolation to evaluate at 2048 context length
- relu² activation and plain RMSNorm chosen for cleaner int8 quantization
- Muon optimizer with separate AdamW for embeddings
- Training on validation shard to improve score
- Quantization pipeline using int8 with zlib compression