val_bpb: 1.1732
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15846677 bytes (~15.1 MiB)
Training Techniques
Architecture: tied embeddings
Uses tied input/output embeddings, with fp16 passthrough for the shared embedding/output-head tensor.
parameters: null
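Tying means a single matrix serves both as the token-embedding table and, transposed, as the output head, so that parameter block is stored once (here in fp16). A minimal NumPy sketch; the class and method names are illustrative, not the submission's code:

```python
import numpy as np

class TinyTiedLM:
    """Sketch of tied input/output embeddings (illustrative only)."""

    def __init__(self, vocab, dim, seed=0):
        rng = np.random.default_rng(seed)
        # fp16 passthrough: the shared tensor is kept in half precision
        self.W = rng.standard_normal((vocab, dim)).astype(np.float16)

    def embed(self, token_ids):
        # input side: row lookup into the shared matrix -> (T, dim)
        return self.W[token_ids]

    def logits(self, hidden):
        # output head reuses the same tensor, transposed -> (T, vocab)
        return hidden @ self.W.T

model = TinyTiedLM(vocab=256, dim=32)
h = model.embed([1, 2, 3])
print(h.shape, model.logits(h).shape)
```

Because both directions share `self.W`, any gradient update to the embedding also moves the output head, which is the point of tying.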
Quantization: mixed int6/int8
bits: 8 (default; overridden to 6 for the forced blocks)
scope: all weights by default, with middle blocks 3, 4, 5, 6 forced to int6; embeddings and the LM head are kept in fp16
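The mixed policy can be sketched as symmetric per-tensor quantization plus a per-block bit-width rule. The rounding scheme and helper names below are a plausible reconstruction, not the submission's actual exporter:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization sketch (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8, 31 for int6
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                            # int6 values fit in int8 storage

def bits_for_block(i, forced_int6=(3, 4, 5, 6)):
    # mixed export policy: middle blocks forced to int6, others int8;
    # embeddings and LM head would be skipped entirely (kept fp16)
    return 6 if i in forced_int6 else 8

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize(w, bits_for_block(4))
print(q.dtype, float(np.abs(w - q * s).max()))
```

Dequantization is just `q * s`; the round-trip error is bounded by half the scale, which is why int6 on only the middle blocks trades a little fidelity for a smaller artifact.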
Optimizer: Muon
weight_decay: 0.02
momentum: 0.99 (warmed up from 0.92 over the first 1500 steps)
other_params: matrix_lr 0.02, scalar_lr 0.02
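The momentum settings imply a warmup from 0.92 to the final 0.99 over the first 1500 steps. A linear ramp is assumed below, since the card does not specify the interpolation:

```python
def momentum_at(step, start=0.92, final=0.99, warmup_steps=1500):
    """Assumed linear momentum warmup matching the reported parameters."""
    if step >= warmup_steps:
        return final
    # linear interpolation from start to final over warmup_steps
    return start + (step / warmup_steps) * (final - start)

print(momentum_at(0), momentum_at(750), momentum_at(1500))
```

Warming momentum up rather than starting at 0.99 keeps early updates from being dominated by noisy initial gradients.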
Initialization: spectral init, resid mix
spectral init: spectral embedding initialization
resid mix: phase residual mixing initialization
Evaluation: sliding window eval
parameters: context_length 1024, stride 64
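With context_length 1024 and stride 64, each evaluation window after the first scores only its 64 newest tokens, so almost every token is predicted with near-full left context. A sketch of the window bookkeeping (the model's scoring itself is omitted, and the function name is illustrative):

```python
def sliding_windows(n_tokens, context_length=1024, stride=64):
    """Enumerate (start, end, n_scored) spans for strided evaluation."""
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + context_length, n_tokens)
        # only tokens not covered by a previous window are scored here
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(1200)
print(len(spans), sum(n for _, _, n in spans))
```

Every token is scored exactly once: the first window scores its full 1024 tokens, later windows only their final stride-sized slice.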
Sequence Length
sequence_length
train_length: null
eval_length: 1024
LR Schedule: warmdown
parameters: warmdown_iters 3000
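A warmdown with warmdown_iters 3000 suggests a trapezoidal schedule: hold the learning rate flat, then decay to zero over the final 3000 iterations. The linear decay shape is an assumption:

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    """Assumed trapezoidal schedule: flat, then linear warmdown to 0."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return 1.0                                  # constant phase
    # linear decay over the last warmdown_iters steps
    return max(0.0, (total_steps - step) / warmdown_iters)

print(lr_scale(0, 10000), lr_scale(8500, 10000), lr_scale(10000, 10000))
```

The returned value is a multiplier on the base learning rates (matrix_lr/scalar_lr above).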
Compression: zlib
level: null
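zlib is a lossless byte-level compressor, so the exported weights round-trip exactly; only the artifact size changes. A minimal sketch, where reading the null level as zlib's library default (-1, equivalent to level 6) is an assumption:

```python
import zlib

def compress_artifact(raw: bytes, level: int = -1) -> bytes:
    # level -1 selects zlib's default; the card's "level: null" is
    # assumed to mean this default rather than a specific setting
    return zlib.compress(raw, level)

payload = bytes(range(256)) * 1000          # stand-in for serialized weights
packed = compress_artifact(payload)
print(len(payload), "->", len(packed))
```

Since the quantized int6/int8 tensors have lower entropy than raw fp16, this is where the export recovers extra room under the size cap.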
Regularization: weight decay
parameters: value 0.02
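In the common decoupled (AdamW-style) form, a weight decay of 0.02 shrinks each weight toward zero by lr * wd per step, separately from the gradient update; whether Muon applies it in exactly this form is an assumption:

```python
def apply_weight_decay(weights, lr, wd=0.02):
    # decoupled decay sketch: scale by (1 - lr * wd) each step,
    # independent of the gradient term (assumed AdamW-style form)
    return [w * (1.0 - lr * wd) for w in weights]

print(apply_weight_decay([1.0, -2.0], lr=0.02))
```

With lr 0.02 and wd 0.02 the per-step shrink factor is 0.9996, i.e. a gentle pull toward zero.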
Novel Contributions
- Improved on the prior valid mid6 run by lowering TIED_EMBED_LR from 0.10 to 0.08.
- Kept the 10-layer sliding-window family recipe with 1024/64 sliding evaluation.
- Used a mixed export policy: only middle blocks 3, 4, 5, 6 are forced to int6, while embeddings and the LM head stay in fp16.
- Retained the stronger Muon crossover schedule, including its warmup and warmdown settings.
- Achieved a new best validation score for this submission family under the 16 MB cap.