val_bpb: 1.0818
Architecture: Transformer
Optimizer: Muon (detailed under Optimizer below)
Artifact Size: 15,991,530 bytes
Training Techniques
Architecture
depth recurrence
Uses 3-layer recurrence over layers 3-5 in the PR1720 stack.
parameters: {"layers":3,"start_layer":3,"end_layer":5}
parallel residuals
Applies parallel residual connections starting at layer 7.
parameters: {"start_layer":7}
Quantization
GPTQ
bits: null
scope: model weights
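A condensed sketch of GPTQ's column-wise procedure for one linear layer; the record leaves the bit-width null, so bits=4 is a placeholder, and blocking and activation ordering are omitted for brevity:

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric round-to-nearest quantization of one weight column."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def gptq_quantize(W: torch.Tensor, X: torch.Tensor, bits: int = 4,
                  damp: float = 0.01) -> torch.Tensor:
    """Quantize weight columns in order, spreading each column's rounding
    error onto later columns via the Cholesky factor of the inverse input
    Hessian. W is (out, in); X is (n_samples, in) calibration activations."""
    W = W.clone().float()
    H = X.T @ X                                    # proxy Hessian, (in, in)
    H += damp * H.diag().mean() * torch.eye(H.shape[0])
    U = torch.linalg.cholesky(torch.inverse(H)).T  # upper-triangular factor
    for j in range(W.shape[1]):
        q = quantize_rtn(W[:, j], bits)
        err = (W[:, j] - q) / U[j, j]
        W[:, j:] -= torch.outer(err, U[j, j:])     # column j becomes exactly q
    return W
```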
Weight Averaging
EMA
parameters: {"decay":0.9965}
Evaluation
sliding window eval
parameters: null
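No parameters are recorded, so the sketch below assumes a window equal to the 8192-token eval length, a hypothetical stride of 4096, and a model that maps token ids to logits:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, window: int = 8192,
                       stride: int = 4096, bytes_per_token: float = 1.0) -> float:
    """Sliding-window eval: score a long token stream with a fixed context
    window, counting each target position exactly once. window, stride,
    and bytes_per_token are assumptions; model(ids) is assumed to return
    (1, T, vocab) logits."""
    total_nll, total_count = 0.0, 0
    scored_upto = 0                                # highest target index scored
    for start in range(0, len(tokens) - 1, stride):
        end = min(start + window + 1, len(tokens))
        chunk = tokens[start:end].unsqueeze(0)     # context plus targets
        logits = model(chunk[:, :-1])
        nll = F.cross_entropy(logits[0], chunk[0, 1:], reduction="none")
        skip = max(0, scored_upto - start)         # overlap already scored
        total_nll += nll[skip:].sum().item()
        total_count += nll.numel() - skip
        scored_upto = end - 1
        if end == len(tokens):
            break
    # bits per byte: nats/token -> bits/token -> bits/byte
    return total_nll / total_count / math.log(2) / bytes_per_token
```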
LR Schedule
warmdown
parameters: null
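A sketch of a warmdown schedule, assuming the common hold-then-linear-decay shape; the warmdown fraction is a placeholder, since the record lists no parameters:

```python
def warmdown_lr(step: int, max_steps: int, base_lr: float,
                warmdown_frac: float = 0.3) -> float:
    """Hold the LR flat, then decay linearly to zero over the final
    fraction of training. warmdown_frac is an assumed value."""
    warmdown_start = int(max_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    remaining = max_steps - step
    return base_lr * remaining / (max_steps - warmdown_start)
```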
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.026}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Compression
Brotli
level: null
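A sketch of Brotli-compressing the artifact; the record gives no level, so quality=11 (Brotli's maximum) is an assumption, and "model.bin" is a hypothetical file name:

```python
import brotli

# Compress the serialized checkpoint with Brotli.
with open("model.bin", "rb") as f:
    raw = f.read()
packed = brotli.compress(raw, quality=11)  # quality level is an assumption
with open("model.bin.br", "wb") as f:
    f.write(packed)
```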
Novel Contributions
- Non-record negative result for a PR1720 no-val-pause variant
- Skipped mid-training validation by setting VAL_LOSS_EVERY=99999 (see the sketch after this list)
- Documented that the expected step-recovery mechanism failed
- Reported a slightly better single-seed val_bpb than the reference but with earlier stopping
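A sketch of the VAL_LOSS_EVERY skip described above; apart from VAL_LOSS_EVERY itself, the names here are illustrative stand-ins, not taken from PR1720:

```python
import os

# With a validation interval set far past the total step count, the
# mid-training eval branch never fires, so training pays no validation pauses.
VAL_LOSS_EVERY = int(os.environ.get("VAL_LOSS_EVERY", "99999"))
NUM_STEPS = 5000  # hypothetical run length, well below 99999

def train_one_step(step: int) -> None:
    pass  # placeholder for the real training step

def evaluate_val_bpb(step: int) -> None:
    print(f"val eval at step {step}")  # never reached while the skip is active

for step in range(1, NUM_STEPS + 1):
    train_one_step(step)
    if step % VAL_LOSS_EVERY == 0:  # 99999 > NUM_STEPS, so this never triggers
        evaluate_val_bpb(step)
```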