val_bpb: 1.1373
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 15,034,550 bytes
Training Techniques
Architecture: depth recurrence
9 unique (weight-unshared) flat transformer blocks followed by 1 shared crawler block that loops 2 times, with differentiated RoPE scales per loop.
parameters: {"layers":9,"crawler_layers":1,"loops":2}
RoPE
Differentiated RoPE scales for the crawler loop.
parameters: {"scales":[9,1,1]}
Gated Attention
Uses QK gain initialization to sharpen attention gradients early in training.
parameters: {"qk_gain_init":4}
Quantization: GPTQ
bits: 8
scope: block weights
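For context, the core GPTQ inner loop on a single weight matrix looks roughly like this (textbook per-column quantize-then-compensate; the loop-aware two-phase Hessian calibration listed under Novel Contributions is not reproduced here, and the helper is a simplified sketch):

```python
import torch

def gptq_quantize(W, H, bits=8, damp=0.01):
    """W: (rows, cols) float weights; H: (cols, cols) calibration Hessian
    (~ 2 * X @ X.T over calibration activations X). Returns weights snapped
    to a symmetric per-row int grid with error compensation."""
    rows, cols = W.shape
    H = H + damp * H.diagonal().mean() * torch.eye(cols, dtype=W.dtype)
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    qmax = 2 ** (bits - 1) - 1
    scale = (W.abs().amax(dim=1) / qmax).clamp_min(1e-12)  # per-row scale
    W = W.clone()
    for j in range(cols):
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        err = (w - q) / Hinv[j, j]                # residual in "Hessian units"
        W[:, j] = q
        if j + 1 < cols:                          # push error onto later columns
            W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return W
```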
Compression: brotli
level: 11
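A sketch of the packing step with the Python `brotli` bindings (file names are placeholders):

```python
import brotli

with open("artifact.bin", "rb") as f:      # placeholder file name
    raw = f.read()
packed = brotli.compress(raw, quality=11)  # 11 = max quality (slowest, densest)
with open("artifact.bin.br", "wb") as f:
    f.write(packed)
assert brotli.decompress(packed) == raw    # lossless round trip
```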
LR Schedule: warmdown
parameters: {"warmdown_iters":2000}
Evaluation: sliding window eval
parameters: {"window":"sliding"}
Novel Contributions
- Loop-aware GPTQ with two-phase Hessian calibration for crawler-weight quantization
- Brotli compression to reduce artifact size and free headroom for GPTQ overhead
- QK gain initialization at 4.0 for sharper initial attention gradients
- Two-loop crawler cadence to improve throughput and fit more training steps within the wall-clock budget
- Tuned warmdown schedule (2,000 warmdown iterations)