PR #500
Submission/2026 03 22 Sliding Window + WARMDOWN + AttnRes + PhiSimple (mean 1.1925 BPB)
by ikermoel
val_bpb
1.1925
Architecture
Transformer
Optimizer
—
Artifact Size
14.9 MB
Training Techniques
LR Schedule
warmdown
parameters: {"warmdown_steps":20000,"description":"always-decaying LR schedule from first step for better int8 quantization"}
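The warmdown schedule can be sketched as a plain function of the step count. This is a minimal illustration, not the submission's code: it assumes a linear decay from the base LR at step 0 down to zero at `warmdown_steps=20000`, with no warmup phase (the function name and `base_lr` default are hypothetical).

```python
def warmdown_lr(step: int, base_lr: float = 1e-3, warmdown_steps: int = 20000) -> float:
    """Always-decaying LR: starts at base_lr on the very first step and
    decays linearly to 0 at warmdown_steps. Decaying from step 0 keeps
    weight magnitudes in check, which (per the submission) tightens the
    weight distribution for int8 quantization."""
    frac = min(step, warmdown_steps) / warmdown_steps
    return base_lr * (1.0 - frac)
```

In a PyTorch training loop this shape would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR`; the key point is the absence of any warmup segment.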
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":960}
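The index arithmetic behind strided sliding-window evaluation can be sketched as follows. This is an assumed reconstruction (function name and return format are hypothetical): windows advance by `stride`, and within each window only the tokens not already scored by a previous window are counted, so every token is evaluated exactly once and, after the first window, always with a long left context.

```python
def sliding_window_spans(n_tokens: int, context_length: int = 960, stride: int = 64):
    """Plan the forward passes for sliding-window eval.

    Returns a list of (window_begin, window_end, score_from) triples:
    the model sees tokens [window_begin, window_end) and loss is taken
    only on targets in [score_from, window_end). Each token is scored
    exactly once, in the window where it has the most left context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With `stride=64` this costs roughly `context_length / stride` (= 15x) more forward passes than non-overlapping evaluation, which is the price paid for the near-full context per token.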
Architecture
Block AttnRes
learned attention over previous block outputs at block boundaries replacing fixed residual aggregation
parameters: {"block_boundary_interval":3,"added_parameters":1024,"query_count":2,"dimension":512}
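A minimal sketch of the Block AttnRes idea, under stated assumptions: the ~1024 added parameters are taken to be 2 learned query vectors of dimension 512 (`query_count * dimension = 1024`, matching the listed parameters), which attend over the stack of previous block outputs to produce the residual fed across a block boundary. The function name, the softmax-over-blocks formulation, and averaging the two query heads are all assumptions, not the submission's code.

```python
import numpy as np

def block_attn_res(block_outputs: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Learned attention over previous block outputs at a block boundary,
    replacing a fixed residual aggregation (e.g. a plain sum).

    block_outputs: (n_blocks, seq, dim) stack of per-block outputs
    queries:       (query_count, dim) learned parameters (2 x 512 = 1024)
    Returns:       (seq, dim) attention-weighted mix of the blocks
    """
    n_blocks, seq, dim = block_outputs.shape
    # Per-position score of each query against each block's output
    scores = np.einsum('qd,bsd->qbs', queries, block_outputs) / np.sqrt(dim)
    # Softmax over the block axis: which earlier blocks to draw from
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    mixed = np.einsum('qbs,bsd->qsd', weights, block_outputs)
    return mixed.mean(axis=0)  # average the query heads
```

Note that with zero-initialized queries the softmax is uniform, so the module starts out as a plain mean over block outputs and learns to deviate from it.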
PhiAlpha Simple
per-layer learnable scale on relu² activation: relu²(x) * (1 + alpha), alpha initialized to 0
parameters: null
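The PhiAlpha Simple activation is fully specified by the one-line formula above; a direct sketch (function name is hypothetical, `alpha` would be a learnable per-layer parameter in the real model):

```python
import numpy as np

def phi_alpha_simple(x: np.ndarray, alpha: float = 0.0) -> np.ndarray:
    """relu^2 with a learnable per-layer scale: relu(x)**2 * (1 + alpha).
    With alpha initialized to 0 this is exactly the baseline relu^2
    activation, so training starts from unchanged behavior."""
    r = np.maximum(x, 0.0)
    return (r * r) * (1.0 + alpha)
```

Since `alpha` is a single scalar per layer, the overhead is one multiply per activation and a handful of parameters, consistent with the "near-zero overhead" claim.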
Novel Contributions
- Always-decaying learning rate schedule (WARMDOWN_ITERS=20000) to improve int8 quantization by producing tighter weight distributions with fewer outliers
- Sliding window evaluation with stride 64, so every token is scored with close to the full 960-token context instead of the 0-1023 tokens of context a token receives under non-overlapping chunked evaluation
- Block AttnRes: learned attention over previous block outputs at block boundaries replacing fixed residual aggregation, adding ~1024 parameters
- PhiAlpha Simple: per-layer learnable scale on relu² activation with near-zero overhead