PR #500 (open)

Submission/2026 03 22 Sliding Window + WARMDOWN + AttnRes + PhiSimple (mean 1.1925 BPB)

by ikermoel
val_bpb: 1.1925
Architecture: Transformer
Optimizer:
Artifact Size: 14.9 MB

Training Techniques

LR Schedule: warmdown
Always-decaying LR schedule from the first step, for better int8 quantization.
parameters: {"warmdown_steps": 20000}
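A minimal sketch of such a schedule. The submission only specifies warmdown_steps=20000 and that decay begins at step 0; the linear decay shape and the base LR of 1e-3 below are assumptions for illustration:

```python
def warmdown_lr(step: int, base_lr: float = 1e-3, warmdown_steps: int = 20000) -> float:
    """Always-decaying LR: no warmup phase, decay from base_lr at step 0
    down to 0 at warmdown_steps. The monotone decay is meant to keep weight
    distributions tight (fewer outliers), which helps int8 quantization."""
    frac = max(0.0, 1.0 - step / warmdown_steps)  # remaining fraction of the run
    return base_lr * frac
```

Unlike a warmup-then-decay schedule, the LR is already shrinking at step 0, so late-training weight updates are small from the start of the run.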
Evaluation: sliding window eval
parameters: {"stride": 64, "context_length": 960}
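A sketch of the window bookkeeping this implies, assuming a model context of 1024 tokens (so the minimum scored context is 1024 − 64 = 960, matching the parameters); the function name and loop structure are illustrative, not the submission's code:

```python
def sliding_window_spans(n_tokens: int, window: int = 1024, stride: int = 64):
    """Plan evaluation windows of size `window`, advancing by `stride`.

    Each window spans [begin, end); only tokens in [score_from, end) are
    scored, so apart from the first window every scored token sees at
    least window - stride (= 960) tokens of context, instead of the
    0..window-1 tokens it would get under disjoint-chunk evaluation.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))  # score only not-yet-scored tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once; the cost is one forward pass per stride-64 step instead of one per 1024-token chunk.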
Architecture: Block AttnRes
Learned attention over previous block outputs at block boundaries, replacing fixed residual aggregation.
parameters: {"block_boundary_interval": 3, "added_parameters": 1024, "query_count": 2, "dimension": 512}
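A rough numpy sketch of the idea, under stated assumptions: the ~1024 added parameters are read as 2 learned query vectors of dimension 512 (query_count × dimension) attending over the stacked outputs of the preceding blocks; the class name, the scaling, and averaging the two query heads are guesses, not the submission's implementation:

```python
import numpy as np

class BlockAttnRes:
    """Every block_boundary_interval (3) blocks, mix the outputs of all
    preceding blocks with a tiny learned attention instead of a fixed
    residual sum. Only the query vectors are learned: 2 x 512 = 1024 params."""

    def __init__(self, dim: int = 512, n_queries: int = 2, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.queries = rng.standard_normal((n_queries, dim)) * 0.02

    def __call__(self, block_outputs):
        # block_outputs: list of (seq, dim) arrays, one per preceding block
        stacked = np.stack(block_outputs)                    # (blocks, seq, dim)
        scores = np.einsum("qd,bsd->qbs", self.queries, stacked)
        scores /= np.sqrt(stacked.shape[-1])
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)        # softmax over blocks
        mixed = np.einsum("qbs,bsd->qsd", weights, stacked)  # per-query mixtures
        return mixed.mean(axis=0)                            # (seq, dim)
```

Because the output is a convex combination of the block outputs, the module can learn to recover a plain residual (near-uniform weights) or to emphasize particular blocks.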
Architecture: PhiAlpha Simple
Per-layer learnable scale on the relu² activation: relu²(x) * (1 + alpha), with alpha initialized to 0.
parameters: null
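The activation itself is fully specified by the description above; here is a minimal numpy sketch (in the real model, alpha would be one learnable parameter per layer rather than a plain float):

```python
import numpy as np

class PhiAlphaSimple:
    """relu²(x) * (1 + alpha), one learnable scalar alpha per layer.
    With alpha initialized to 0 the layer is exactly relu² at the start
    of training, so the change is a no-op until the optimizer moves alpha."""

    def __init__(self):
        self.alpha = 0.0  # learnable in the real model; plain float here

    def __call__(self, x):
        r = np.maximum(x, 0.0)
        return (r * r) * (1.0 + self.alpha)
```

One scalar per layer means the parameter and compute overhead is negligible next to the matmuls on either side of the activation.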

Novel Contributions

  • Always-decaying learning rate schedule (WARMDOWN_ITERS=20000) to improve int8 quantization by producing tighter weight distributions with fewer outliers
  • Sliding window evaluation with stride 64, so every token is scored with at least 960 tokens of context instead of the 0-1023 tokens (≈512 on average) it gets under disjoint-chunk evaluation
  • Block AttnRes: learned attention over the outputs of previous blocks at block boundaries, replacing fixed residual aggregation and adding only ~1024 parameters
  • PhiAlpha Simple: per-layer learnable scale on relu² activation with near-zero overhead