PR #500
Submission/2026 03 22 Sliding Window + WARMDOWN + AttnRes + PhiSimple (mean 1.1925 BPB)
by ikermoel
val_bpb
1.1925
Architecture
Transformer
Optimizer
—
Artifact Size
14.9 MB
Training Techniques
LR Schedule
warmdown
parameters: {"warmdown_steps":20000,"description":"always-decaying LR schedule from first step for better int8 quantization"}
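The warmdown schedule can be sketched as a plain function of the step count. This is a minimal illustration, not the submission's code: it assumes a linear decay from the base LR at step 0 down to zero at `warmdown_steps=20000`, with no warmup phase (the function name and `base_lr` default are hypothetical).

```python
def warmdown_lr(step: int, base_lr: float = 1e-3, warmdown_steps: int = 20000) -> float:
    """Always-decaying LR: starts at base_lr on the very first step and
    decays linearly to 0 at warmdown_steps. Decaying from step 0 keeps
    weight magnitudes in check, which (per the submission) tightens the
    weight distribution for int8 quantization."""
    frac = min(step, warmdown_steps) / warmdown_steps
    return base_lr * (1.0 - frac)
```

In a PyTorch training loop this shape would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR`; the key point is the absence of any warmup segment.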
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":960}
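The index arithmetic behind strided sliding-window evaluation can be sketched as follows. This is an assumed reconstruction (function name and return format are hypothetical): windows advance by `stride`, and within each window only the tokens not already scored by a previous window are counted, so every token is evaluated exactly once and, after the first window, always with a long left context.

```python
def sliding_window_spans(n_tokens: int, context_length: int = 960, stride: int = 64):
    """Plan the forward passes for sliding-window eval.

    Returns a list of (window_begin, window_end, score_from) triples:
    the model sees tokens [window_begin, window_end) and loss is taken
    only on targets in [score_from, window_end). Each token is scored
    exactly once, in the window where it has the most left context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With `stride=64` this costs roughly `context_length / stride` (= 15x) more forward passes than non-overlapping evaluation, which is the price paid for the near-full context per token.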
Architecture
Block AttnRes
learned attention over previous block outputs at block boundaries replacing fixed residual aggregation
parameters: {"block_boundary_interval":3,"added_parameters":1024,"query_count":2,"dimension":512}
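A minimal sketch of the Block AttnRes idea, under stated assumptions: the ~1024 added parameters are taken to be 2 learned query vectors of dimension 512 (`query_count * dimension = 1024`, matching the listed parameters), which attend over the stack of previous block outputs to produce the residual fed across a block boundary. The function name, the softmax-over-blocks formulation, and averaging the two query heads are all assumptions, not the submission's code.

```python
import numpy as np

def block_attn_res(block_outputs: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Learned attention over previous block outputs at a block boundary,
    replacing a fixed residual aggregation (e.g. a plain sum).

    block_outputs: (n_blocks, seq, dim) stack of per-block outputs
    queries:       (query_count, dim) learned parameters (2 x 512 = 1024)
    Returns:       (seq, dim) attention-weighted mix of the blocks
    """
    n_blocks, seq, dim = block_outputs.shape
    # Per-position score of each query against each block's output
    scores = np.einsum('qd,bsd->qbs', queries, block_outputs) / np.sqrt(dim)
    # Softmax over the block axis: which earlier blocks to draw from
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    mixed = np.einsum('qbs,bsd->qsd', weights, block_outputs)
    return mixed.mean(axis=0)  # average the query heads
```

Note that with zero-initialized queries the softmax is uniform, so the module starts out as a plain mean over block outputs and learns to deviate from it.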
PhiAlpha Simple
per-layer learnable scale on relu² activation: relu²(x) * (1 + alpha), alpha initialized to 0
parameters: null
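The PhiAlpha Simple activation is fully specified by the one-line formula above; a direct sketch (function name is hypothetical, `alpha` would be a learnable per-layer parameter in the real model):

```python
import numpy as np

def phi_alpha_simple(x: np.ndarray, alpha: float = 0.0) -> np.ndarray:
    """relu^2 with a learnable per-layer scale: relu(x)**2 * (1 + alpha).
    With alpha initialized to 0 this is exactly the baseline relu^2
    activation, so training starts from unchanged behavior."""
    r = np.maximum(x, 0.0)
    return (r * r) * (1.0 + alpha)
```

Since `alpha` is a single scalar per layer, the overhead is one multiply per activation and a handful of parameters, consistent with the "near-zero overhead" claim.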
Novel Contributions
- Always-decaying learning rate schedule (WARMDOWN_ITERS=20000) to improve int8 quantization by producing tighter weight distributions with fewer outliers
- Sliding window evaluation with stride 64, so every token is scored with close to the full 960-token context instead of the 0-1023 tokens of context a token receives under non-overlapping chunked evaluation
- Block AttnRes: learned attention over previous block outputs at block boundaries replacing fixed residual aggregation, adding ~1024 parameters
- PhiAlpha Simple: per-layer learnable scale on relu² activation with near-zero overhead