PR #1212

open

Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean)

by Gusanidas
val_bpb
1.1108
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,726,762 bytes

Training Techniques

Architecture
BigramHash
Uses a bigram hash embedding component.
parameters: {"size":5120}
VE128
Uses value embeddings with dimension 128.
parameters: {"dimensions":128}
XSA
Builds on cross-head subtracted attention from prior work.
parameters: null
SmearGate
Uses banked weight matrices and SmearGate from prior work.
parameters: null
LeakyReLU
Uses LeakyReLU(0.5)-squared MLP activation.
parameters: {"slope":0.5}
U-Net skip connections
Uses sigmoid-gated skip connections in place of learned scalar skip weights.
parameters: null
attention modification
Applies sliding-window attention to selected layers while keeping full attention in others.
parameters: {"window_size":512,"layers":[2,4,6,8,10]}
weight tying
Uses tied embeddings.
parameters: null
RoPE
Uses partial RoPE from prior work.
parameters: null
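
The sliding-window attention applied to layers 2, 4, 6, 8, and 10 restricts each query to the most recent 512 positions while the remaining layers keep full causal attention. A minimal sketch of the mask semantics (the actual submission uses Flash Attention's window_size argument rather than an explicit mask; this pure-Python version is only illustrative):

```python
# Sketch of the sliding-window attention rule used on selected layers:
# a query at position q may attend to keys in [q - window_size + 1, q],
# i.e. causal attention limited to the last `window_size` positions.
# Flash Attention exposes this directly via its window_size argument;
# this function just spells out the semantics.

def window_attention_allowed(q_pos: int, k_pos: int, window_size: int = 512) -> bool:
    """True if a query at q_pos may attend to a key at k_pos."""
    return 0 <= q_pos - k_pos < window_size

def build_mask(seq_len: int, window_size: int = 512):
    """Boolean mask: mask[q][k] is True where attention is allowed."""
    return [[window_attention_allowed(q, k, window_size) for k in range(seq_len)]
            for q in range(seq_len)]
```

Full-attention layers correspond to the same rule with window_size equal to the sequence length.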
Sequence Length
sequence_length
train_length: 6144
eval_length: 6144
Evaluation
sliding window eval
parameters: {"stride":128,"context_length":6144}
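
With stride 128 and context length 6144, sliding-window evaluation slides the context forward in small steps so that each token is scored with close to the full 6144-token context. A sketch of the window bookkeeping (the function name and exact loop are illustrative, not from the submission):

```python
# Sketch of sliding-window evaluation: each window scores only its last
# `stride` tokens (the first window scores everything), so every token
# is predicted once, with nearly full context for all but the earliest
# positions. Returns (window_start, window_end, score_start) triples.

def sliding_eval_windows(n_tokens: int, context_length: int = 6144, stride: int = 128):
    windows = []
    start = 0
    while True:
        end = min(start + context_length, n_tokens)
        score_start = 0 if start == 0 else end - stride
        windows.append((start, end, score_start))
        if end == n_tokens:
            break
        start += stride
    return windows
```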
Compression
brotli
level: null
Optimizer
Muon
weight_decay: null
momentum: 0.985
other_params: {"warmdown_iters":4000}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
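
A warmdown schedule holds the learning rate constant for most of training, then decays it linearly to zero over the final warmdown_steps iterations (4000 here). A minimal sketch; total_iters is illustrative, since the record does not state the total step count:

```python
# Sketch of a warmdown LR schedule: constant LR, then linear decay to
# zero over the last `warmdown_iters` steps (4000 in this record).

def lr_scale(step: int, total_iters: int, warmdown_iters: int = 4000) -> float:
    """Multiplier applied to the base learning rate at `step`."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```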
Weight Averaging
EMA
parameters: null
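
EMA weight averaging keeps a shadow copy of the parameters, updated as an exponential moving average after each optimizer step; the shadow weights are then used for evaluation. A minimal sketch (the decay value is an assumption for illustration only, since the record lists no EMA parameters):

```python
# Sketch of EMA weight averaging: shadow <- decay * shadow + (1 - decay) * params.
# decay=0.999 below is illustrative, not from the submission.

def ema_update(shadow: dict, params: dict, decay: float = 0.999) -> None:
    """In-place EMA update of the shadow parameter copy."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
```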

Novel Contributions

  • Window attention on selected layers, using the Flash Attention 3 window_size argument
  • Mixed sequence-length training across GPUs (2048 and 6144 within the same step)
  • Evaluation at long context length with sliding window eval
  • 12-layer configuration with tuned qk_gain and mixed short/long sequence exposure
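
The mixed sequence-length idea can be sketched as follows: within one optimizer step, some data-parallel ranks train on 2048-token sequences while others train on 6144-token sequences, and the usual gradient all-reduce averages across all ranks. The rank assignment and batch accounting below are illustrative; the record does not specify how GPUs are split between the two lengths:

```python
# Sketch of mixed sequence-length training across GPUs: each data-parallel
# rank uses its own sequence length within the same step. Which ranks get
# the short length is an assumption here, for illustration only.

def rank_seq_len(rank: int, short_ranks: set, short_len: int = 2048,
                 long_len: int = 6144) -> int:
    """Sequence length used by a given data-parallel rank."""
    return short_len if rank in short_ranks else long_len

def tokens_per_step(world_size: int, short_ranks: set, batch_per_rank: int = 1,
                    short_len: int = 2048, long_len: int = 6144) -> int:
    """Total tokens consumed per optimizer step across all ranks."""
    return sum(batch_per_rank * rank_seq_len(r, short_ranks, short_len, long_len)
               for r in range(world_size))
```

Because gradients are averaged across ranks regardless of each rank's sequence length, the model sees both short and long sequences in every step, matching the mixed short/long exposure described above.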