PR #1212

open

Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean)

by Gusanidas
val_bpb
1.1108
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,726,762 bytes

Training Techniques

Architecture
BigramHash
Uses a bigram hash embedding component.
parameters: {"size":5120}
VE128
Uses value embeddings with dimension 128.
parameters: {"dimensions":128}
XSA
Builds on cross-head subtracted attention from prior work.
parameters: null
SmearGate
Uses banked weight matrices and SmearGate from prior work.
parameters: null
LeakyReLU
Uses LeakyReLU(0.5)-squared MLP activation.
parameters: {"slope":0.5}
U-Net skip connections
Uses sigmoid-gated skip connections in place of learned scalar skip weights.
parameters: null
attention modification
Applies sliding-window attention to selected layers while keeping full attention in others.
parameters: {"window_size":512,"layers":[2,4,6,8,10]}
weight tying
Uses tied embeddings.
parameters: null
RoPE
Uses partial RoPE from prior work.
parameters: null
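
The sliding-window attention applied to layers 2, 4, 6, 8, and 10 restricts each query to the most recent 512 positions while the remaining layers keep full causal attention. A minimal sketch of the mask semantics (the actual submission uses Flash Attention's window_size argument rather than an explicit mask; this pure-Python version is only illustrative):

```python
# Sketch of the sliding-window attention rule used on selected layers:
# a query at position q may attend to keys in [q - window_size + 1, q],
# i.e. causal attention limited to the last `window_size` positions.
# Flash Attention exposes this directly via its window_size argument;
# this function just spells out the semantics.

def window_attention_allowed(q_pos: int, k_pos: int, window_size: int = 512) -> bool:
    """True if a query at q_pos may attend to a key at k_pos."""
    return 0 <= q_pos - k_pos < window_size

def build_mask(seq_len: int, window_size: int = 512):
    """Boolean mask: mask[q][k] is True where attention is allowed."""
    return [[window_attention_allowed(q, k, window_size) for k in range(seq_len)]
            for q in range(seq_len)]
```

Full-attention layers correspond to the same rule with window_size equal to the sequence length.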
Sequence Length
sequence_length
train_length: 6144
eval_length: 6144
Evaluation
sliding window eval
parameters: {"stride":128,"context_length":6144}
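
With stride 128 and context length 6144, sliding-window evaluation slides the context forward in small steps so that each token is scored with close to the full 6144-token context. A sketch of the window bookkeeping (the function name and exact loop are illustrative, not from the submission):

```python
# Sketch of sliding-window evaluation: each window scores only its last
# `stride` tokens (the first window scores everything), so every token
# is predicted once, with nearly full context for all but the earliest
# positions. Returns (window_start, window_end, score_start) triples.

def sliding_eval_windows(n_tokens: int, context_length: int = 6144, stride: int = 128):
    windows = []
    start = 0
    while True:
        end = min(start + context_length, n_tokens)
        score_start = 0 if start == 0 else end - stride
        windows.append((start, end, score_start))
        if end == n_tokens:
            break
        start += stride
    return windows
```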
Compression
brotli
level: null
Optimizer
Muon
weight_decay: null
momentum: 0.985
other_params: {"warmdown_iters":4000}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
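
A warmdown schedule holds the learning rate constant for most of training, then decays it linearly to zero over the final warmdown_steps iterations (4000 here). A minimal sketch; total_iters is illustrative, since the record does not state the total step count:

```python
# Sketch of a warmdown LR schedule: constant LR, then linear decay to
# zero over the last `warmdown_iters` steps (4000 in this record).

def lr_scale(step: int, total_iters: int, warmdown_iters: int = 4000) -> float:
    """Multiplier applied to the base learning rate at `step`."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```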
Weight Averaging
EMA
parameters: null
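
EMA weight averaging keeps a shadow copy of the parameters, updated as an exponential moving average after each optimizer step; the shadow weights are then used for evaluation. A minimal sketch (the decay value is an assumption for illustration only, since the record lists no EMA parameters):

```python
# Sketch of EMA weight averaging: shadow <- decay * shadow + (1 - decay) * params.
# decay=0.999 below is illustrative, not from the submission.

def ema_update(shadow: dict, params: dict, decay: float = 0.999) -> None:
    """In-place EMA update of the shadow parameter copy."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
```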

Novel Contributions

  • Window attention on selected layers, using the Flash Attention 3 window_size argument
  • Mixed sequence-length training across GPUs (2048 and 6144 within the same step)
  • Evaluation at long context length with sliding window eval
  • 12-layer configuration with tuned qk_gain and mixed short/long sequence exposure
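
The mixed sequence-length idea can be sketched as follows: within one optimizer step, some data-parallel ranks train on 2048-token sequences while others train on 6144-token sequences, and the usual gradient all-reduce averages across all ranks. The rank assignment and batch accounting below are illustrative; the record does not specify how GPUs are split between the two lengths:

```python
# Sketch of mixed sequence-length training across GPUs: each data-parallel
# rank uses its own sequence length within the same step. Which ranks get
# the short length is an assumption here, for illustration only.

def rank_seq_len(rank: int, short_ranks: set, short_len: int = 2048,
                 long_len: int = 6144) -> int:
    """Sequence length used by a given data-parallel rank."""
    return short_len if rank in short_ranks else long_len

def tokens_per_step(world_size: int, short_ranks: set, batch_per_rank: int = 1,
                    short_len: int = 2048, long_len: int = 6144) -> int:
    """Total tokens consumed per optimizer step across all ranks."""
    return sum(batch_per_rank * rank_seq_len(r, short_ranks, short_len, long_len)
               for r in range(world_size))
```

Because gradients are averaged across ranks regardless of each rank's sequence length, the model sees both short and long sequences in every step, matching the mixed short/long exposure described above.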