PR #1219

open

Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean)

by Gusanidas
val_bpb
1.1084
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size

Training Techniques

Architecture
attention
Window attention applied on layers 2, 4, 6, 8, and 10 using FlashAttention 3.
parameters: {"layers":[2,4,6,8,10],"window_size":512}
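A minimal sketch of how windowed attention on those layers could look. The record uses FlashAttention 3; as a portable stand-in this sketch builds a banded causal mask and uses PyTorch's `scaled_dot_product_attention`. The function names and the dispatch structure are assumptions for illustration; only the layer list and window size come from the record.

```python
import torch
import torch.nn.functional as F

WINDOW_LAYERS = {2, 4, 6, 8, 10}   # layers with windowed attention (from the record)
WINDOW_SIZE = 512                  # window size (from the record)

def sliding_window_causal_mask(seq_len, window_size, device=None):
    # True = may attend. Causal band: each query sees at most the
    # `window_size` most recent keys, itself included.
    idx = torch.arange(seq_len, device=device)
    rel = idx[None, :] - idx[:, None]          # key_pos - query_pos
    return (rel <= 0) & (rel > -window_size)

def attention(q, k, v, layer_idx):
    # q, k, v: (batch, heads, seq, head_dim)
    seq_len = q.shape[-2]
    if layer_idx in WINDOW_LAYERS:
        mask = sliding_window_causal_mask(seq_len, WINDOW_SIZE, q.device)
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    # other layers keep full causal attention
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

With FlashAttention 3 the same banding is expressed via its native sliding-window support instead of a materialized mask, which avoids the O(seq_len²) mask tensor.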
Causal n-gram fix
Fixed within_hint/word_hint to be prefix-only for causal behavior.
parameters: null
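The idea behind the fix can be sketched as follows. The actual `within_hint`/`word_hint` implementations are not shown in this record, so this is a hypothetical toy version: the point is only that a hint computed at position `pos` must be a function of characters strictly before `pos`, never of the whole word, or it leaks future tokens and breaks causality.

```python
def word_hint(word_chars, pos):
    # Hypothetical prefix-only hint: when predicting character `pos` of a
    # word, the hint may encode only characters strictly before `pos`.
    # Hashing the full word here would leak future characters.
    prefix = word_chars[:pos]            # strictly earlier characters only
    return hash(tuple(prefix)) % 65536   # toy hash-based hint id
```

Two words sharing a prefix must produce the same hint at that position, since a causal model cannot distinguish them yet.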
Sequence Length
sequence_length
train_length: 6144
eval_length: 6144
sequence_length
train_length: 2048
eval_length: null
sequence_length
train_length: 6144
eval_length: null
Evaluation
sliding window eval
parameters: {"stride":128,"context_length":6144}
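The sliding-window evaluation above can be sketched as the standard strided scheme: each window overlaps the previous one by `context_length - stride` tokens of context, and only the not-yet-scored trailing tokens are counted, so every token is scored exactly once. `nll_fn` is a stand-in for the model returning per-token negative log-likelihood in nats; the byte-level bpb conversion is an assumption.

```python
import math
import torch

def sliding_window_bpb(nll_fn, tokens, context_length=6144, stride=128):
    # Score `tokens` with sliding windows of `context_length`, advancing
    # `stride` tokens at a time; count only tokens new to each window.
    total_nll, total_count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context_length, len(tokens))
        trg_len = end - prev_end                  # tokens not yet scored
        nll = nll_fn(tokens[begin:end])           # per-token NLL, in nats
        total_nll += nll[-trg_len:].sum().item()
        total_count += trg_len
        prev_end = end
        if end == len(tokens):
            break
    # bits per byte = mean NLL / ln 2 (assuming byte-level tokens)
    return total_nll / (total_count * math.log(2))
```

A stride of 128 against a 6144 context means each scored token sees at least 6016 tokens of left context (except at the start of the stream).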
Quantization
GPTQ
bits: 6
scope: train-data calibration
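For orientation, a stand-in for the 6-bit quantization grid is sketched below. Real GPTQ additionally uses second-order (Hessian) statistics gathered from calibration activations — here, batches reused from the training data rather than a separate calibration set — to compensate rounding error column by column; this sketch shows only symmetric per-output-channel round-to-nearest at 6 bits, and every name in it is an assumption.

```python
import torch

def fake_quant_6bit(w):
    # Symmetric per-output-channel 6-bit round-to-nearest (RTN), a
    # simplified stand-in for GPTQ's error-compensated quantization.
    # w: (out_features, in_features) weight matrix.
    qmax = 2 ** (6 - 1) - 1                       # 31 for signed 6-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                 # guard all-zero rows
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return q * scale                              # dequantized weights
```

Calibrating on training batches avoids preparing a separate held-out calibration corpus, at the cost of statistics slightly biased toward the training distribution.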

Novel Contributions

  • Causal n-gram prefix-only fix for within_hint/word_hint
  • Window attention on selected layers via FlashAttention 3
  • Mixed sequence-length training across 2048 and 6144 token batches
  • Train-data GPTQ calibration for faster quantization setup
  • Automatic eval sequence-length detection from maximum training length
  • Sliding-window evaluation at 6144 context with stride 128
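The automatic eval sequence-length detection in the list above can be sketched as a small config helper: when `eval_length` is unset (null, as in the sequence-length entries earlier), fall back to the maximum training length. The helper name and signature are assumptions for illustration.

```python
def resolve_eval_length(train_lengths, eval_length=None):
    # When no eval length is configured, default to the longest
    # sequence length used during training.
    return eval_length if eval_length is not None else max(train_lengths)
```

Under this rule the mixed 2048/6144 schedule evaluates at 6144 unless overridden explicitly.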