PR #1219

open

Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean)

by Gusanidas
val_bpb
1.1084
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size

Training Techniques

Architecture
attention
Window attention applied on layers 2, 4, 6, 8, and 10 using FlashAttention 3.
parameters: {"layers":[2,4,6,8,10],"window_size":512}
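A minimal sketch of how windowed attention on those layers could look. The record uses FlashAttention 3; as a portable stand-in this sketch builds a banded causal mask and uses PyTorch's `scaled_dot_product_attention`. The function names and the dispatch structure are assumptions for illustration; only the layer list and window size come from the record.

```python
import torch
import torch.nn.functional as F

WINDOW_LAYERS = {2, 4, 6, 8, 10}   # layers with windowed attention (from the record)
WINDOW_SIZE = 512                  # window size (from the record)

def sliding_window_causal_mask(seq_len, window_size, device=None):
    # True = may attend. Causal band: each query sees at most the
    # `window_size` most recent keys, itself included.
    idx = torch.arange(seq_len, device=device)
    rel = idx[None, :] - idx[:, None]          # key_pos - query_pos
    return (rel <= 0) & (rel > -window_size)

def attention(q, k, v, layer_idx):
    # q, k, v: (batch, heads, seq, head_dim)
    seq_len = q.shape[-2]
    if layer_idx in WINDOW_LAYERS:
        mask = sliding_window_causal_mask(seq_len, WINDOW_SIZE, q.device)
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    # other layers keep full causal attention
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

With FlashAttention 3 the same banding is expressed via its native sliding-window support instead of a materialized mask, which avoids the O(seq_len²) mask tensor.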
Causal n-gram fix
Fixed within_hint/word_hint to be prefix-only for causal behavior.
parameters: null
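The idea behind the fix can be sketched as follows. The actual `within_hint`/`word_hint` implementations are not shown in this record, so this is a hypothetical toy version: the point is only that a hint computed at position `pos` must be a function of characters strictly before `pos`, never of the whole word, or it leaks future tokens and breaks causality.

```python
def word_hint(word_chars, pos):
    # Hypothetical prefix-only hint: when predicting character `pos` of a
    # word, the hint may encode only characters strictly before `pos`.
    # Hashing the full word here would leak future characters.
    prefix = word_chars[:pos]            # strictly earlier characters only
    return hash(tuple(prefix)) % 65536   # toy hash-based hint id
```

Two words sharing a prefix must produce the same hint at that position, since a causal model cannot distinguish them yet.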
Sequence Length
sequence_length
train_length: 6144
eval_length: 6144
sequence_length
train_length: 2048
eval_length: null
sequence_length
train_length: 6144
eval_length: null
Evaluation
sliding window eval
parameters: {"stride":128,"context_length":6144}
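The sliding-window evaluation above can be sketched as the standard strided scheme: each window overlaps the previous one by `context_length - stride` tokens of context, and only the not-yet-scored trailing tokens are counted, so every token is scored exactly once. `nll_fn` is a stand-in for the model returning per-token negative log-likelihood in nats; the byte-level bpb conversion is an assumption.

```python
import math
import torch

def sliding_window_bpb(nll_fn, tokens, context_length=6144, stride=128):
    # Score `tokens` with sliding windows of `context_length`, advancing
    # `stride` tokens at a time; count only tokens new to each window.
    total_nll, total_count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context_length, len(tokens))
        trg_len = end - prev_end                  # tokens not yet scored
        nll = nll_fn(tokens[begin:end])           # per-token NLL, in nats
        total_nll += nll[-trg_len:].sum().item()
        total_count += trg_len
        prev_end = end
        if end == len(tokens):
            break
    # bits per byte = mean NLL / ln 2 (assuming byte-level tokens)
    return total_nll / (total_count * math.log(2))
```

A stride of 128 against a 6144 context means each scored token sees at least 6016 tokens of left context (except at the start of the stream).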
Quantization
GPTQ
bits: 6
scope: train-data calibration
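For orientation, a stand-in for the 6-bit quantization grid is sketched below. Real GPTQ additionally uses second-order (Hessian) statistics gathered from calibration activations — here, batches reused from the training data rather than a separate calibration set — to compensate rounding error column by column; this sketch shows only symmetric per-output-channel round-to-nearest at 6 bits, and every name in it is an assumption.

```python
import torch

def fake_quant_6bit(w):
    # Symmetric per-output-channel 6-bit round-to-nearest (RTN), a
    # simplified stand-in for GPTQ's error-compensated quantization.
    # w: (out_features, in_features) weight matrix.
    qmax = 2 ** (6 - 1) - 1                       # 31 for signed 6-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                 # guard all-zero rows
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return q * scale                              # dequantized weights
```

Calibrating on training batches avoids preparing a separate held-out calibration corpus, at the cost of statistics slightly biased toward the training distribution.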

Novel Contributions

  • Causal n-gram prefix-only fix for within_hint/word_hint
  • Window attention on selected layers via FlashAttention 3
  • Mixed sequence-length training across 2048 and 6144 token batches
  • Train-data GPTQ calibration for faster quantization setup
  • Automatic eval sequence-length detection from maximum training length
  • Sliding-window evaluation at 6144 context with stride 128
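The automatic eval sequence-length detection in the list above can be sketched as a small config helper: when `eval_length` is unset (null, as in the sequence-length entries earlier), fall back to the maximum training length. The helper name and signature are assumptions for illustration.

```python
def resolve_eval_length(train_lengths, eval_length=None):
    # When no eval length is configured, default to the longest
    # sequence length used during training.
    return eval_length if eval_length is not None else max(train_lengths)
```

Under this rule the mixed 2048/6144 schedule evaluates at 6144 unless overridden explicitly.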