PR #1219
Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean)
by Gusanidas
val_bpb: 1.1084
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: —
Training Techniques
Architecture
attention
Window attention applied on layers 2, 4, 6, 8, and 10 using FlashAttention 3.
parameters: {"layers":[2,4,6,8,10],"window_size":512}
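The combination of a causal mask and a 512-key sliding window can be expressed as a single boolean mask; a minimal plain-PyTorch sketch, standing in for the FlashAttention 3 kernel's window arguments (layer selection is handled outside the mask):

```python
import torch

def window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    # True = query may attend to key. Causal ordering, further restricted
    # so each query sees only the last `window_size` keys.
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]
    in_window = idx[:, None] - idx[None, :] < window_size
    return causal & in_window

# Layers 2, 4, 6, 8, and 10 would use this mask; the rest stay fully causal.
mask = window_causal_mask(seq_len=8, window_size=4)
```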
Causal n-gram fix
Fixed within_hint/word_hint to consume only the prefix before each position, so the n-gram hints remain causal.
parameters: null
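The fix restricts each hint to tokens strictly before the current position. A hypothetical sketch of the idea (the PR's actual within_hint/word_hint signatures are not shown in this record and will differ):

```python
def within_hint(tokens, t, n=3):
    # Prefix-only n-gram hint for position t: may use tokens[:t] only,
    # never tokens at or after t, so no future information leaks.
    start = max(0, t - (n - 1))
    return tuple(tokens[start:t])

hint = within_hint([5, 9, 2, 7, 4], t=3, n=3)  # built from tokens[1:3] only
```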
Sequence Length
sequence_length
train_length: 6144
eval_length: 6144
sequence_length
train_length: 2048
eval_length: null
sequence_length
train_length: 6144
eval_length: null
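The record does not say how the 2048- and 6144-token batches are ordered during training; one minimal sketch, assuming a simple cyclic schedule over the three configured lengths:

```python
import itertools

def mixed_length_schedule(lengths, total_steps):
    # Cycle through the configured training lengths so batches of
    # different sequence lengths are interleaved across steps.
    cycle = itertools.cycle(lengths)
    return [next(cycle) for _ in range(total_steps)]

schedule = mixed_length_schedule([6144, 2048, 6144], total_steps=6)
```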
Evaluation
sliding window eval
parameters: {"stride":128,"context_length":6144}
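Sliding-window evaluation re-scores the text in overlapping 6144-token windows, advancing 128 tokens at a time so each token is scored once with near-full left context. A sketch of the window bookkeeping (the model forward pass itself is omitted):

```python
def sliding_windows(n_tokens, context_length=6144, stride=128):
    # Yield (begin, end, score_from): run the model on tokens [begin, end)
    # but only score tokens [score_from, end), so every token is counted
    # exactly once toward the final bpb.
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```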
Quantization
GPTQ
bits: 6
scope: train-data calibration
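GPTQ quantizes weights column by column, using a Hessian built from calibration activations (here drawn from the training data) to compensate rounding error. A minimal sketch of only the 6-bit round-to-nearest grid that GPTQ refines, not the Hessian compensation step:

```python
import torch

def quantize_6bit(w: torch.Tensor):
    # Symmetric per-tensor round-to-nearest onto 6-bit levels (-31..31).
    # GPTQ improves on this by redistributing each column's rounding
    # error into not-yet-quantized columns via the calibration Hessian.
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale

q, scale = quantize_6bit(torch.tensor([31.0, -10.0, 0.6]))
```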
Novel Contributions
- Causal n-gram prefix-only fix for within_hint/word_hint
- Window attention on selected layers via FlashAttention 3
- Mixed sequence-length training across 2048 and 6144 token batches
- Train-data GPTQ calibration for faster quantization setup
- Automatic eval sequence-length detection from maximum training length
- Sliding-window evaluation at 6144 context with stride 128