PR #1895

open

Record: SP8192 + 3GDN/8FA 4096 + Attention Sink

val_bpb: 1.0785
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.934 MB

Training Techniques

Sequence Length
train_length: 4096
eval_length: 4096
Architecture
weight tying
Tied input and output embeddings.
parameters: null
Partial RoPE
Uses partial rotary position embeddings over a subset of head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
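A minimal NumPy sketch of the partial-RoPE idea, rotating 16 of the 64 head dimensions and passing the rest through untouched. The split-half pairing convention and base frequency are assumptions, not taken from the record:

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of each head dim,
    leaving the remaining dimensions untouched (partial RoPE).
    x: (seq, head_dim) array; positions: (seq,) integer positions."""
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)      # (half,) frequencies
    angles = positions[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]         # rotated pair (split-half)
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```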
Gated DeltaNet (GDN)
Replaces attention with GDN mixer layers (based on chunk_gated_delta_rule) in the listed blocks.
parameters: {"layers":[0,1,10]}
XSA
Enables XSA on all full-attention layers.
parameters: {"layers":[2,3,4,5,6,7,8,9]}
attention sink
Adds a learned sink key per KV head, with a zero value vector, in full-attention layers.
parameters: {"sink_keys_per_kv_head":1}
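A single-head NumPy sketch of the sink mechanism: a learned key is prepended to the keys with a zero value vector, so it can absorb attention mass without contributing to the output. The 1/sqrt(d) scaling and layout here are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_sink(q, k, v, sink_key):
    """Single-head attention with one learned sink key whose value is zero.
    q: (tq, d); k, v: (tk, d); sink_key: (d,). Probability assigned to the
    sink contributes nothing to the output (its value vector is zero)."""
    d = q.shape[-1]
    k_ext = np.concatenate([sink_key[None, :], k], axis=0)
    v_ext = np.concatenate([np.zeros((1, v.shape[-1])), v], axis=0)
    scores = q @ k_ext.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v_ext
```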
depth recurrence
Runs a segment of layers in a loop (depth recurrence), activated partway through training.
parameters: {"activated_at_wallclock_fraction":0.4}
LeakyReLU
Uses LeakyReLU(0.5)^2 MLP activations.
parameters: {"slope":0.5}
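A sketch of the activation, assuming the square is applied directly to the LeakyReLU output (the record does not say whether the negative branch keeps its sign after squaring):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with the listed slope, followed by an elementwise square.
    Assumption: the plain square is used, so the negative branch becomes
    0.25 * x**2 (non-negative)."""
    y = np.where(x > 0, x, slope * x)
    return y * y
```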
Regularization
logit softcap
parameters: {"value":20}
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
mixed int6/int8
bits: null
scope: embeddings and selected attention projections
late QAT
bits: null
scope: attn.c_k.weight and attn.proj.weight
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"gradient_clip":1,"learning_rate":0.017}
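A sketch of one update with the listed hyperparameters, assuming per-tensor norm clipping and standard (non-Nesterov) momentum; weight decay is unused since it is listed as null:

```python
import numpy as np

def sgd_step(param, grad, velocity, lr=0.017, momentum=0.9, clip=1.0):
    """One SGD-with-momentum step with gradient-norm clipping.
    Assumption: clipping is applied per tensor before the momentum update."""
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity
```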
LR Schedule
cosine decay
parameters: {"chunk_lr":true}
warmdown
parameters: {"value":0.85}
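One plausible reading of the schedule: hold the base LR, then after 85% of training decay it to zero along a cosine curve ("warmdown"). How this composes with chunk_lr is not specified by the record:

```python
import math

def lr_at(step, total_steps, base_lr=0.017, warmdown_start=0.85):
    """Hypothetical schedule: constant LR until `warmdown_start` of
    training, then cosine decay to zero over the remaining steps."""
    t = step / total_steps
    if t < warmdown_start:
        return base_lr
    frac = (t - warmdown_start) / (1.0 - warmdown_start)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```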
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","trainable_set":"mlp only","learning_rate":0.017,"epochs":5,"chunk_size":65536}
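A schematic of the score-first loop: each chunk is scored with the current weights before the model trains on it, so no chunk is ever evaluated with weights that saw its own text. `model` here is a hypothetical object with `score()` / `train_mlp_only()` methods, not the record's actual API:

```python
def score_first_ttt(model, chunks, lr=0.017, epochs=5):
    """Score-first test-time training: score each chunk, then adapt the
    MLP parameters only (via SGD) on that chunk before moving on.
    Returns the token-weighted average loss over all chunks."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = model.score(chunk)      # score first, with current weights
        total_loss += loss * n
        total_tokens += n
        for _ in range(epochs):           # then adapt MLP params only
            model.train_mlp_only(chunk, lr=lr)
    return total_loss / total_tokens
```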
Compression
Brotli
level: 11

Novel Contributions

  • 3 GDN mixer layers combined with 8 full-attention layers under the SP8192 base shape
  • Learned attention sink keys per KV head for full-attention layers
  • Score-first legal TTT with MLP-only SGD on already-scored chunks
  • Late QAT noise applied only near the end of training to improve quantization robustness
  • Mixed-bit quantization with Hadamard-rotated attention V/O pairs and frequency-split embeddings
  • 4096-token training/evaluation context with a separately compiled full-attention segment