val_bpb: 1.0785
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.934 MB
Training Techniques
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Architecture
weight tying
Tied input and output embeddings.
parameters: null
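A minimal PyTorch sketch of weight tying (module and dimension names are illustrative):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)               # input embedding
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output projection
        self.lm_head.weight = self.wte.weight  # tie: one shared parameter tensor
```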
Partial RoPE
Uses partial rotary position embeddings over a subset of head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
Gated Attention
Replaces attention with GDN (Gated DeltaNet) mixer layers based on chunk_gated_delta_rule in the selected blocks.
parameters: {"layers":[0,1,10]}
XSA
Enables XSA on all full-attention layers.
parameters: {"layers":[2,3,4,5,6,7,8,9]}
attention sink
Adds one learned sink key per KV head, paired with an all-zero value vector, in the full-attention layers.
parameters: {"sink_keys_per_kv_head":1}
depth recurrence
Re-runs a segment of blocks as a depth-recurrent loop, activated after 40% of the wall-clock training budget.
parameters: {"activated_at_wallclock_fraction":0.4}
LeakyReLU
Uses LeakyReLU(0.5)^2 MLP activations.
parameters: {"slope":0.5}
Regularization
logit softcap
parameters: {"value":20}
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
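GPTQ itself orders columns and compensates rounding error with second-order (Hessian) information; the sketch below shows only the symmetric 6-bit round-to-nearest grid it quantizes onto:

```python
import torch

def quantize_rtn(w, bits: int = 6):
    # Round-to-nearest onto a symmetric grid, scaled per output row.
    qmax = 2 ** (bits - 1) - 1                              # 31 for 6 bits
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-12)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                          # int6 values stored in int8
```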
mixed int6/int8
bits: null
scope: embeddings and selected attention projections
late QAT
bits: null
scope: attn.c_k.weight and attn.proj.weight
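A sketch of one way to implement the late QAT pass, assuming uniform noise at the quantization step size; the bit width is an assumption, since the card leaves bits null:

```python
import torch

def add_late_qat_noise(model, bits: int = 6):
    # Near the end of training, perturb the targeted weights with noise at the
    # quantization step size so the model adapts to rounding before quantization.
    for name, p in model.named_parameters():
        if name.endswith(('attn.c_k.weight', 'attn.proj.weight')):
            qmax = 2 ** (bits - 1) - 1
            step = p.detach().abs().amax() / qmax          # quantization step size
            with torch.no_grad():
                p.add_((torch.rand_like(p) - 0.5) * step)  # +/- half a step
```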
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"gradient_clip":1,"learning_rate":0.017}
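A minimal PyTorch training step matching these settings (model and batch are assumed placeholders):

```python
import torch

opt = torch.optim.SGD(model.parameters(), lr=0.017, momentum=0.9)

loss = model(batch)  # assumed: forward returns the training loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient_clip: 1
opt.step()
opt.zero_grad(set_to_none=True)
```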
LR Schedule
cosine decay
parameters: {"chunk_lr":true}
warmdown
parameters: {"value":0.85}
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","trainable_set":"mlp only","learning_rate":0.017,"epochs":5,"chunk_size":65536}
Compression
Brotli
level: 11
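Applying Brotli at its maximum quality setting to the serialized artifact (file names are illustrative):

```python
import brotli

with open('weights.bin', 'rb') as f:
    raw = f.read()
packed = brotli.compress(raw, quality=11)  # quality 11 = maximum compression
with open('weights.br', 'wb') as f:
    f.write(packed)
```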
Novel Contributions
- 3 GDN mixer layers combined with 8 full-attention layers under the SP8192 base shape
- Learned attention sink keys per KV head for full-attention layers
- Score-first legal TTT with MLP-only SGD on already-scored chunks
- Late QAT noise applied only near the end of training to improve quantization robustness
- Mixed-bit quantization with Hadamard-rotated attention V/O pairs and frequency-split embeddings
- 4096-token training/evaluation context with separate compiled full-attention segment