val_bpb: 1.0796
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,989,120 bytes
Training Techniques
Architecture
depth recurrence
3-layer depth recurrence over layers 3-5
parameters: {"layers":3,"start_layer":3,"end_layer":5}
weight tying
The SP8192 tokenizer/model stack uses a tied-embeddings-style compact parameterization inherited from the base model family
parameters: null
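
A sketch of the tied-embedding idea, assuming SP8192 denotes an 8192-entry vocabulary: the output projection reuses the input embedding matrix, which shrinks the artifact:

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Input embedding and output projection share one weight matrix."""
    def __init__(self, vocab_size=8192, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying

    def forward(self, idx):
        return self.lm_head(self.embed(idx))

m = TiedLM()
assert m.lm_head.weight.data_ptr() == m.embed.weight.data_ptr()
```
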
GQA
8 attention heads with 4 KV heads
parameters: {"heads":8,"kv_heads":4}
parallel residuals
Parallel residual connections from layer 7 onward
parameters: {"start_layer":7}
Quantization
GPTQ
bits: 6
scope: matrices and embeddings
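
GPTQ proper quantizes weights column by column, folding each column's rounding error back into the not-yet-quantized columns using second-order (Hessian) information; the sketch below shows only the per-row symmetric int6 round-to-nearest arithmetic that it bottoms out in, not the error compensation itself:

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 6):
    """Per-row symmetric round-to-nearest quantization (the inner step of
    GPTQ, minus its Hessian-based error compensation)."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                   # int6 values carried in int8

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(256, 256)
q, s = quantize_rtn(w, bits=6)
print((w - dequantize(q, s)).abs().mean())  # small reconstruction error
```
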
Weight Averaging
EMA
parameters: {"decay":0.9965}
Compression
brotli
level: null
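
Artifact compression is plain brotli over the serialized weights. Since the level is unspecified (null) above, quality=11 (brotli's maximum) in this sketch is an assumption:

```python
import io

import brotli  # pip install Brotli
import torch

model = torch.nn.Linear(64, 64)
buf = io.BytesIO()
torch.save(model.state_dict(), buf)        # serialize the weights
raw = buf.getvalue()

packed = brotli.compress(raw, quality=11)  # quality is an assumption
print(f"{len(raw)} -> {len(packed)} bytes")
assert brotli.decompress(packed) == raw    # lossless round trip
```
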
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":16384,"learning_rate":0.0075,"epochs":3}
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
Optimizer
AdamW
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022,"warmdown_frac":0.72}
Initialization
OrthoInit
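
A sketch of orthogonal initialization; routing the raised QK gain of 5.75 (see Novel Contributions below) to the query/key projections is my reading, and the `q_proj`/`k_proj` naming convention is hypothetical:

```python
import torch.nn as nn

def ortho_init(model, qk_gain=5.75):
    """Orthogonal init for all Linear weights; query/key projections get
    the raised gain, everything else gain 1.0."""
    for name, m in model.named_modules():
        if isinstance(m, nn.Linear):
            gain = qk_gain if ("q_proj" in name or "k_proj" in name) else 1.0
            nn.init.orthogonal_(m.weight, gain=gain)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```
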
Novel Contributions
- Completed an 8xH100 SP8192 Family3+ run for the 10min/16MB track
- Raised QK gain initialization to 5.75
- Used 16K-token score-first TTT with learning rate 0.0075
- Applied GPTQ (SDClip) quantization: int6 for matrices, int8 for token embeddings
- Used EMA with decay 0.9965 and brotli artifact compression
- Achieved an exact val_bpb of 1.07955293 with legal score-first TTT