PR #1489
Record: SP1024 + Pre-quant TTT + Parallel Residuals — 1.0736 BPB (beats 1.1147 by 3.66%)
by joshkmartinez
val_bpb: 1.0736
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 13.87 MB
Training Techniques
Test-Time Training
full TTT
parameters: {"learning_rate":0.0005,"epochs":6,"freeze_blocks":2,"batch_seqs":32,"grad_clip":1}
Architecture
weight tying
Tied input and output embeddings.
parameters: null
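Weight tying in its usual PyTorch form; the hidden size here is a placeholder, and the 1024-entry vocabulary matches the SP1024 tokenizer listed below:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    # Input embedding and output projection share a single weight matrix.
    def __init__(self, vocab_size=1024, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying

    def forward(self, x):
        h = self.embed(x)  # ... transformer blocks would go here ...
        return self.lm_head(h)
```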
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
RoPE
NTK-aware RoPE used for attention positional encoding.
parameters: {"base":10000,"train_seq":2048}
depth recurrence
Layers 4-5 are looped over for two passes.
parameters: {"loops":2,"start_layer":4,"end_layer":5}
parallel residuals
Parallel residual connections added from deeper layers onward.
parameters: {"start_layer":7}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Optimizer
Muon
weight_decay: 0.085
momentum: 0.99
other_params: {"warmup_momentum_start":0.92}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"prequant_ttt_lr":0.0005,"prequant_ttt_epochs":6}
Compression
brotli
level: null
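For context, a sketch of how the serialized artifact might be Brotli-compressed; the use of `torch.save` and the file layout are assumptions, and the compression level is left at the library default since the PR does not report one:

```python
import io
import brotli  # pip install Brotli
import torch

def compress_artifact(state_dict, path):
    # Serialize the (quantized) weights, then Brotli-compress the byte stream.
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with open(path, "wb") as f:
        f.write(brotli.compress(buf.getvalue()))
```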
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
ETLB (Enhanced Token-Level Blending) used during evaluation to improve val_bpb.
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
cosine decay
parameters: null
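For reference, a cosine decay schedule in the usual form; the warmup handling and minimum LR are assumptions, since the PR lists no schedule parameters:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0, warmup_steps=0):
    # Linear warmup (if any), then cosine decay from base_lr to min_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```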
Regularization
weight decay
parameters: {"value":0.085}
Novel Contributions
- Pre-quantization test-time training: TTT run before GPTQ quantization
- SP1024 custom tokenizer with a 1024-entry vocabulary (see the tokenizer sketch after this list)
- Parallel residual connections from layer 7 onward
- Higher QK-Gain setting of 5.0
- High-decay EMA stabilization
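A hypothetical training call for a 1024-entry SentencePiece tokenizer ("SP1024"); the corpus path, model type, and character coverage below are assumptions, not this PR's settings:

```python
import sentencepiece as spm

# Train a small 1024-piece tokenizer (paths and options are placeholders).
spm.SentencePieceTrainer.train(
    input="train_corpus.txt",
    model_prefix="sp1024",
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="sp1024.model")
print(sp.encode("hello world", out_type=int))
```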