PR #1934

open

Record: SP8192 CaseOps + TTT + GPTQ + LRZIP — val_bpb 1.05993 (3-seed mean)

by liujshi
val_bpb: 1.0599
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,981,105 B
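For readers unfamiliar with the headline metric, val_bpb is bits per byte on the validation set. The sketch below shows the usual conversion, assuming the evaluation loop accumulates negative log-likelihood in nats and counts bytes of the original UTF-8 text. The function name and arguments are hypothetical and not taken from the PR.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Hypothetical helper: assumes the loss is summed over all tokens and
    total_bytes counts the bytes of the raw text those tokens encode.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens keeps the metric comparable across tokenizers with different vocabulary sizes.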

Training Techniques

Architecture
weight tying
Tied embeddings: the input embedding and output projection share weights.
parameters: null
U-Net skip connections
U-Net style skip connections are used.
parameters: null
parallel residuals
Parallel attention and MLP residual paths starting at a later layer.
parameters: {"start_layer":8}
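To illustrate the difference between sequential and parallel residual paths, here is a toy sketch on scalars. It assumes that layers with index at or above start_layer use the parallel form and earlier layers use the usual sequential form. Normalization is omitted, and the `block` function and its callables are hypothetical stand-ins, not the PR's code.

```python
from typing import Callable

Fn = Callable[[float], float]


def block(x: float, attn: Fn, mlp: Fn, layer_idx: int, start_layer: int = 8) -> float:
    """Apply one toy transformer block to a scalar activation."""
    if layer_idx >= start_layer:
        # Parallel residuals: both branches read the same input x.
        return x + attn(x) + mlp(x)
    # Sequential residuals: the MLP reads the attention-updated stream.
    h = x + attn(x)
    return h + mlp(h)
```

In the parallel form the two branches have no data dependency on each other, which is what allows them to be computed concurrently.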
Partial RoPE
Rotary position embeddings applied only to a subset of dimensions.
parameters: {"dimensions":16,"base":10000}
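A minimal sketch of partial RoPE on a single head vector follows. It assumes the rotated subset is the first 16 dimensions and that dimension i is paired with dimension i + 8 (the half-split convention); the PR may pair or order dimensions differently. The function name is hypothetical.

```python
import math
from typing import List


def partial_rope(vec: List[float], pos: int, rot_dims: int = 16,
                 base: float = 10000.0) -> List[float]:
    """Rotate the first rot_dims entries of vec by position-dependent angles.

    Entries beyond rot_dims pass through unchanged.
    """
    if rot_dims % 2 != 0 or rot_dims > len(vec):
        raise ValueError("rot_dims must be even and at most len(vec)")
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        # Frequency for pair i, following the standard RoPE schedule.
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair undergoes a pure rotation, the norm of the rotated slice is preserved, and the untouched dimensions carry position-independent content.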
depth recurrence
Looped recurrence over selected layers.
parameters: {"layers":[3,4,5],"num_loops":2}
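The looping can be pictured as an execution order over layer indices. The sketch below assumes num_loops is the total number of passes through the looped block and that the looped layers are contiguous. The total layer count is not stated here, so num_layers is a hypothetical argument.

```python
from typing import List


def layer_schedule(num_layers: int, loop_layers: List[int], num_loops: int) -> List[int]:
    """Return the order in which layer indices are executed in one forward pass."""
    start, end = loop_layers[0], loop_layers[-1]
    if loop_layers != list(range(start, end + 1)):
        raise ValueError("loop_layers must be contiguous and ascending")
    order = list(range(start))           # layers before the loop run once
    for _ in range(num_loops):           # the looped block repeats, reusing weights
        order.extend(loop_layers)
    order.extend(range(end + 1, num_layers))  # layers after the loop run once
    return order
```

For example, `layer_schedule(8, [3, 4, 5], 2)` returns `[0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]`: effective depth grows by three layers with no additional parameters.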
SmearGate
Smear gate with a fixed window to smooth representations.
parameters: {"window":12}
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: null
KV head count
Grouped-query style attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
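With 8 query heads and 4 KV heads, each KV head serves a group of two query heads. The sketch below shows one common mapping, assuming contiguous grouping; the helper name is hypothetical.

```python
def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index that a given query head attends with."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    if not 0 <= query_head < heads:
        raise ValueError("query_head out of range")
    group_size = heads // kv_heads
    # Contiguous grouping: query heads 0..group_size-1 share KV head 0, and so on.
    return query_head // group_size
```

Halving the KV head count halves the key and value projection parameters, which matters under an artifact size budget.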
Quantization
GPTQ
bits: 6
scope: weights and embeddings
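To make the role of the bit width and clip value concrete, here is a sketch of symmetric 6-bit quantization with a clip threshold. This is plain round-to-nearest, not GPTQ: GPTQ additionally compensates for rounding error using second-order statistics, which is omitted here. The function names and the per-tensor scale are assumptions.

```python
from typing import List, Tuple


def quantize_int6(weights: List[float], clip: float) -> Tuple[List[int], float]:
    """Round-to-nearest symmetric quantization to integer codes in [-31, 31]."""
    if clip <= 0:
        raise ValueError("clip must be positive")
    qmax = 31  # symmetric subset of the signed 6-bit range [-32, 31]
    scale = clip / qmax
    # Values beyond the clip threshold saturate at +/- qmax.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

A tighter clip shrinks the step size for the bulk of the weights at the cost of saturating outliers, which is the trade-off that clip tuning adjusts.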
mixed int6/int7
bits: null
scope: embeddings
LQER
bits: 2
scope: top-3 tensors
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"backend":"Polar-Express Newton-Schulz"}
Regularization
weight decay
parameters: {"embed_wd":0.06,"ttt_weight_decay":1}
logit softcap
parameters: {"value":30}
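A logit softcap bounds logits smoothly instead of hard-clipping them. The sketch below assumes the common tanh form with the cap of 30 listed above; the PR's exact formula is not shown here.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit into (-cap, cap) while staying near-linear around zero."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while very large ones saturate near the cap, which limits how confident any single prediction can become.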
layerwise LN scale
parameters: null
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000,"lora_rank":96}
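The key property of score-first evaluation is ordering: every chunk is scored with the model state from before any adaptation on that chunk. The toy sketch below illustrates only that ordering. It swaps in a count-based unigram model for the real network and its LoRA updates, and it does not model the phases or prefix_docs settings. All names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def score_first_eval(chunks: List[List[int]], vocab_size: int) -> float:
    """Return total NLL in nats, scoring each chunk before adapting on it."""
    counts: Counter = Counter()
    total = 0
    nll = 0.0
    for chunk in chunks:
        # 1) Score: uses only statistics gathered from earlier chunks.
        for tok in chunk:
            prob = (counts[tok] + 1) / (total + vocab_size)  # Laplace smoothing
            nll -= math.log(prob)
        # 2) Adapt: only now may the model see this chunk's tokens.
        counts.update(chunk)
        total += len(chunk)
    return nll
```

Because scoring always precedes adaptation, no token contributes to its own prediction, so the reported loss remains a valid held-out measurement.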
Compression
lrzip+brotli
level: 9
Evaluation
phased TTT eval
parameters: {"phases":3,"prefix_docs":2000}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
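A warmdown_frac of 0.75 suggests the learning rate decays over the final 75% of training. The sketch below assumes a constant rate followed by a linear decay to zero; the actual decay shape and any warmup are not specified here, and the function name is hypothetical.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.75) -> float:
    """Return the factor applied to the base learning rate at a given step."""
    if not 0.0 < warmdown_frac <= 1.0:
        raise ValueError("warmdown_frac must be in (0, 1]")
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmdown_start:
        return 1.0  # constant phase before the warmdown begins
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```

With 1000 total steps, the multiplier stays at 1.0 until step 250, reaches 0.5 at step 625, and hits 0.0 at step 1000.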

Novel Contributions

  • CaseOps bijective case transform for SP8192
  • Per-group lrzip + brotli compression pipeline
  • Tightened GPTQ quantization clips
  • Lower embed weight decay
  • Phased score-first TTT evaluation
  • Sparse attention head-output gate and SmearGate
  • Depth recurrence with parallel residuals and U-Net skips
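The CaseOps details are not given in this summary. As a rough illustration of what an invertible case transform can look like, the sketch below lowercases ASCII capitals and prefixes each with a marker character, so the tokenizer sees mostly lowercase text while the original is exactly recoverable. The marker choice, the ASCII-only scope, and both function names are assumptions and may differ from the PR.

```python
MARK = "\ue000"  # hypothetical private-use marker; must not occur in the input


def encode_case(text: str) -> str:
    """Replace each ASCII capital with MARK followed by its lowercase form."""
    if MARK in text:
        raise ValueError("input already contains the marker character")
    return "".join(MARK + ch.lower() if "A" <= ch <= "Z" else ch for ch in text)


def decode_case(encoded: str) -> str:
    """Invert encode_case by uppercasing the character after each MARK."""
    out = []
    upper_next = False
    for ch in encoded:
        if ch == MARK:
            upper_next = True
        elif upper_next:
            out.append(ch.upper())
            upper_next = False
        else:
            out.append(ch)
    return "".join(out)
```

Round-tripping any marker-free string through `encode_case` and `decode_case` returns it unchanged, which is the property a lossless evaluation pipeline needs.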