PR #1934
open
Record: SP8192 CaseOps + TTT + GPTQ + LRZIP — val_bpb 1.05993 (3-seed mean)
by liujshi
val_bpb
1.0599
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,981,105 B
Training Techniques
Architecture
weight tying
Tied embeddings / embedding tying used in the model.
parameters: null
U-Net skip connections
U-Net style skip connections are used.
parameters: null
parallel residuals
Parallel attention and MLP residual paths starting at a later layer.
parameters: {"start_layer":8}
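To make the difference between sequential and parallel residual paths concrete, here is a minimal sketch that uses toy scalar callables in place of the attention and MLP sublayers. The function names, the 0-indexed layer numbering, and the omission of normalization are assumptions for illustration; only start_layer=8 comes from the listing.

```python
def sequential_block(x, attn, mlp):
    # Standard ordering: the MLP reads the output of the attention residual.
    x = x + attn(x)
    return x + mlp(x)


def parallel_block(x, attn, mlp):
    # Parallel ordering: both sublayers read the same input and their
    # outputs are summed into a single residual update.
    return x + attn(x) + mlp(x)


def run_layers(x, num_layers, attn, mlp, start_layer=8):
    # Assumption: layers are 0-indexed and every layer at or after
    # start_layer switches to the parallel form.
    for i in range(num_layers):
        block = parallel_block if i >= start_layer else sequential_block
        x = block(x, attn, mlp)
    return x
```

In a real model each layer would have its own attention and MLP weights; the shared callables here only keep the sketch short.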
Partial RoPE
Rotary position embeddings applied only to a subset of dimensions.
parameters: {"dimensions":16,"base":10000}
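A minimal sketch of partial RoPE on a single head vector, assuming the rotated dimensions are the first 16 and that pairs are formed in the half-split style (entry i with entry i + 8). The listing gives only the dimension count and the base, so the pairing convention and the function name are assumptions.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first rotary_dims entries of vec; leave the rest unchanged."""
    if rotary_dims % 2 != 0 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and no larger than len(vec)")
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair gets its own frequency, decreasing geometrically with i.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]  # assumed half-split pairing
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair undergoes a plain 2D rotation, the norm of the rotated slice is preserved, and the dimensions past rotary_dims carry no positional signal.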
depth recurrence
Looped recurrence over selected layers.
parameters: {"layers":[3,4,5],"num_loops":2}
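The execution order implied by looping over layers 3 to 5 can be sketched as below. This assumes num_loops counts the total number of passes through the looped block and that the looped layers are contiguous; the PR may define either differently, and the total layer count used in the example is hypothetical.

```python
def layer_schedule(num_layers, loop_layers=(3, 4, 5), num_loops=2):
    """Return the order in which layer indices are executed."""
    first, last = loop_layers[0], loop_layers[-1]
    if list(loop_layers) != list(range(first, last + 1)):
        raise ValueError("loop_layers must be a contiguous ascending range")
    if last >= num_layers:
        raise ValueError("loop_layers must lie inside the layer stack")
    before = list(range(first))
    after = list(range(last + 1, num_layers))
    # Assumption: the looped block runs num_loops times in total,
    # reusing the same weights on every pass.
    return before + list(loop_layers) * num_loops + after


# Hypothetical 8-layer stack:
# layer_schedule(8) -> [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```

The point of the technique is that the schedule is longer than the parameter list: effective depth grows while the stored weights do not.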
SmearGate
Smear gate with a fixed window to smooth representations.
parameters: {"window":12}
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: null
KV head count
Grouped-query style attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
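With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. The sketch below shows one common mapping, assuming consecutive query heads share a KV head; the helper name is illustrative and the PR may group heads differently.

```python
def kv_head_for_query_head(q_head, heads=8, kv_heads=4):
    """Map a query head index to the KV head it reads from."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    if not 0 <= q_head < heads:
        raise ValueError("q_head out of range")
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size


# [kv_head_for_query_head(q) for q in range(8)] -> [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV head count halves the size of the key and value projections relative to full multi-head attention.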
Quantization
GPTQ
bits: 6
scope: weights and embeddings
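For intuition about what 6-bit storage and a tighter clip mean, here is a sketch of plain symmetric round-to-nearest quantization with an optional clip. This is deliberately not GPTQ, which additionally compensates rounding error across weights; the function names and the per-tensor scale are assumptions.

```python
def quantize_symmetric(values, bits=6, clip=None):
    """Round-to-nearest quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    if not values:
        return [], 1.0
    max_abs = max(abs(v) for v in values)
    # A clip below max_abs saturates outliers but gives finer steps
    # to the remaining values.
    clip = max_abs if clip is None else clip
    if clip <= 0:
        return [0] * len(values), 1.0
    scale = clip / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]
```

Lowering clip trades saturation error on the largest weights for lower rounding error everywhere else, which is the trade-off a clip-tuning step explores.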
mixed int6/int7
bits: null
scope: embeddings
LQER
bits: 2
scope: top-3 tensors
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"backend":"Polar-Express Newton-Schulz"}
Regularization
weight decay
parameters: {"embed_wd":0.06,"ttt_weight_decay":1}
logit softcap
parameters: {"value":30}
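A logit softcap is commonly implemented as a scaled tanh, which is the form sketched here. The listing gives only the cap value of 30, so the tanh form itself is an assumption.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly squash a logit so its magnitude never exceeds cap."""
    # Near zero this is close to the identity; far from zero it saturates.
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while extreme logits are bounded, which limits how confident any single prediction can become.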
layerwise LN scale
parameters: null
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000,"lora_rank":96}
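The defining property of score-first evaluation is ordering: each chunk is scored with the model state from before that chunk, and only then is the model adapted on it. The toy below illustrates that ordering with a smoothed unigram counter standing in for the real model and its LoRA updates. The stand-in model, the chunking, and all names are assumptions, and the phases and prefix_docs parameters are not modeled.

```python
import math


def score_first_eval(chunks, vocab_size):
    """Return mean bits per token when every chunk is scored before adapting."""
    counts = [1] * vocab_size  # toy stand-in for adaptable model state
    total_bits = 0.0
    num_tokens = 0
    for chunk in chunks:
        total = sum(counts)
        # 1) Score the chunk with the state frozen from before this chunk.
        for tok in chunk:
            total_bits += -math.log2(counts[tok] / total)
            num_tokens += 1
        # 2) Only after scoring, adapt on the tokens just evaluated.
        for tok in chunk:
            counts[tok] += 1
    return total_bits / num_tokens
```

Because adaptation never sees a token before it has been scored, the reported loss is not contaminated by training on the evaluation targets.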
Compression
lrzip+brotli
level: 9
Evaluation
phased TTT eval
parameters: {"phases":3,"prefix_docs":2000}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
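A warmdown fraction of 0.75 means the learning rate is held for the first quarter of training and decays over the remaining three quarters. The sketch below assumes a step-based schedule, a linear decay to zero, and no warmup; none of those details are stated in the listing.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.75):
    """Multiplier applied to the base learning rate at a given step."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmdown_frac <= 0:
        return 1.0  # no warmdown configured
    progress = min(max(step / total_steps, 0.0), 1.0)
    warmdown_start = 1.0 - warmdown_frac
    if progress < warmdown_start:
        return 1.0  # constant phase
    # Assumed linear decay from 1.0 at warmdown_start to 0.0 at the end.
    return (1.0 - progress) / warmdown_frac
```

For a hypothetical 1000-step run, the multiplier stays at 1.0 through step 250, reaches 0.5 at step 625, and hits 0.0 at step 1000.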
Novel Contributions
- CaseOps bijective case transform for SP8192
- Per-group lrzip + brotli compression pipeline
- Tightened GPTQ quantization clips
- Lower embed weight decay
- Phased score-first TTT evaluation
- Sparse attention head-output gate and SmearGate
- Depth recurrence with parallel residuals and U-Net skips
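
The listing does not describe how the CaseOps transform works. As a purely hypothetical illustration of what a reversible case transform can look like, the sketch below lowercases text and records each uppercase letter with a marker character, so that a tokenizer sees mostly lowercase input and the original can be recovered exactly. The marker choice, the function names, and the whole scheme are assumptions, not the PR's method.

```python
UPPER_NEXT = "\ue000"  # hypothetical marker from the Unicode private-use area


def encode_case(text):
    """Lowercase text, prefixing each originally uppercase letter with a marker."""
    if UPPER_NEXT in text:
        raise ValueError("input already contains the marker character")
    out = []
    for ch in text:
        low = ch.lower()
        # Only transform letters whose case mapping round-trips exactly.
        if low != ch and len(low) == 1 and low.upper() == ch:
            out.append(UPPER_NEXT + low)
        else:
            out.append(ch)
    return "".join(out)


def decode_case(encoded):
    """Invert encode_case."""
    out = []
    upper_pending = False
    for ch in encoded:
        if ch == UPPER_NEXT:
            upper_pending = True
        elif upper_pending:
            out.append(ch.upper())
            upper_pending = False
        else:
            out.append(ch)
    return "".join(out)
```

Characters whose case mapping is not a clean one-to-one round trip are passed through untouched, which is what keeps the transform lossless on any input that does not contain the marker.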