PR #1858
openRecord: Score-First TTT + PPM-D Byte Mixture — mix_bpb 0.9946 (3-seed mean)
by G3sparky
val_bpb
0.9946
Architecture
Transformer
Optimizer
SGD
Artifact Size
15,997,375 bytes
Training Techniques
Test-Time Training
score-first TTT
parameters: {"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9}
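A minimal sketch of the score-first ordering, using a toy adaptive byte-unigram as a stand-in for the real network (the model, chunk size, and gradient details are illustrative assumptions; only the 3 epochs per chunk, lr 0.005, and momentum 0.9 come from the parameters above). The point is the ordering: each chunk is scored with the current weights *before* any update sees it, so the bit count stays legal.

```python
import numpy as np

def score_first_ttt(data, chunk_size=512, epochs_per_chunk=3, lr=0.005, momentum=0.9):
    """Toy score-first TTT loop: score each chunk, then adapt on it."""
    logits = np.zeros(256)            # stand-in model: learned byte unigram
    vel = np.zeros(256)               # SGD momentum buffer
    total_bits = 0.0
    for start in range(0, len(data), chunk_size):
        chunk = np.frombuffer(data[start:start + chunk_size], dtype=np.uint8)
        p = np.exp(logits - logits.max()); p /= p.sum()
        total_bits += -np.log2(p[chunk]).sum()     # 1) score FIRST ...
        counts = np.bincount(chunk, minlength=256)
        for _ in range(epochs_per_chunk):          # 2) ... then 3 SGD epochs
            p = np.exp(logits - logits.max()); p /= p.sum()
            grad = p - counts / len(chunk)         # mean cross-entropy gradient
            vel = momentum * vel - lr * grad
            logits = logits + vel
    return total_bits / len(data)                  # bits per byte
```

Because the first chunk is always scored with the untrained model, a single-chunk input costs exactly 8 bpb here; adaptation only pays off from the second chunk on.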
Evaluation
sliding window eval
parameters: null
Quantization
GPTQ
bits: 6
scope: attention/MLP; int8 embeddings
Architecture
depth recurrence
Layers 3-5 are looped as a recurrent block
parameters: {"layers":[3,4,5],"num_loops":2,"activated_at_frac":0.35}
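Assuming the recurrence runs layers 3-5 as one block repeated `num_loops` times once training has passed 35% of its steps (the exact unrolling is not spelled out above, so this is one plausible reading), the forward control flow might look like:

```python
def forward_with_recurrence(x, layers, block=(3, 5), num_loops=2,
                            train_frac=1.0, activated_at_frac=0.35):
    """Run a layer stack, looping layers block[0]..block[1] `num_loops`
    times once training progress reaches `activated_at_frac`."""
    lo, hi = block
    reps = num_loops if train_frac >= activated_at_frac else 1
    for layer in layers[:lo]:          # prologue layers, run once
        x = layer(x)
    for _ in range(reps):              # recurrent block, run `reps` times
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:      # epilogue layers, run once
        x = layer(x)
    return x
```

With a 7-layer stack this visits layers in the order 0,1,2,3,4,5,3,4,5,6 after activation, and 0..6 once before it.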
weight tying
Tied embeddings
parameters: null
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":"16/64"}
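"16/64" means only the first 16 of each head's 64 dimensions get rotary embeddings; the remaining 48 pass through untouched. A sketch assuming the common first-half/second-half pairing convention (the PR does not state which pairing it uses):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """x: (T, d). Apply RoPE to the first `rot_dims` dims only."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)          # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]       # paired rotary channels
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)  # rest unrotated
```

Position 0 is a zero-angle rotation, so it comes back unchanged, and dims 16-63 are identical to the input at every position.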
LeakyReLU
Leaky ReLU activation used in MLP
parameters: {"slope":0.5}
XSA
XSA applied to all layers
parameters: null
GQA
Grouped-query attention with 4 KV heads
parameters: {"kv_heads":4}
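In GQA each of the 4 KV heads serves a contiguous group of query heads, so the KV cache shrinks by the group factor. A minimal causal-attention sketch (head counts and dims here are illustrative; the PR only fixes kv_heads=4):

```python
import numpy as np

def gqa(q, k, v):
    """Causal grouped-query attention.
    q: (T, n_q, d); k, v: (T, n_kv, d) with n_kv dividing n_q."""
    T, n_q, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv                        # query heads per KV head
    k = np.repeat(k, group, axis=1)            # share each KV head across its group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal: no future positions
    scores = np.where(mask[None], -1e30, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)
```

At position 0 the causal mask leaves only the token itself, so the output there is exactly the (group-repeated) value vector.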
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
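Logit softcapping is commonly implemented as the tanh form z -> cap * tanh(z / cap), which is near-identity for small logits and smoothly bounded in (-cap, cap); with the value 30 above:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits: ~identity near 0, saturates at +/- cap."""
    return cap * np.tanh(np.asarray(logits, dtype=np.float64) / cap)
```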
Optimizer
SGD
weight_decay: 0.095
momentum: 0.9
other_params: {"muon_variant":"MuonEq-R","newton_schulz_steps":5}
Weight Averaging
EMA
parameters: {"decay":0.9965}
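The EMA weights are a running exponential average of the training weights, one update per step with decay 0.9965:

```python
def ema_update(ema, params, decay=0.9965):
    """One EMA step per parameter tensor: ema <- decay*ema + (1-decay)*params."""
    return {name: decay * e + (1.0 - decay) * params[name]
            for name, e in ema.items()}
```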
LR Schedule
cosine decay
parameters: {"warmdown_frac":0.72}
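The schedule shape implied by `warmdown_frac` is not fully specified above; one common reading (assumed here) is a flat phase followed by a cosine decay to zero over the final 72% of training:

```python
import math

def lr_at(step, total_steps, base_lr=0.005, warmdown_frac=0.72):
    """Assumed shape: constant base_lr, then cosine warmdown to 0
    over the last `warmdown_frac` of the run."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    t = (step - start) / max(1, total_steps - start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))
```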
Compression
lzma
level: null
brotli
level: 11
Other
other
PPM-D byte mixture with binary-lambda gate for eval-time probability mixing
parameters: {"order":5,"confidence_threshold":0.9,"lambda_high":0.05,"lambda_low":0.9}
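One reading of the binary-lambda gate (an assumption; the parameter semantics are not spelled out above) is that lambda weights the neural model: when PPM-D is confident, i.e. its max symbol probability reaches the 0.9 threshold, the neural weight drops to 0.05, otherwise it stays at 0.9. A sketch of just the mixing step, with the PPM-D model itself abstracted away:

```python
import numpy as np

def gated_mix(p_neural, p_ppm, confidence_threshold=0.9,
              lambda_high=0.05, lambda_low=0.9):
    """Binary-lambda mixture of two next-byte distributions.
    Assumed gate: lambda is the neural weight; lean on PPM when it's confident."""
    lam = lambda_high if p_ppm.max() >= confidence_threshold else lambda_low
    mixed = lam * p_neural + (1.0 - lam) * p_ppm
    return mixed / mixed.sum()        # renormalize against fp drift
```

Since both inputs are distributions, the convex combination is itself a distribution; the gate just flips which model dominates per position.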
Novel Contributions
- Legal score-first TTT with 3 SGD epochs per chunk
- PPM-D byte mixture with score-before-update ordering
- Binary-lambda gate for mixing neural and PPM-D probabilities
- Self-extracting LZMA-compressed code wrapper
- Brotli-11 model compression