PR #2098
open
Record: PR #1873 base + tuned PPM gate (T=0.7/H=0.99/L=0.3) — val_bpb 0.80051 (3-seed mean)
by joshuaswanson
val_bpb
0.8005
Architecture
Transformer
Optimizer
—
Artifact Size
<16MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP; embeddings int8
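To make the bit widths above concrete, here is a minimal sketch of symmetric round-to-nearest quantization. It is not GPTQ: GPTQ additionally compensates rounding error using second-order statistics, which is omitted here. The function names and the per-row scale are assumptions for illustration only, not taken from the PR.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization of a list of floats.

    Plain rounding only; GPTQ's error compensation is NOT implemented here.
    """
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits, 127 for int8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp so every code fits in the signed range [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical weight row, quantized to 6 bits as in the attention/MLP scope.
row = [0.31, -0.12, 0.05, -0.27]
codes, scale = quantize_symmetric(row, bits=6)
restored = dequantize(codes, scale)
```

Calling the same helper with `bits=8` gives the int8 range used for the embeddings.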
Architecture
weight tying
Tied token embeddings
parameters: null
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":"16/64"}
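One way to read "16/64" is that only 16 of the 64 dimensions in each attention head are rotated, while the remaining 48 pass through unchanged. The sketch below follows that reading. The choice of the first 16 dimensions, the interleaved pairing, and the base of 10000 are assumptions, not details confirmed by the PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest as-is.

    Assumed layout: consecutive pairs (2i, 2i+1) share one rotation angle.
    """
    out = list(vec)
    for i in range(rotary_dims // 2):
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * cos_t - y * sin_t
        out[2 * i + 1] = x * sin_t + y * cos_t
    return out


# Hypothetical 64-dimensional head vector at token position 3.
head = [float(i + 1) for i in range(64)]
rotated = partial_rope(head, position=3)
```

Because each pair is only rotated, the norm of the rotated slice is preserved, and the untouched 48 dimensions carry no positional signal.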
depth recurrence
Encoder/decoder layer recurrence with repeated layer loops
parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10]}
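The two index lists can be read as an execution schedule over a shared pool of layers, so some layers run more than once. The sketch below only traces that schedule with stand-in layers. It assumes the decoder pass runs after the encoder pass and ignores any skip connections between them, which the summary does not describe.

```python
from collections import Counter

ENCODER = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER = [5, 3, 4, 5, 6, 7, 8, 9, 10]


def run_schedule(x, layers, schedule):
    """Apply layers in schedule order; repeated indices reuse the same weights."""
    for idx in schedule:
        x = layers[idx](x)
    return x


def make_layer(i):
    """Stand-in layer (hypothetical): records its index instead of computing."""
    return lambda trace: trace + [i]


layers = [make_layer(i) for i in range(11)]
trace = run_schedule(run_schedule([], layers, ENCODER), layers, DECODER)
uses = Counter(trace)
```

Under this reading, 11 distinct layers yield 17 layer applications, with layers 3, 4, and 5 each applied three times.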
LeakyReLU
LeakyReLU activation, with a squared variant mentioned in the inherited stack
parameters: {"slope":0.5}
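A minimal sketch of the activation with the listed slope, plus one plausible form of the squared variant. Whether the inherited stack squares the output exactly this way (for example, whether it keeps the sign) is not stated in the summary, so `leaky_relu_squared` is an assumption.

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU: identity for non-negative inputs, `slope * x` otherwise."""
    return x if x >= 0 else slope * x


def leaky_relu_squared(x, slope=0.5):
    """Assumed squared variant: square the LeakyReLU output (sign is lost)."""
    y = leaky_relu(x, slope)
    return y * y
```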
Regularization
layerwise LN scale
parameters: null
Test-Time Training
TTT
parameters: {"learning_rate":0.008,"epochs":4}
Other
other
Causal byte-level PPM-D mixture with tuned confidence gate over NN and PPM log-probabilities
parameters: {"PPM_C":0.7,"PPM_LHI":0.99,"PPM_LLO":0.3,"PPM_ORDER":5}
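The summary does not spell out how the three gate parameters interact, so the following is a sketch under stated assumptions: `PPM_C` is treated as a threshold on the PPM model's top predicted probability, `PPM_LHI` as the weight on the NN when PPM is below that threshold, and `PPM_LLO` as the weight on the NN when PPM is at or above it. The mixture is taken in probability space. To stay causal, the confidence must be computed from the PPM prediction before the byte is observed.

```python
import math

PPM_C, PPM_LHI, PPM_LLO = 0.7, 0.99, 0.3


def mixed_logprob(nn_logp, ppm_logp, ppm_confidence,
                  c=PPM_C, l_hi=PPM_LHI, l_lo=PPM_LLO):
    """Natural-log probability of the observed byte under a gated mixture.

    Assumed semantics (not confirmed by the PR): `lam` is the NN weight,
    and a confident PPM prediction lowers it from `l_hi` to `l_lo`.
    """
    lam = l_lo if ppm_confidence >= c else l_hi
    p = lam * math.exp(nn_logp) + (1.0 - lam) * math.exp(ppm_logp)
    return math.log(p)


# PPM is confident (0.9 >= 0.7): the NN gets weight 0.3.
confident = mixed_logprob(math.log(0.5), math.log(0.9), ppm_confidence=0.9)
# PPM is unsure (0.4 < 0.7): the NN gets weight 0.99.
unsure = mixed_logprob(math.log(0.5), math.log(0.2), ppm_confidence=0.4)
```

In the first call the mixed probability is 0.3 × 0.5 + 0.7 × 0.9 = 0.78; in the second it is 0.99 × 0.5 + 0.01 × 0.2 = 0.497.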
Novel Contributions
- Offline sweep of PPM gate hyperparameters on dumped NN distribution
- Improved causal PPM-D gate settings: PPM_C=0.7, PPM_LHI=0.99, PPM_LLO=0.3
- Direct lineage from PR #1873 with byte-identical training pipeline and only runtime hyperparameter changes
- 3-seed mean val_bpb improved from 0.82006 to 0.80051 (lower is better)
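An offline sweep of this kind can be sketched as a grid search over gate settings, scored on per-byte log-probabilities that were dumped once from the NN and the PPM model. Everything below is illustrative: the record layout `(nn_logp, ppm_logp, ppm_conf)`, the gate semantics, the grids, and the toy data are assumptions, not the PR's actual dump format or search space.

```python
import itertools
import math


def bits_per_byte(records, c, l_hi, l_lo):
    """Mean code length in bits per byte under the assumed gated mixture."""
    total_bits = 0.0
    for nn_logp, ppm_logp, ppm_conf in records:
        lam = l_lo if ppm_conf >= c else l_hi      # assumed: lam weights the NN
        p = lam * math.exp(nn_logp) + (1.0 - lam) * math.exp(ppm_logp)
        total_bits -= math.log2(p)
    return total_bits / len(records)


def sweep(records, c_grid, hi_grid, lo_grid):
    """Return the (c, l_hi, l_lo) triple with the lowest bits per byte."""
    grid = itertools.product(c_grid, hi_grid, lo_grid)
    best = min(grid, key=lambda g: bits_per_byte(records, *g))
    return best, bits_per_byte(records, *best)


# Toy dump (hypothetical): one record per byte, log-probs in nats.
records = [
    (math.log(0.5), math.log(0.9), 0.9),
    (math.log(0.6), math.log(0.1), 0.4),
    (math.log(0.2), math.log(0.8), 0.8),
    (math.log(0.7), math.log(0.3), 0.5),
]
best, best_bpb = sweep(records, [0.5, 0.7, 0.9], [0.9, 0.99], [0.1, 0.3, 0.5])
```

Because the NN outputs are dumped once, each grid point costs only a pass over the records rather than a model forward pass, which is consistent with the note that only runtime hyperparameters changed.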