PR #1738

open

Record: PR #1735 + CaseOps Tokenizer V15 (val_bpb 1.03540, mean of 3 seeds)

by alertcat
val_bpb
1.0354
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.996 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
Architecture
depth recurrence
3-layer depth recurrence / virtual layers in the model stack
parameters: {"layers":3}
weight tying
Tied embeddings / embedding tying is implied by the inherited stack
parameters: null
U-Net skip connections
Parallel residual connections used in the inherited stack
parameters: null
Gated Attention
QK-Gain 5.25 attention scaling used in the inherited stack
parameters: {"qk_gain":5.25}
Test-Time Training
full TTT
parameters: {"epochs":21,"parallel_gpus":8,"federated_averaging":true}
LR Schedule
cosine decay
parameters: {"scope":"epoch-level","t_max":21,"eta_min_ratio":0.1}
Evaluation
sliding window eval
parameters: {"stride":64}
Compression
lzma
level: null
Other
other
CaseOps lossless-case tokenizer with byte sidecar for honest BPB accounting
parameters: null
Weight Averaging
EMA
parameters: {"disabled_in_final_run":true}
Regularization
weight decay
parameters: {"value":0}
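
One way to read the LR Schedule entry above (cosine decay, epoch-level, t_max 21, eta_min_ratio 0.1) is the standard closed-form cosine annealing evaluated once per TTT epoch. The sketch below assumes that reading. The function name `cosine_lr` and the `base_lr` argument are hypothetical, since the listing does not state the base learning rate or how the PR implements the schedule.

```python
import math


def cosine_lr(epoch: int, base_lr: float, t_max: int = 21,
              eta_min_ratio: float = 0.1) -> float:
    """Closed-form cosine annealing, stepped once per epoch.

    base_lr is an assumption: the listing gives only t_max and eta_min_ratio.
    The floor is expressed as a fraction of the base learning rate.
    """
    eta_min = base_lr * eta_min_ratio
    t = min(max(epoch, 0), t_max)  # clamp so the rate stays at the floor after t_max
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t / t_max))


if __name__ == "__main__":
    # Print the per-epoch multiplier of a unit base learning rate.
    for epoch in (0, 7, 14, 21):
        print(epoch, round(cosine_lr(epoch, base_lr=1.0), 4))
```

Under this reading the learning rate starts at `base_lr` in epoch 0 and reaches 10% of it at epoch 21.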

Novel Contributions

  • Integrated CaseOps tokenizer byte-sidecar support into PR #1735's evaluation pipeline for honest BPB accounting
  • Combined SP8192 + 3-layer recurrence + parallel residuals + QK-Gain + pre-quant AdamW TTT with CaseOps tokenizer
  • Added validation byte-sidecar loading and excluded sidecar files from token globbing to avoid double counting
  • Disabled experimental TTT EMA after ablation showed it hurt performance
  • Achieved a 3-seed mean val_bpb of 1.03540, beating the record threshold
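
The byte-sidecar accounting and the glob exclusion described in the list above can be illustrated with a minimal sketch. It assumes that the summed validation loss is in nats and that the sidecar supplies the original byte count of the validation text. The file names, the `.bytes.bin` suffix, and both function names are hypothetical and are not taken from the PR.

```python
import math
from pathlib import Path


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per byte.

    total_bytes is assumed to come from the byte sidecar, i.e. the byte count
    of the original text rather than of the case-transformed token stream.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def token_shards(data_dir, pattern: str = "*.bin",
                 sidecar_suffix: str = ".bytes.bin") -> list:
    """List token shard files, skipping byte-sidecar files.

    A broad pattern such as "*.bin" would also match a sidecar named
    "val_000.bytes.bin" (hypothetical naming); filtering it out keeps the
    sidecar from being read as tokens and counted twice.
    """
    return sorted(
        p for p in Path(data_dir).glob(pattern)
        if not p.name.endswith(sidecar_suffix)
    )
```

Dividing by the sidecar's byte count rather than a count derived from the tokenized stream keeps the denominator tied to the original text, which is presumably what the listing means by honest BPB accounting.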