PR #1738

open

Record: PR #1735 + CaseOps Tokenizer V15 (val_bpb 1.03540, mean of 3 seeds)

by alertcat

val_bpb

1.0354

Architecture

Transformer

Optimizer

AdamW

Artifact Size

15.996 MB

Training Techniques

Quantization

GPTQ

bits: 6

scope: all
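For orientation, the sketch below shows what a 6-bit weight grid looks like. It uses plain symmetric round-to-nearest quantization, which is a deliberately simpler stand-in and not GPTQ's error-compensated procedure. The function names and the per-tensor scale are assumptions made for illustration only.

```python
def quantize_rtn(weights, bits=6):
    """Symmetric round-to-nearest quantization (NOT GPTQ; illustration only)."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed 6-bit range [-32, 31].
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]          # toy values, not from the PR
codes, scale = quantize_rtn(weights, bits=6)
restored = dequantize(codes, scale)
```

GPTQ additionally adjusts the not-yet-quantized weights to compensate for each rounding error, which this sketch omits.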

Architecture

depth recurrence

3-layer depth recurrence / virtual layers in the model stack

parameters: {"layers":3}
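A minimal sketch of the depth-recurrence idea, assuming the 3 layers are physical blocks whose weights are reused on each pass. The loop count and the toy layer functions are hypothetical; the listing does not state how many times the stack is repeated.

```python
def run_with_depth_recurrence(x, layers, loops):
    """Apply the same layer stack `loops` times, reusing the same weights."""
    for _ in range(loops):
        for layer in layers:
            x = layer(x)
    return x


# Toy stand-ins for transformer blocks (plain functions, for illustration).
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
loops = 2                                  # hypothetical loop count
virtual_depth = len(layers) * loops        # 6 virtual layers from 3 physical ones
out = run_with_depth_recurrence(2, layers, loops)
```

The point of the design is that virtual depth grows with the loop count while the stored parameter count stays that of the physical layers.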

weight tying

Tied embeddings, implied by the inherited stack

parameters: null

U-Net skip connections

Parallel residual connections used in the inherited stack

parameters: null

Gated Attention

QK-Gain 5.25 attention scaling used in the inherited stack

parameters: {"qk_gain":5.25}

Test-Time Training

full TTT

parameters: {"epochs":21,"parallel_gpus":8,"federated_averaging":true}
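The averaging step behind `federated_averaging` can be sketched as an element-wise mean of per-replica parameters. This is an assumption-laden illustration: the listing only says that 8 GPUs run in parallel with federated averaging enabled, so when the averaging happens and the parameter layout used below are hypothetical.

```python
def federated_average(replica_params):
    """Element-wise mean of parameters across replicas.

    `replica_params` is a list of dicts mapping a parameter name to a flat
    list of floats (a hypothetical layout chosen for this sketch).
    """
    n = len(replica_params)
    names = replica_params[0].keys()
    return {
        name: [sum(vals) / n for vals in zip(*(p[name] for p in replica_params))]
        for name in names
    }


# Two toy replicas; the listed run uses 8.
replicas = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
averaged = federated_average(replicas)
```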

LR Schedule

cosine decay

parameters: {"scope":"epoch-level","t_max":21,"eta_min_ratio":0.1}
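The listed parameters correspond to a standard cosine schedule stepped once per epoch. The sketch below assumes the usual closed form and a hypothetical base learning rate, since the listing gives only `t_max` and `eta_min_ratio`.

```python
import math


def epoch_lr(epoch, base_lr, t_max=21, eta_min_ratio=0.1):
    """Cosine decay from base_lr (epoch 0) to base_lr * eta_min_ratio (epoch t_max)."""
    eta_min = base_lr * eta_min_ratio
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / t_max))
    return eta_min + (base_lr - eta_min) * cosine


BASE_LR = 1e-3                             # hypothetical value, not from the PR
schedule = [epoch_lr(e, BASE_LR) for e in range(21)]
```

With these settings the rate falls smoothly toward 10% of its starting value rather than to zero.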

Evaluation

sliding window eval

parameters: {"stride":64}
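Sliding-window evaluation with a stride can be sketched as follows: each window advances by `stride` tokens, and only the tokens not already scored by an earlier window count toward the loss. The context length below is a hypothetical value, and the helper is an illustration of the general technique, not the PR's code.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Return (begin, end, score_from) spans.

    The model sees tokens [begin, end); only tokens [score_from, end) are
    scored, so every token is scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans


spans = sliding_windows(n_tokens=300, context_len=128, stride=64)
scored = sum(end - score_from for _, end, score_from in spans)
```

A smaller stride gives each scored token more left context at the cost of more forward passes.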

Compression

lzma

level: null
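As a rough illustration of how an lzma-compressed artifact size might be measured, the sketch below compresses a byte payload with Python's standard `lzma` module at its default preset, matching the unspecified level. Whether the listed 15.996 MB counts 10^6 bytes or 2^20 bytes is not stated; the helper assumes 10^6.

```python
import lzma


def compressed_size_mb(payload: bytes) -> float:
    """Size of the lzma-compressed payload in MB (assuming 1 MB = 10**6 bytes)."""
    return len(lzma.compress(payload)) / 1_000_000


# Toy payload standing in for serialized model weights.
payload = bytes(1_000_000)
size_mb = compressed_size_mb(payload)
round_trip_ok = lzma.decompress(lzma.compress(payload)) == payload
```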

Other

other

CaseOps lossless-case tokenizer with byte sidecar for honest BPB accounting

parameters: null
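The byte-sidecar accounting can be sketched as dividing total loss in bits by the number of original bytes, with the byte counts read from a sidecar rather than inferred from the tokens. The per-token layout and the idea that case-marker tokens carry zero bytes are assumptions made for this illustration.

```python
import math


def bits_per_byte(token_nll_nats, token_byte_counts):
    """BPB = total negative log-likelihood in bits / total original bytes.

    `token_byte_counts` plays the role of the sidecar: the number of original
    bytes each token accounts for (hypothetical layout). A case-marker token
    would contribute 0 bytes, but its loss still counts in the numerator.
    """
    total_bits = sum(token_nll_nats) / math.log(2)
    total_bytes = sum(token_byte_counts)
    return total_bits / total_bytes


# Toy example: 4 tokens at 2 bits each, covering 8 original bytes.
nll = [2 * math.log(2)] * 4
byte_counts = [4, 0, 3, 1]
bpb = bits_per_byte(nll, byte_counts)
```

Keeping the denominator tied to the original byte count prevents extra tokenizer symbols from making the score look better than it is.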

Weight Averaging

EMA

parameters: {"disabled_in_final_run":true}

Regularization

weight decay

parameters: {"value":0}

Novel Contributions

Integrated CaseOps tokenizer byte-sidecar support into PR #1735's evaluation pipeline for honest BPB accounting
Combined SP8192 + 3-layer recurrence + parallel residuals + QK-Gain + pre-quant AdamW TTT with the CaseOps tokenizer
Added validation byte-sidecar loading and excluded sidecar files from token globbing to avoid double counting
Disabled the experimental TTT EMA after an ablation showed it hurt performance
Achieved a 3-seed mean val_bpb of 1.03540, beating the record threshold
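The sidecar exclusion mentioned in the contributions above can be sketched as a filename filter: sidecar files are separated out before token shards are collected, so they are never read as tokens. The file names, the glob pattern, and the `.bytes.bin` suffix are all hypothetical; the listing does not give the actual naming scheme.

```python
from fnmatch import fnmatch


def split_shards(filenames, token_pattern="*.bin", sidecar_suffix=".bytes.bin"):
    """Separate token shards from byte-sidecar files (hypothetical naming)."""
    sidecars = sorted(f for f in filenames if f.endswith(sidecar_suffix))
    tokens = sorted(
        f for f in filenames
        if fnmatch(f, token_pattern) and not f.endswith(sidecar_suffix)
    )
    return tokens, sidecars


names = ["val_000.bin", "val_000.bytes.bin", "val_001.bin", "val_001.bytes.bin"]
token_shards, sidecar_shards = split_shards(names)
```

Without the exclusion, a broad pattern such as `*.bin` would match both kinds of file and the sidecar contents would be counted as validation tokens.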