val_bpb: 1.0354
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,995,750 bytes
Training Techniques
Test-Time Training
- Full TTT; parameters: {"rank":8,"epochs":21,"federated_averaging":true,"lr_schedule":"epoch-level cosine"}
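The TTT entry pairs rank-8 adapters with federated averaging across parallel workers. As a minimal sketch of the averaging step only, assuming each rank holds its own adapter matrix (the function name and the 2x2 shapes are illustrative, not from the submission):

```python
def federated_average(adapters):
    """Element-wise mean of per-worker adapter matrices.

    adapters: list of worker states, each a list-of-rows matrix.
    Returns one averaged matrix of the same shape.
    """
    n = len(adapters)
    rows, cols = len(adapters[0]), len(adapters[0][0])
    return [[sum(a[r][c] for a in adapters) / n for c in range(cols)]
            for r in range(rows)]

# Two workers with tiny 2x2 "adapters" for brevity (rank 8 in the real run).
w0 = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[3.0, 4.0], [5.0, 6.0]]
avg = federated_average([w0, w1])
```

Each worker adapts on its own shard, then the averaged adapter is what survives into the exported artifact.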
Architecture
- XSA: applied to all layers
- Depth recurrence: 3-layer depth recurrence over layers 3-5; parameters: {"layers":[3,4,5]}
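Depth recurrence reuses a block of layers instead of adding new ones, buying effective depth without extra parameters. A sketch that loops the weight-shared block over layers 3-5, assuming a loop count of 2 (the submission does not state the count):

```python
def forward(x, layers, recurrent=(3, 4, 5), loops=2):
    """Apply a layer list; the layers named in `recurrent` are run as
    one weight-shared block `loops` times (depth recurrence)."""
    lo, hi = min(recurrent), max(recurrent)
    for layer in layers[:lo]:          # layers before the recurrent block
        x = layer(x)
    for _ in range(loops):             # reuse the same weights each pass
        for i in range(lo, hi + 1):
            x = layers[i](x)
    for layer in layers[hi + 1:]:      # layers after the recurrent block
        x = layer(x)
    return x

# Toy "layers": layer i just adds i, so the path is easy to trace.
layers = [lambda x, i=i: x + i for i in range(8)]
y = forward(0, layers)
```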
- U-Net skip connections: parallel residual path from layer 7 onward; parameters: {"start_layer":7}
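A U-Net-style parallel residual keeps a copy of the activation entering the start layer and merges it back downstream. A sketch under the assumption that the branch taken at layer 7 is added back at the final output:

```python
def forward_with_parallel_residual(x, layers, start_layer=7):
    """Run layers sequentially; from `start_layer` onward a saved copy of
    the activation rides a parallel path and is added to the output."""
    skip = None
    for i, layer in enumerate(layers):
        if i == start_layer:
            skip = x               # branch point for the parallel path
        x = layer(x)
    return x + skip if skip is not None else x

# Toy layers that each add 1; the skip carries the value entering layer 7.
layers = [lambda x: x + 1 for _ in range(10)]
y = forward_with_parallel_residual(0, layers)
```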
- LeakyReLU: LeakyReLU(0.5)^2 MLP activation; parameters: {"slope":0.5}
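The listed activation squares a LeakyReLU with slope 0.5; note that squaring makes both branches non-negative. A direct scalar sketch:

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring:
    act(x) = LeakyReLU(x) ** 2.
    For x < 0 this yields (slope * x) ** 2, which is positive."""
    y = x if x >= 0 else slope * x
    return y * y
```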
- KV head count: 8-head attention with 4 KV heads; parameters: {"heads":8,"kv_heads":4}
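With 8 query heads over 4 KV heads (grouped-query attention), each KV head is shared by two query heads, halving the KV projection and cache. The index mapping is simply:

```python
def kv_head_for(q_head, heads=8, kv_heads=4):
    """Grouped-query attention: query heads are split into
    heads // kv_heads groups, each group reading one KV head."""
    group_size = heads // kv_heads   # 2 query heads per KV head here
    return q_head // group_size
```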
- RoPE: partial RoPE lineage referenced in the stack
Weight Averaging
- EMA
- SWA
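EMA and SWA both average weight snapshots, but differently: EMA keeps an exponentially decayed running average, SWA an equal-weight running mean over checkpoints. Minimal per-parameter sketches (the decay value is illustrative; neither is specified above):

```python
def ema_update(avg, w, decay=0.99):
    """Exponential moving average: recent weights dominate."""
    return [decay * a + (1 - decay) * v for a, v in zip(avg, w)]

def swa_update(avg, w, n):
    """Stochastic weight averaging: equal-weight mean after n
    prior snapshots have been folded in."""
    return [(a * n + v) / (n + 1) for a, v in zip(avg, w)]
```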
Optimizer
- AdamW (weight decay and momentum unspecified); other params: {"parallel_ranks":8,"epochs":21}
Quantization
- GPTQ: 6-bit, applied to model matrices
- int8: 8-bit, applied to embeddings
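GPTQ itself compensates rounding error with second-order calibration statistics, which is beyond a short sketch; the int8 embedding path, though, is close to plain symmetric round-to-nearest. An illustrative per-row version (not the submission's exact code):

```python
def quantize_int8(row):
    """Symmetric per-row int8 quantization: scale so the largest
    magnitude maps to 127, then round each value."""
    scale = max(abs(v) for v in row) / 127 or 1.0  # guard all-zero rows
    q = [round(v / scale) for v in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int8 codes."""
    return [v * scale for v in q]

q, scale = quantize_int8([1.0, -0.5])
```

Per-row scales keep the worst-case rounding error bounded by half a quantization step for that row.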
Compression
- Brotli (level unspecified)
- LZMA (level unspecified)
Evaluation
- Sliding window eval; parameters: {"stride":64}
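Sliding-window evaluation advances a fixed context window by the stride and scores only the newly exposed tokens, so every token is scored exactly once with ample left context. A sketch of the window schedule, assuming a 512-token window (the eval length is not given above):

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Return (start, end, score_from) spans: each window covers
    [start, end) but only tokens in [score_from, end) count toward
    the loss; the first window scores everything it covers."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(640, window=512, stride=64)
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes.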
Sequence Length
- Train and eval lengths unspecified
Regularization
- Weight decay; parameters: {"high_wd":true}
LR Schedule
- Cosine decay; parameters: {"epoch_level":true}
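Epoch-level cosine decay emits one learning-rate value per epoch rather than per step. A sketch over the 21 TTT epochs, with an illustrative peak LR (the actual value is not stated above):

```python
import math

def cosine_lr(epoch, total_epochs=21, lr_max=1e-3, lr_min=0.0):
    """One LR per epoch, decaying on a half-cosine from lr_max at
    epoch 0 to lr_min at the final epoch."""
    t = epoch / (total_epochs - 1)          # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```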
Novel Contributions
- Combines CaseOps reversible capitalization tokenization with pre-quant TTT
- Adds byte-sidecar validation accounting for transformed CaseOps tokens
- Threads original-byte sidecars through validation and sliding evaluation
- Uses pre-quant AdamW TTT before GPTQ export to improve the fixed artifact
- Achieves a new record mean val_bpb of 1.03540487 on track_10min_16mb