PR #581 (closed)

Record: 11L Sidecar48 + Enhanced TTT (cosine LR, 20 epochs) — 1.0698 BPB (3-seed mean)

by teddyoweh
val_bpb: 1.0698
Architecture: Transformer
Optimizer: AdamW
Artifact Size: < 16 MB

Training Techniques

  • Test-Time Training: full TTT
    parameters: {"epochs":20,"learning_rate":0.0005,"min_learning_rate":0.00002}
  • LR Schedule: cosine decay
    parameters: {"start_lr":0.0005,"end_lr":0.00002,"warmup_epochs":1}
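The TTT schedule above (1-epoch linear warmup, then cosine decay from 0.0005 to 0.00002) can be sketched as a small helper. The function name and 0-indexed epoch convention are assumptions; only the parameter values come from the submission.

```python
import math

def ttt_lr(epoch, epochs=20, start_lr=5e-4, end_lr=2e-5, warmup_epochs=1):
    """Cosine-decay learning rate for test-time training with linear warmup.

    Parameter values match the submission; the exact warmup shape is an
    assumption (linear ramp from 0 up to start_lr over warmup_epochs).
    """
    if epoch < warmup_epochs:
        # Linear warmup toward start_lr.
        return start_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from start_lr down to end_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * progress))
```

Evaluating `ttt_lr` at epochs 0..19 gives the per-epoch rate: it peaks at 0.0005 after warmup and decays monotonically toward 0.00002.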
  • Regularization: weight decay
    parameters: {"weight_decay":0.01}
  • Architecture: SharedSparseSidecar. Shared sparse sidecar module added to the transformer, used in layers 8-10 with 48 hidden units.
    parameters: {"hidden":48,"layers":[8,9,10]}
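A minimal sketch of what a shared sparse sidecar could look like: one small bottleneck MLP whose weights are reused at each attached layer. The submission only reports hidden=48 and layers=[8, 9, 10]; the ReLU bottleneck, residual add, and init scale here are assumptions.

```python
import numpy as np

class SharedSparseSidecar:
    """One small MLP shared across several transformer layers (sketch).

    "Shared" means a single weight pair is reused at every attached
    layer; "sparse" is realized here via a ReLU bottleneck (assumption).
    """
    def __init__(self, d_model, hidden=48, layers=(8, 9, 10), seed=0):
        rng = np.random.default_rng(seed)
        self.active_layers = set(layers)
        # Single weight pair reused at layers 8-10.
        self.w_in = rng.standard_normal((d_model, hidden)) * 0.02
        self.w_out = rng.standard_normal((hidden, d_model)) * 0.02

    def __call__(self, x, layer_idx):
        if layer_idx not in self.active_layers:
            return x  # inactive outside the configured layers
        h = np.maximum(x @ self.w_in, 0.0)  # sparse ReLU bottleneck
        return x + h @ self.w_out           # residual update
```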
  • Architecture: BigramHash. BigramHash embeddings used in place of standard token embeddings.
    parameters: {"vocab":2048,"dim":96}
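One plausible reading of BigramHash: each position looks up an embedding keyed by a hash of the (previous, current) token pair into 2048 buckets of dimension 96. The hash constant and the zero-padding of the first position are illustrative assumptions; only vocab=2048 and dim=96 come from the submission.

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab=2048):
    """Embed each position by hashing its (previous, current) token bigram.

    The multiplier 1000003 is an arbitrary mixing prime (assumption);
    position 0 is padded with a previous-token id of 0 (assumption).
    """
    prev = [0] + list(tokens[:-1])
    idx = [(p * 1000003 + t) % vocab for p, t in zip(prev, tokens)]
    return table[idx]  # shape (len(tokens), dim)

# Hashed embedding table: 2048 buckets x 96 dims, as reported.
table = np.random.default_rng(0).standard_normal((2048, 96)) * 0.02
emb = bigram_hash_embed([5, 17, 17, 9], table)
```

Identical bigrams always hash to the same bucket, so repeated contexts share an embedding row.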
  • Architecture: SmearGate. Gating mechanism used within the architecture.
    parameters: null
  • Architecture: U-Net skip connections. U-Net-style skip connections added to the transformer.
    parameters: null
  • Weight Averaging: EMA
    parameters: {"decay":0.997}
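The EMA update with decay 0.997 is a one-liner per parameter. The dict-of-arrays representation is an illustrative assumption; only the decay value comes from the submission.

```python
def ema_update(avg, params, decay=0.997):
    """Exponential moving average of model weights.

    avg and params are parallel dicts of parameter values; each averaged
    weight keeps 99.7% of its old value and takes 0.3% of the new one.
    """
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```

Called once per training step, this keeps a smoothed copy of the weights that is typically used for evaluation instead of the raw weights.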
  • Initialization: orthogonal weight initialization
  • Quantization: mixed int6 (bits: 6, scope: model weights)
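A minimal sketch of int6 weight quantization, assuming symmetric per-tensor scaling into the range [-31, 31]. "Mixed" presumably means some tensors stay at higher precision, which is not shown here; the scaling scheme is an assumption, as the submission reports only the bit width and scope.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization (6 bits: integers -31..31).

    Per-tensor max-abs scaling is an assumption; values are stored in an
    int8 container since NumPy has no 6-bit dtype.
    """
    m = float(np.abs(w).max())
    scale = m / 31.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```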
  • Compression: zstd (level: 22)
  • Evaluation: sliding window eval
    parameters: {"stride":64}
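Sliding-window evaluation with a stride smaller than the context window scores every token with long context while counting each token exactly once. The window length of 256 below is an illustrative assumption; the submission reports only stride=64.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (window_start, score_start) pairs for sliding-window eval.

    Each window spans [window_start, window_start + window). Only its
    last `stride` tokens are scored (the whole first window is scored),
    so consecutive windows overlap for context but never double-count.
    """
    for s in range(0, max(1, n_tokens - window + 1), stride):
        score_from = s if s == 0 else s + window - stride
        yield s, score_from
```

For a 512-token sequence this produces windows starting at 0, 64, ..., 256 whose scored segments tile positions 0-511 exactly once.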

Novel Contributions

  • Extended test-time training from 10 to 20 epochs
  • Replaced flat TTT learning rate with cosine decay from 0.0005 to 0.00002
  • Added 1-epoch linear warmup to stabilize TTT
  • Introduced weight decay of 0.01 during TTT to reduce overfitting
  • Achieved a new leaderboard record with 1.0698 BPB mean over 3 seeds