PR #581 (closed)

Record: 11L Sidecar48 + Enhanced TTT (cosine LR, 20 epochs) — 1.0698 BPB (3-seed mean)

by teddyoweh
val_bpb: 1.0698
Architecture: Transformer
Optimizer: AdamW
Artifact Size: < 16 MB

Training Techniques

  • Test-Time Training: full TTT
    parameters: {"epochs":20,"learning_rate":0.0005,"min_learning_rate":0.00002}
  • LR Schedule: cosine decay
    parameters: {"start_lr":0.0005,"end_lr":0.00002,"warmup_epochs":1}
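The TTT schedule above (1-epoch linear warmup, then cosine decay from 0.0005 to 0.00002) can be sketched as a small helper. The function name and 0-indexed epoch convention are assumptions; only the parameter values come from the submission.

```python
import math

def ttt_lr(epoch, epochs=20, start_lr=5e-4, end_lr=2e-5, warmup_epochs=1):
    """Cosine-decay learning rate for test-time training with linear warmup.

    Parameter values match the submission; the exact warmup shape is an
    assumption (linear ramp from 0 up to start_lr over warmup_epochs).
    """
    if epoch < warmup_epochs:
        # Linear warmup toward start_lr.
        return start_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from start_lr down to end_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * progress))
```

Evaluating `ttt_lr` at epochs 0..19 gives the per-epoch rate: it peaks at 0.0005 after warmup and decays monotonically toward 0.00002.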
  • Regularization: weight decay
    parameters: {"weight_decay":0.01}
  • Architecture: SharedSparseSidecar. Shared sparse sidecar module added to the transformer, used in layers 8-10 with 48 hidden units.
    parameters: {"hidden":48,"layers":[8,9,10]}
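A minimal sketch of what a shared sparse sidecar could look like: one small bottleneck MLP whose weights are reused at each attached layer. The submission only reports hidden=48 and layers=[8, 9, 10]; the ReLU bottleneck, residual add, and init scale here are assumptions.

```python
import numpy as np

class SharedSparseSidecar:
    """One small MLP shared across several transformer layers (sketch).

    "Shared" means a single weight pair is reused at every attached
    layer; "sparse" is realized here via a ReLU bottleneck (assumption).
    """
    def __init__(self, d_model, hidden=48, layers=(8, 9, 10), seed=0):
        rng = np.random.default_rng(seed)
        self.active_layers = set(layers)
        # Single weight pair reused at layers 8-10.
        self.w_in = rng.standard_normal((d_model, hidden)) * 0.02
        self.w_out = rng.standard_normal((hidden, d_model)) * 0.02

    def __call__(self, x, layer_idx):
        if layer_idx not in self.active_layers:
            return x  # inactive outside the configured layers
        h = np.maximum(x @ self.w_in, 0.0)  # sparse ReLU bottleneck
        return x + h @ self.w_out           # residual update
```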
  • Architecture: BigramHash. BigramHash embeddings used in place of standard token embeddings.
    parameters: {"vocab":2048,"dim":96}
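One plausible reading of BigramHash: each position looks up an embedding keyed by a hash of the (previous, current) token pair into 2048 buckets of dimension 96. The hash constant and the zero-padding of the first position are illustrative assumptions; only vocab=2048 and dim=96 come from the submission.

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab=2048):
    """Embed each position by hashing its (previous, current) token bigram.

    The multiplier 1000003 is an arbitrary mixing prime (assumption);
    position 0 is padded with a previous-token id of 0 (assumption).
    """
    prev = [0] + list(tokens[:-1])
    idx = [(p * 1000003 + t) % vocab for p, t in zip(prev, tokens)]
    return table[idx]  # shape (len(tokens), dim)

# Hashed embedding table: 2048 buckets x 96 dims, as reported.
table = np.random.default_rng(0).standard_normal((2048, 96)) * 0.02
emb = bigram_hash_embed([5, 17, 17, 9], table)
```

Identical bigrams always hash to the same bucket, so repeated contexts share an embedding row.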
  • Architecture: SmearGate. Gating mechanism used within the architecture.
    parameters: null
  • Architecture: U-Net skip connections. U-Net-style skip connections added to the transformer.
    parameters: null
  • Weight Averaging: EMA
    parameters: {"decay":0.997}
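The EMA update with decay 0.997 is a one-liner per parameter. The dict-of-arrays representation is an illustrative assumption; only the decay value comes from the submission.

```python
def ema_update(avg, params, decay=0.997):
    """Exponential moving average of model weights.

    avg and params are parallel dicts of parameter values; each averaged
    weight keeps 99.7% of its old value and takes 0.3% of the new one.
    """
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```

Called once per training step, this keeps a smoothed copy of the weights that is typically used for evaluation instead of the raw weights.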
  • Initialization: orthogonal weight initialization
  • Quantization: mixed int6 (bits: 6, scope: model weights)
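A minimal sketch of int6 weight quantization, assuming symmetric per-tensor scaling into the range [-31, 31]. "Mixed" presumably means some tensors stay at higher precision, which is not shown here; the scaling scheme is an assumption, as the submission reports only the bit width and scope.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization (6 bits: integers -31..31).

    Per-tensor max-abs scaling is an assumption; values are stored in an
    int8 container since NumPy has no 6-bit dtype.
    """
    m = float(np.abs(w).max())
    scale = m / 31.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```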
  • Compression: zstd (level: 22)
  • Evaluation: sliding window eval
    parameters: {"stride":64}
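Sliding-window evaluation with a stride smaller than the context window scores every token with long context while counting each token exactly once. The window length of 256 below is an illustrative assumption; the submission reports only stride=64.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (window_start, score_start) pairs for sliding-window eval.

    Each window spans [window_start, window_start + window). Only its
    last `stride` tokens are scored (the whole first window is scored),
    so consecutive windows overlap for context but never double-count.
    """
    for s in range(0, max(1, n_tokens - window + 1), stride):
        score_from = s if s == 0 else s + window - stride
        yield s, score_from
```

For a 512-token sequence this produces windows starting at 0, 64, ..., 256 whose scored segments tile positions 0-511 exactly once.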

Novel Contributions

  • Extended test-time training from 10 to 20 epochs
  • Replaced flat TTT learning rate with cosine decay from 0.0005 to 0.00002
  • Added 1-epoch linear warmup to stabilize TTT
  • Introduced weight decay of 0.01 during TTT to reduce overfitting
  • Achieved a new leaderboard record with 1.0698 BPB mean over 3 seeds