PR #661

open

Non-record: 30ep Cosine TTT on SwiGLU + U-Net (1xH100, val_bpb=1.1175)

by andrewbaggio1
val_bpb: 1.1175
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 7.5 MB

Training Techniques

Test-Time Training
full TTT
parameters: {"epochs":30,"schedule":"cosine","seed":1337}
LR Schedule
cosine decay
parameters: {"ttt_epochs":30}
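A minimal sketch of the cosine TTT learning-rate schedule. Only the cosine shape and epochs=30 come from this PR; base_lr and min_lr are illustrative placeholders.

```python
import math

def cosine_ttt_lr(epoch, total_epochs=30, base_lr=1e-4, min_lr=0.0):
    # Cosine decay from base_lr at epoch 0 down to min_lr at the final epoch.
    # base_lr/min_lr are assumed values; the PR states only schedule and epochs.
    t = epoch / max(total_epochs - 1, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```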
Evaluation
sliding window eval
parameters: {"stride":64}
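How stride=64 sliding-window evaluation enumerates scoring spans, as a sketch: the first window scores every token it covers, and each subsequent window scores only its final stride tokens. The window size of 512 here is an assumption; only the stride comes from the PR.

```python
def sliding_eval_spans(seq_len, window=512, stride=64):
    # Returns (start, end, n_scored) triples; later windows score only their
    # last `stride` tokens so each token is scored exactly once.
    spans = []
    start = 0
    while start + window <= seq_len:
        scored = window if start == 0 else stride
        spans.append((start, start + window, scored))
        start += stride
    return spans
```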
Architecture
SwiGLU
SwiGLU MLP variant used in the model stack
parameters: {"hidden":1792}
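The SwiGLU MLP computes down(silu(x @ W_gate) * (x @ W_up)). A dependency-free sketch with nested-list weights; the PR's hidden size is 1792, but tiny dimensions suffice for illustration.

```python
import math

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU: a SiLU-activated gate elementwise-multiplies an ungated "up"
    # projection, then a "down" projection maps back to model dim.
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    def silu(z):
        return z / (1.0 + math.exp(-z))
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```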
U-Net
U-Net style gated skip connections
parameters: null
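One plausible form of a gated U-Net skip: decoder activations plus a sigmoid-gated copy of the matching encoder activations. The PR lists no parameters for this component, so the scalar gate parameterization below is an assumption.

```python
import math

def gated_skip(decoder_h, encoder_h, gate_logit):
    # gate_logit is a learned scalar (assumed form); sigmoid keeps the
    # skip weight in (0, 1) so the skip can be smoothly switched off.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [d + g * e for d, e in zip(decoder_h, encoder_h)]
```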
BigramHash
Bigram hashing component for token representation
parameters: {"buckets":8192}
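A sketch of bucketing a (previous, current) token-id pair for an auxiliary bigram embedding table. The bucket count of 8192 comes from the PR; the mixing constants in the hash are assumptions.

```python
def bigram_bucket(prev_token, token, n_buckets=8192):
    # Mix the pair with a multiplicative hash (constants are illustrative),
    # then reduce into one of n_buckets embedding slots.
    h = (prev_token * 1000003) ^ (token * 8191)
    return h % n_buckets
```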
SmearGate
Gating mechanism included in the architecture
parameters: null
Partial RoPE
Rotary positional embeddings applied to only a 16-dimension slice of each head
parameters: {"dimensions":16}
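Partial RoPE rotates only the first rot_dims entries of each head vector and passes the rest through unchanged. rot_dims=16 is from the PR; the half-split pairing layout and base=10000 are conventional assumptions.

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate dimension pairs (i, i + rot_dims//2) by a position-dependent
    # angle; dimensions beyond rot_dims are left untouched.
    half = rot_dims // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```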
KV head count
Uses 8 KV heads in the attention stack
parameters: {"kv_heads":8,"heads":8,"layers":11,"dim":512}
Weight Averaging
EMA
parameters: {"decay":0.9985}
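One EMA weight-averaging step with the PR's decay of 0.9985, sketched over flat parameter dicts for simplicity.

```python
def ema_update(ema_params, params, decay=0.9985):
    # Shadow weights track the live weights with exponential decay;
    # decay=0.9985 is the value given in the PR.
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in ema_params}
```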
Quantization
int6
bits: 6
scope: all
Compression
zstd
level: 22
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
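The layerwise LN scale follows directly from the stated formula, assuming a 0-based layer index:

```python
import math

def ln_scale_init(layer_index):
    # Per-layer LayerNorm scale of 1/sqrt(layer+1), as given in the PR's
    # regularization parameters; deeper layers get smaller initial scales.
    return 1.0 / math.sqrt(layer_index + 1)
```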

Novel Contributions

  • Increased TTT epochs from 10 to 30 while keeping the PR #462 architecture unchanged
  • Applied test-time training with a cosine schedule and per-layer learning rates
  • Used the SwiGLU + U-Net architecture with gated skip connections
  • Combined BigramHash, SmearGate, Partial RoPE, EMA, Late QAT, and Int6 + zstd compression
  • Reported an improved sliding-window val_bpb of 1.1175 on a single H100