PR #509

open

Non-record: Cosine TTT 30ep on SwiGLU + U-Net (1xH100, val_bpb=1.1175)

by andrewbaggio1
val_bpb: 1.1175
Architecture: SwiGLU + U-Net gated skip architecture
Optimizer:
Artifact Size: 7.5 MB

Training Techniques

Architecture
SwiGLU
11-layer SwiGLU MLP with hidden dimension 1792
parameters: {"layers":11,"hidden":1792}
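As a rough illustration, one SwiGLU layer computes down(SiLU(x·W_gate) ⊙ (x·W_up)); the sketch below is a hypothetical numpy version (weight shapes and initialization scale are assumptions), with the PR's hidden dimension of 1792 and its model stacking 11 such layers.

```python
import numpy as np

def swiglu_layer(x, W_gate, W_up, W_down):
    """One SwiGLU MLP layer: down(silu(x @ W_gate) * (x @ W_up)).

    Hypothetical sketch; the PR's actual model stacks 11 such
    layers with hidden dimension 1792.
    """
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU (swish) activation
    return (silu * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, hidden = 64, 1792   # d_model is a placeholder
x = rng.standard_normal((4, d_model))
W_gate = rng.standard_normal((d_model, hidden)) * 0.02
W_up = rng.standard_normal((d_model, hidden)) * 0.02
W_down = rng.standard_normal((hidden, d_model)) * 0.02
y = swiglu_layer(x, W_gate, W_up, W_down)  # shape preserved: (4, d_model)
```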
U-Net gated skips
U-Net skip connections with learned sigmoid gating
parameters: null
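A learned sigmoid gate on a U-Net skip can be sketched as below; the gate logits here are a hypothetical per-channel parameter vector (the PR does not specify the gate's granularity). At zero-initialized logits, the gate passes half of the encoder signal through.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(decoder_h, encoder_h, gate_logits):
    """U-Net-style skip connection with a learned sigmoid gate:
    out = decoder + sigmoid(g) * encoder.

    gate_logits would be trained parameters; per-channel here
    is an assumption.
    """
    return decoder_h + sigmoid(gate_logits) * encoder_h

h_dec = np.ones((2, 8))
h_enc = np.full((2, 8), 2.0)
out = gated_skip(h_dec, h_enc, np.zeros(8))  # gate = 0.5 everywhere
```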
BigramHash
Bigram hashing with 8192 buckets and 128 dimension embeddings
parameters: {"buckets":8192,"dimensions":128}
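Bigram hashing maps each (previous, current) token pair into a fixed bucket table of embeddings. A minimal sketch with the PR's 8192 buckets and 128-dim embeddings; the multiplicative hash constant is an arbitrary choice, not the PR's actual hash function.

```python
import numpy as np

N_BUCKETS, EMB_DIM = 8192, 128

def bigram_bucket(prev_tok, cur_tok, n_buckets=N_BUCKETS):
    """Hash a (prev, cur) token pair to a bucket index.

    The odd multiplier is a generic mixing constant; the hash
    used by the PR is not specified here.
    """
    return (prev_tok * 2654435761 + cur_tok) % n_buckets

table = np.zeros((N_BUCKETS, EMB_DIM))  # learned embedding table
tokens = [5, 17, 17, 42]
# Embed each bigram (tokens[i-1], tokens[i]) for i >= 1.
ids = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
embs = table[ids]
```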
SmearGate
SmearGate mechanism included
parameters: null
Partial RoPE
Partial Rotary Positional Embeddings applied to 16 dims
parameters: {"dimensions":16}
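Partial RoPE rotates only the first few dimensions of each head vector and leaves the rest position-independent. A sketch with the PR's 16 rotated dims; the pairing layout (first half vs. second half) and frequency base are common conventions, not confirmed details of this PR.

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply RoPE to the first rot_dims dimensions of a head
    vector; pass the remaining dims through unchanged."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2, rest = x[:half], x[half:rot_dims], x[rot_dims:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos])
    return np.concatenate([rotated, rest])

v = np.ones(32)
out0 = partial_rope(v, pos=0)   # position 0: rotation is the identity
```

Rotations preserve the norm of each rotated pair, so the rotated slice keeps its length at every position.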
LN Scale
LayerNorm scale applied as 1/sqrt(layer+1)
parameters: null
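The depth-dependent scale is a one-liner; as stated, layer i's LayerNorm output is scaled by 1/sqrt(i+1), so deeper layers contribute progressively less at initialization.

```python
import math

def ln_scale(layer_idx):
    """LayerNorm output scale for layer layer_idx: 1/sqrt(layer+1)."""
    return 1.0 / math.sqrt(layer_idx + 1)

# Scales for the 11-layer stack described above.
scales = [ln_scale(i) for i in range(11)]
```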
Weight Averaging
EMA
parameters: {"decay":0.9985}
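EMA weight averaging keeps a shadow copy of the parameters updated as ema ← decay·ema + (1−decay)·current. A minimal sketch using the PR's decay of 0.9985; the dict-of-arrays parameter layout is an assumption.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.9985):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}

# Toy run: EMA starting at 0 tracking a constant parameter of 1.
ema = {"w": np.zeros(3)}
live = {"w": np.ones(3)}
for _ in range(1000):
    ema = ema_update(ema, live)
# After n steps the EMA equals 1 - decay**n exactly.
```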
Quantization
Int6 QAT
bits: 6
scope: null
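A generic int6 QAT forward pass fake-quantizes weights to 6-bit levels while keeping float values; symmetric per-tensor scaling below is an assumption (the scope field above is unspecified), and a real QAT setup would pair this with a straight-through estimator in the backward pass.

```python
import numpy as np

def fake_quant_int6(w):
    """Fake quantization to signed 6-bit levels (symmetric,
    per-tensor scale): quantize to the int6 grid, then dequantize."""
    qmax = 2 ** (6 - 1) - 1                     # 31 for signed int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant_int6(w)   # at most 64 distinct levels, error <= scale/2
```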
Compression
zstd
level: 22
Test-Time Training
full TTT
parameters: {"epochs":30,"lr_schedule":"cosine decay"}
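The TTT schedule is a standard cosine decay stretched over 30 epochs. A sketch of the per-epoch learning rate; base_lr and min_lr are placeholders, as the PR's actual TTT learning rate is not listed here.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# One LR value per TTT epoch, for the PR's 30-epoch run.
lrs = [cosine_lr(s, 30) for s in range(31)]
```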
Evaluation
sliding window eval
parameters: {"stride":64}
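Sliding-window evaluation advances a fixed context window by the stride and scores only the newly revealed tokens, so every token is scored exactly once with near-full context. The helper below enumerates the spans; the window size of 512 is a placeholder, since the PR only pins the stride (64).

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Yield (start, end, n_scored) spans for sliding-window eval.

    The first window scores all its tokens; each later window
    scores only its last `stride` tokens (fewer for the final,
    possibly partial window).
    """
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        scored = end - start if start == 0 else end - (start + window - stride)
        spans.append((start, end, scored))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(1000, window=512, stride=64)
# Scored-token counts over all spans sum to n_tokens.
```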
LR Schedule
cosine decay
parameters: null

Novel Contributions

  • Extends PR #462's SwiGLU + U-Net architecture by running 30 epochs of cosine learning-rate decay during test-time training (TTT) instead of the default 10
  • Improves val_bpb from 1.2531 to 1.1175 (-10.8%) on 1xH100 by increasing TTT epochs
  • Confirms consistency with prior PRs (#481 and #486) on the benefits of cosine TTT scheduling and longer TTT runs
  • Provides timing estimates and plans for 8xH100 verification and for tuning TTT epochs against the quality/time tradeoff