PR #509

open

Non-record: Cosine TTT 30ep on SwiGLU + U-Net (1xH100, val_bpb=1.1175)

by andrewbaggio1
val_bpb: 1.1175
Architecture: SwiGLU + U-Net gated skip architecture
Optimizer:
Artifact Size: 7.5 MB

Training Techniques

Architecture
SwiGLU
11-layer SwiGLU MLP with hidden dimension 1792
parameters: {"layers":11,"hidden":1792}
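As a rough illustration, one SwiGLU layer computes down(SiLU(x·W_gate) ⊙ (x·W_up)); the sketch below is a hypothetical numpy version (weight shapes and initialization scale are assumptions), with the PR's hidden dimension of 1792 and its model stacking 11 such layers.

```python
import numpy as np

def swiglu_layer(x, W_gate, W_up, W_down):
    """One SwiGLU MLP layer: down(silu(x @ W_gate) * (x @ W_up)).

    Hypothetical sketch; the PR's actual model stacks 11 such
    layers with hidden dimension 1792.
    """
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU (swish) activation
    return (silu * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, hidden = 64, 1792   # d_model is a placeholder
x = rng.standard_normal((4, d_model))
W_gate = rng.standard_normal((d_model, hidden)) * 0.02
W_up = rng.standard_normal((d_model, hidden)) * 0.02
W_down = rng.standard_normal((hidden, d_model)) * 0.02
y = swiglu_layer(x, W_gate, W_up, W_down)  # shape preserved: (4, d_model)
```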
U-Net gated skips
U-Net skip connections with learned sigmoid gating
parameters: null
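A learned sigmoid gate on a U-Net skip can be sketched as below; the gate logits here are a hypothetical per-channel parameter vector (the PR does not specify the gate's granularity). At zero-initialized logits, the gate passes half of the encoder signal through.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(decoder_h, encoder_h, gate_logits):
    """U-Net-style skip connection with a learned sigmoid gate:
    out = decoder + sigmoid(g) * encoder.

    gate_logits would be trained parameters; per-channel here
    is an assumption.
    """
    return decoder_h + sigmoid(gate_logits) * encoder_h

h_dec = np.ones((2, 8))
h_enc = np.full((2, 8), 2.0)
out = gated_skip(h_dec, h_enc, np.zeros(8))  # gate = 0.5 everywhere
```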
BigramHash
Bigram hashing with 8192 buckets and 128 dimension embeddings
parameters: {"buckets":8192,"dimensions":128}
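Bigram hashing maps each (previous, current) token pair into a fixed bucket table of embeddings. A minimal sketch with the PR's 8192 buckets and 128-dim embeddings; the multiplicative hash constant is an arbitrary choice, not the PR's actual hash function.

```python
import numpy as np

N_BUCKETS, EMB_DIM = 8192, 128

def bigram_bucket(prev_tok, cur_tok, n_buckets=N_BUCKETS):
    """Hash a (prev, cur) token pair to a bucket index.

    The odd multiplier is a generic mixing constant; the hash
    used by the PR is not specified here.
    """
    return (prev_tok * 2654435761 + cur_tok) % n_buckets

table = np.zeros((N_BUCKETS, EMB_DIM))  # learned embedding table
tokens = [5, 17, 17, 42]
# Embed each bigram (tokens[i-1], tokens[i]) for i >= 1.
ids = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
embs = table[ids]
```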
SmearGate
SmearGate mechanism included
parameters: null
Partial RoPE
Partial Rotary Positional Embeddings applied to 16 dims
parameters: {"dimensions":16}
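Partial RoPE rotates only the first few dimensions of each head vector and leaves the rest position-independent. A sketch with the PR's 16 rotated dims; the pairing layout (first half vs. second half) and frequency base are common conventions, not confirmed details of this PR.

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply RoPE to the first rot_dims dimensions of a head
    vector; pass the remaining dims through unchanged."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2, rest = x[:half], x[half:rot_dims], x[rot_dims:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos])
    return np.concatenate([rotated, rest])

v = np.ones(32)
out0 = partial_rope(v, pos=0)   # position 0: rotation is the identity
```

Rotations preserve the norm of each rotated pair, so the rotated slice keeps its length at every position.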
LN Scale
LayerNorm scale applied as 1/sqrt(layer+1)
parameters: null
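The depth-dependent scale is a one-liner; as stated, layer i's LayerNorm output is scaled by 1/sqrt(i+1), so deeper layers contribute progressively less at initialization.

```python
import math

def ln_scale(layer_idx):
    """LayerNorm output scale for layer layer_idx: 1/sqrt(layer+1)."""
    return 1.0 / math.sqrt(layer_idx + 1)

# Scales for the 11-layer stack described above.
scales = [ln_scale(i) for i in range(11)]
```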
Weight Averaging
EMA
parameters: {"decay":0.9985}
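EMA weight averaging keeps a shadow copy of the parameters updated as ema ← decay·ema + (1−decay)·current. A minimal sketch using the PR's decay of 0.9985; the dict-of-arrays parameter layout is an assumption.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.9985):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}

# Toy run: EMA starting at 0 tracking a constant parameter of 1.
ema = {"w": np.zeros(3)}
live = {"w": np.ones(3)}
for _ in range(1000):
    ema = ema_update(ema, live)
# After n steps the EMA equals 1 - decay**n exactly.
```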
Quantization
Int6 QAT
bits: 6
scope: null
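A generic int6 QAT forward pass fake-quantizes weights to 6-bit levels while keeping float values; symmetric per-tensor scaling below is an assumption (the scope field above is unspecified), and a real QAT setup would pair this with a straight-through estimator in the backward pass.

```python
import numpy as np

def fake_quant_int6(w):
    """Fake quantization to signed 6-bit levels (symmetric,
    per-tensor scale): quantize to the int6 grid, then dequantize."""
    qmax = 2 ** (6 - 1) - 1                     # 31 for signed int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant_int6(w)   # at most 64 distinct levels, error <= scale/2
```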
Compression
zstd
level: 22
Test-Time Training
full TTT
parameters: {"epochs":30,"lr_schedule":"cosine decay"}
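The TTT schedule is a standard cosine decay stretched over 30 epochs. A sketch of the per-epoch learning rate; base_lr and min_lr are placeholders, as the PR's actual TTT learning rate is not listed here.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# One LR value per TTT epoch, for the PR's 30-epoch run.
lrs = [cosine_lr(s, 30) for s in range(31)]
```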
Evaluation
sliding window eval
parameters: {"stride":64}
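Sliding-window evaluation advances a fixed context window by the stride and scores only the newly revealed tokens, so every token is scored exactly once with near-full context. The helper below enumerates the spans; the window size of 512 is a placeholder, since the PR only pins the stride (64).

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Yield (start, end, n_scored) spans for sliding-window eval.

    The first window scores all its tokens; each later window
    scores only its last `stride` tokens (fewer for the final,
    possibly partial window).
    """
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        scored = end - start if start == 0 else end - (start + window - stride)
        spans.append((start, end, scored))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(1000, window=512, stride=64)
# Scored-token counts over all spans sum to n_tokens.
```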
LR Schedule
cosine decay
parameters: null

Novel Contributions

  • Extends PR #462's SwiGLU + U-Net architecture by running 30 epochs of cosine learning-rate decay during test-time training (TTT) instead of the default 10
  • Improves val_bpb from 1.2531 to 1.1175 (-10.8%) on 1xH100 by increasing TTT epochs
  • Confirms consistency with prior PRs (#481 and #486) on the benefits of cosine TTT scheduling and longer TTT runs
  • Provides timing estimates and plans for 8xH100 verification and for tuning TTT epochs against the quality/time tradeoff