PR #661

open

Non-record: 30ep Cosine TTT on SwiGLU + U-Net (1xH100, val_bpb=1.1175)

by andrewbaggio1
val_bpb: 1.1175
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 7.5 MB

Training Techniques

Test-Time Training
full TTT
parameters: {"epochs":30,"schedule":"cosine","seed":1337}
LR Schedule
cosine decay
parameters: {"ttt_epochs":30}
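A minimal sketch of the cosine TTT learning-rate schedule. Only the cosine shape and epochs=30 come from this PR; base_lr and min_lr are illustrative placeholders.

```python
import math

def cosine_ttt_lr(epoch, total_epochs=30, base_lr=1e-4, min_lr=0.0):
    # Cosine decay from base_lr at epoch 0 down to min_lr at the final epoch.
    # base_lr/min_lr are assumed values; the PR states only schedule and epochs.
    t = epoch / max(total_epochs - 1, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```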
Evaluation
sliding window eval
parameters: {"stride":64}
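How stride=64 sliding-window evaluation enumerates scoring spans, as a sketch: the first window scores every token it covers, and each subsequent window scores only its final stride tokens. The window size of 512 here is an assumption; only the stride comes from the PR.

```python
def sliding_eval_spans(seq_len, window=512, stride=64):
    # Returns (start, end, n_scored) triples; later windows score only their
    # last `stride` tokens so each token is scored exactly once.
    spans = []
    start = 0
    while start + window <= seq_len:
        scored = window if start == 0 else stride
        spans.append((start, start + window, scored))
        start += stride
    return spans
```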
Architecture
SwiGLU
SwiGLU MLP variant used in the model stack
parameters: {"hidden":1792}
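The SwiGLU MLP computes down(silu(x @ W_gate) * (x @ W_up)). A dependency-free sketch with nested-list weights; the PR's hidden size is 1792, but tiny dimensions suffice for illustration.

```python
import math

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU: a SiLU-activated gate elementwise-multiplies an ungated "up"
    # projection, then a "down" projection maps back to model dim.
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    def silu(z):
        return z / (1.0 + math.exp(-z))
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```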
U-Net
U-Net style gated skip connections
parameters: null
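One plausible form of a gated U-Net skip: decoder activations plus a sigmoid-gated copy of the matching encoder activations. The PR lists no parameters for this component, so the scalar gate parameterization below is an assumption.

```python
import math

def gated_skip(decoder_h, encoder_h, gate_logit):
    # gate_logit is a learned scalar (assumed form); sigmoid keeps the
    # skip weight in (0, 1) so the skip can be smoothly switched off.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [d + g * e for d, e in zip(decoder_h, encoder_h)]
```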
BigramHash
Bigram hashing component for token representation
parameters: {"buckets":8192}
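A sketch of bucketing a (previous, current) token-id pair for an auxiliary bigram embedding table. The bucket count of 8192 comes from the PR; the mixing constants in the hash are assumptions.

```python
def bigram_bucket(prev_token, token, n_buckets=8192):
    # Mix the pair with a multiplicative hash (constants are illustrative),
    # then reduce into one of n_buckets embedding slots.
    h = (prev_token * 1000003) ^ (token * 8191)
    return h % n_buckets
```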
SmearGate
Gating mechanism included in the architecture
parameters: null
Partial RoPE
Rotary positional embeddings applied to only a 16-dimension slice of each head
parameters: {"dimensions":16}
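Partial RoPE rotates only the first rot_dims entries of each head vector and passes the rest through unchanged. rot_dims=16 is from the PR; the half-split pairing layout and base=10000 are conventional assumptions.

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate dimension pairs (i, i + rot_dims//2) by a position-dependent
    # angle; dimensions beyond rot_dims are left untouched.
    half = rot_dims // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```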
KV head count
Uses 8 KV heads in the attention stack
parameters: {"kv_heads":8,"heads":8,"layers":11,"dim":512}
Weight Averaging
EMA
parameters: {"decay":0.9985}
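One EMA weight-averaging step with the PR's decay of 0.9985, sketched over flat parameter dicts for simplicity.

```python
def ema_update(ema_params, params, decay=0.9985):
    # Shadow weights track the live weights with exponential decay;
    # decay=0.9985 is the value given in the PR.
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in ema_params}
```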
Quantization
int6
bits: 6
scope: all
Compression
zstd
level: 22
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
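The layerwise LN scale follows directly from the stated formula, assuming a 0-based layer index:

```python
import math

def ln_scale_init(layer_index):
    # Per-layer LayerNorm scale of 1/sqrt(layer+1), as given in the PR's
    # regularization parameters; deeper layers get smaller initial scales.
    return 1.0 / math.sqrt(layer_index + 1)
```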

Novel Contributions

  • Increased TTT epochs from 10 to 30 while keeping the PR #462 architecture unchanged
  • Applied test-time training with a cosine schedule and per-layer learning rates
  • Used the SwiGLU + U-Net architecture with gated skip connections
  • Combined BigramHash, SmearGate, Partial RoPE, EMA, Late QAT, and Int6 + zstd compression
  • Reported an improved sliding-window val_bpb of 1.1175 on a single H100