PR #481
Record: Cosine TTT scheduling with per-layer lr — mean val_bpb=1.0970 (3 seeds)
Status: closed
by mrdavtan
val_bpb: 1.0970
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.4-15.8 MB
Training Techniques
Quantization
int6
bits: 6
scope: per-row, all weights
Compression
zstd
level: 22
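The quantization entry above can be sketched as symmetric per-row int6 quantization (one scale per weight row, values in [-31, 31]); the quantized artifact would then be zstd-compressed. This is a minimal illustration under my own naming, not the PR's code:

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric int6 quantization with one scale per row.

    int6 symmetric range is [-31, 31] (2**(6-1) - 1). Sketch of the
    technique named in the record, not the PR's implementation.
    """
    qmax = 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)
```

Per-row scales keep the rounding error of each element bounded by half a quantization step of its own row, which matters when row norms vary widely across a weight matrix.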
Architecture
Partial RoPE
Uses rotary positional embeddings on only part of the dimensions.
parameters: {"dimensions":"16/64"}
LN Scale
LayerNorm scaling modification.
parameters: null
SmearGate
Custom gating mechanism in the community stack.
parameters: null
BigramHash
Bigram hashing component with 2048 buckets.
parameters: {"buckets":2048}
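The record only states "bigram hashing with 2048 buckets"; one plausible sketch is hashing each (previous token, token) pair into a bucket that indexes an auxiliary embedding table. The mixing constants below are illustrative, not the PR's:

```python
def bigram_bucket(prev_tok, tok, buckets=2048):
    """Hash a token bigram into one of `buckets` buckets.

    Illustrative integer hash; any decent mixing function would do.
    """
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 2654435769) & 0xFFFFFFFF  # Knuth multiplicative mix
    return h % buckets

# A (2048, d_model) embedding table indexed by this bucket could then
# be added to the regular token embedding at each position.
ids = [bigram_bucket(a, b) for a, b in [(1, 2), (2, 1), (5, 5)]]
```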
MLP3x
Three-times wider/deeper MLP stack using relu-squared activations.
parameters: {"multiplier":3}
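A minimal sketch of the MLP3x entry, assuming the 3x multiplier applies to hidden width (the record says "wider/deeper") and omitting biases for brevity:

```python
import numpy as np

def relu_squared(x):
    """relu(x)**2 activation, as named in the record."""
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """MLP block with a 3x-wide hidden layer and relu^2 activation.

    w_in: (d, 3*d), w_out: (3*d, d). Illustrative shapes only.
    """
    return relu_squared(x @ w_in) @ w_out

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d))
y = mlp3x(x, rng.standard_normal((d, 3 * d)),
          rng.standard_normal((3 * d, d)))
```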
tied embeddings
Input and output embeddings are tied.
parameters: null
U-Net skips
Skip connections inspired by U-Net are used in the architecture.
parameters: null
Initialization
OrthoInit
Orthogonal initialization.
Weight Averaging
EMA
parameters: {"decay":0.997}
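The EMA entry with decay 0.997 is the standard exponential moving average of weights, applied per tensor. A scalar sketch:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.

    decay=0.997 matches the record; applied elementwise per tensor.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

avg = [0.0]
for _ in range(1000):
    avg = ema_update(avg, [1.0])
# after 1000 steps avg approaches 1 - 0.997**1000 (about 0.95)
```

With decay 0.997 the average has an effective window of roughly 1/(1-0.997) ≈ 333 steps.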
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"TTT":true,"learning_rate":0.0005}
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","learning_rate":0.0005,"epochs":30,"cosine_decay":true,"per_layer_lr":true,"freeze_blocks":0,"batch_seqs_per_gpu":64}
LR Schedule
cosine decay
parameters: {"epochs":30}
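The TTT and LR-schedule entries above combine into per-layer base learning rates decayed on a 30-epoch cosine. The cosine formula is standard; the `damage` weighting below is an illustrative stand-in for the PR's measured per-layer quantization damage:

```python
import math

def cosine_lr(base_lr, epoch, total_epochs=30):
    """Cosine decay from base_lr at epoch 0 to 0 at total_epochs."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

def per_layer_lrs(n_layers, base_lr=5e-4, damage=None):
    """Per-layer base LRs scaled by relative quantization damage.

    The record sets TTT lr groups from measured damage; the weighting
    here (proportional to damage, max layer gets base_lr) is a guess.
    """
    if damage is None:
        damage = [1.0] * n_layers
    m = max(damage)
    return [base_lr * d / m for d in damage]

lrs = per_layer_lrs(4, damage=[0.5, 1.0, 2.0, 4.0])
lr_mid = cosine_lr(5e-4, epoch=15, total_epochs=30)  # halfway: base_lr / 2
```

Giving heavily damaged layers (e.g. MLP output projections, per the analysis below) larger TTT learning rates lets them recover faster without over-updating layers the quantizer barely touched.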
Regularization
layerwise LN scale
parameters: null
Novel Contributions
- Cosine learning-rate decay for TTT over 30 epochs instead of a flat schedule
- Per-layer TTT learning-rate groups based on measured quantization damage
- Analysis showing MLP output projections have much higher quantization error than input projections
- Demonstration that TTT improves beyond merely repairing quantization damage
- Extensive negative-result exploration of alternative compression and architectural ideas