PR #771
openRecord: AdamW TTT 30ep Cosine + Per-Layer LR (val_bpb: 1.0705)
by sunnypatneedi
val_bpb: 1.0705
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.8 MB
Training Techniques
Quantization
GPTQ-lite
parameters: {"bits":6,"scope":"all"}
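The exact GPTQ-lite procedure isn't spelled out in this record; as a minimal sketch, here is plain symmetric per-channel round-to-nearest quantization to 6 bits, which is the simpler baseline that GPTQ-style methods improve on (real GPTQ additionally uses second-order information to compensate rounding error):

```python
import numpy as np

def quantize_6bit(w: np.ndarray):
    """Symmetric per-row round-to-nearest quantization to 6 signed bits.

    A simplified stand-in for GPTQ-lite: no Hessian-aware error
    compensation, just scale, round, and clip to [-32, 31].
    """
    qmax = 2 ** (6 - 1) - 1                    # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct approximate weights from codes and per-row scales.
    return q.astype(np.float32) * scale
```

With `scope: all`, every weight matrix would pass through this path; the reconstruction error is bounded by half a quantization step per element.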
Architecture
MLP3x
3x expansion MLP with LeakyReLU(0.5)^2 activation in the base model
parameters: {"expansion":3}
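One plausible reading of "LeakyReLU(0.5)^2" is a LeakyReLU with negative slope 0.5 followed by squaring, analogous to the ReLU^2 activation used in some speedrun baselines; a sketch of the block under that assumption (function names are hypothetical, biases omitted):

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    # LeakyReLU with negative slope 0.5, then square -- one reading of
    # "LeakyReLU(0.5)^2"; note squaring makes the output non-negative.
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp3x(x: np.ndarray, w_fc: np.ndarray, w_proj: np.ndarray) -> np.ndarray:
    # 3x-expansion MLP: d_model -> 3*d_model -> d_model ("expansion": 3).
    return leaky_relu_sq(x @ w_fc) @ w_proj
```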
BigramHash
BigramHash component used in the base architecture
parameters: {"size":2048}
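The record doesn't specify the hash itself; a common construction is to hash each adjacent token pair into a fixed-size embedding table, here matching `size: 2048` (the multiplier and the position-0 sentinel are assumptions of this sketch):

```python
import numpy as np

TABLE_SIZE = 2048  # matches "size": 2048; the multiplier below is arbitrary

def bigram_hash_ids(tokens: np.ndarray) -> np.ndarray:
    # Map each (previous, current) token pair to a bucket in a 2048-slot
    # table; position 0 pairs with a sentinel "no previous token" id of 0.
    prev = np.concatenate(([0], tokens[:-1]))
    return (prev * 1000003 + tokens) % TABLE_SIZE
```

Each position's bucket id would then index an auxiliary embedding added to the token embedding, giving the model cheap access to bigram statistics.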
XSA
XSA applied in the last 4 layers
parameters: {"layers":4}
RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16,"total_dimensions":64}
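A sketch of the partial rotary scheme: only the first 16 of the 64 head dimensions are rotated, the rest pass through untouched (the 10000 frequency base is a standard default, not stated in the record):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16) -> np.ndarray:
    """Apply RoPE to the first `rot_dims` of each head dimension
    ("dimensions": 16 of "total_dimensions": 64). x: (seq_len, head_dim)."""
    seq, _ = x.shape
    half = rot_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq), freqs)          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

The rotation is norm-preserving on the rotated slice, and position 0 is left unchanged.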
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
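The layerwise LayerNorm scale is straightforward: initialize (or fix) each layer's LN gain so that deeper layers contribute progressively less to the residual stream. A minimal sketch of the stated rule:

```python
import math

def ln_gain_init(layer_index: int) -> float:
    # Layerwise LayerNorm scale ("scale": 1/sqrt(layer+1)): layer 0 keeps
    # gain 1.0, deeper layers start damped, curbing residual-stream growth.
    return 1.0 / math.sqrt(layer_index + 1)
```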
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50}
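Both averaging schemes are standard; a minimal sketch with the record's hyperparameters (whether EMA and SWA are combined or evaluated separately is not stated):

```python
import numpy as np

def ema_update(ema: np.ndarray, w: np.ndarray, decay: float = 0.997):
    # Exponential moving average of the weights ("decay": 0.997),
    # applied after every optimizer step.
    return decay * ema + (1.0 - decay) * w

class SWA:
    """Stochastic weight averaging: fold a snapshot into a running mean
    every 50 steps ("frequency": 50)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def maybe_update(self, step: int, w: np.ndarray, frequency: int = 50):
        if step % frequency == 0:
            self.n += 1
            if self.avg is None:
                self.avg = w.copy()
            else:
                self.avg += (w - self.avg) / self.n  # incremental mean
```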
Compression
zstd
parameters: {"level":22}
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"per_layer_lr":{"mlp.proj":0.0015,"mlp.fc":0.00025,"other":0.0005}}
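The per-layer rates in `other_params` can be realized as optimizer parameter groups keyed by parameter name; a sketch in the style of `torch.optim.AdamW`'s `params=[{"params": ..., "lr": ...}]` (the substring-matching rule is an assumption of this sketch):

```python
def per_layer_lr(param_name: str) -> float:
    # Per-layer learning rates from the record: boost the MLP output
    # projection, shrink the MLP input projection, default elsewhere.
    if "mlp.proj" in param_name:
        return 0.0015
    if "mlp.fc" in param_name:
        return 0.00025
    return 0.0005

def build_param_groups(named_params):
    # Bucket (name, param) pairs into one optimizer group per learning rate.
    groups = {}
    for name, p in named_params:
        groups.setdefault(per_layer_lr(name), []).append(p)
    return [{"params": ps, "lr": lr} for lr, ps in groups.items()]
```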
LR Schedule
cosine decay
parameters: {"epochs":30,"final_lr":0}
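The schedule itself is the standard cosine decay to zero ("final_lr": 0); a minimal sketch:

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 0.0005, final_lr: float = 0.0) -> float:
    # Cosine decay from base_lr to final_lr over the run; base_lr 5e-4
    # matches the TTT block below, final_lr 0 matches this schedule.
    t = min(step, total_steps) / total_steps
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * t))
```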
Test-Time Training
full TTT
parameters: {"learning_rate":0.0005,"epochs":30,"cosine":true,"per_layer_lr":true,"freeze_blocks":0,"batch_seqs":64,"max_steps":300}
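Putting the TTT hyperparameters together: batches of 64 sequences, 30 planned epochs, and a hard cap of 300 steps. A pure-Python sketch of the resulting step/LR plan (whether the cosine horizon is the planned run or the capped run is not stated; this sketch decays over the capped run so the LR reaches 0, and with "freeze_blocks": 0 no parameters would be excluded from updates):

```python
import math

def ttt_schedule(n_seqs: int, batch_seqs: int = 64, epochs: int = 30,
                 max_steps: int = 300, base_lr: float = 0.0005):
    # One learning rate per TTT optimizer step: cosine decay from base_lr
    # to 0, truncated at max_steps ("max_steps": 300).
    steps_per_epoch = math.ceil(n_seqs / batch_seqs)
    total = min(epochs * steps_per_epoch, max_steps)
    horizon = max(total - 1, 1)
    return [0.5 * base_lr * (1.0 + math.cos(math.pi * s / horizon))
            for s in range(total)]
```

Each step would run one AdamW update on the full (unfrozen) model with the corresponding rate, scaled per layer as in the Optimizer block.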
Evaluation
sliding window eval
parameters: {"stride":64}
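Sliding-window evaluation with stride 64 means each window scores only its last 64 tokens, so every token is predicted with close-to-full left context. A sketch of the window bookkeeping (the 512-token context length is an assumption; only the stride comes from the record):

```python
def sliding_windows(n_tokens: int, ctx: int = 512, stride: int = 64):
    """Return (start, end, score_from) triples: evaluate tokens
    [start, end) but count loss only over [score_from, end)."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - ctx)   # window ends at pos + stride
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans
```

The scored regions tile the sequence exactly once, so summing their losses and dividing by total bytes yields the reported bits-per-byte.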
Novel Contributions
- Replaced weak 3-epoch SGD test-time training with AdamW-based TTT
- Used 30 epochs of cosine-decayed learning rate during TTT
- Applied per-layer learning rates, boosting mlp.proj (1.5e-3) and reducing mlp.fc (2.5e-4) relative to the 5e-4 default
- Unfroze all blocks during TTT
- Achieved a new record val_bpb of 1.0705 on the PR #549 base