PR #771
openRecord: AdamW TTT 30ep Cosine + Per-Layer LR (val_bpb: 1.0705)
by sunnypatneedi
val_bpb: 1.0705
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.8 MB
Training Techniques
Quantization
GPTQ-lite
parameters: {"bits":6,"scope":"all"}
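The exact GPTQ-lite procedure isn't spelled out in this record; as a minimal sketch, here is plain symmetric per-channel round-to-nearest quantization to 6 bits, which is the simpler baseline that GPTQ-style methods improve on (real GPTQ additionally uses second-order information to compensate rounding error):

```python
import numpy as np

def quantize_6bit(w: np.ndarray):
    """Symmetric per-row round-to-nearest quantization to 6 signed bits.

    A simplified stand-in for GPTQ-lite: no Hessian-aware error
    compensation, just scale, round, and clip to [-32, 31].
    """
    qmax = 2 ** (6 - 1) - 1                    # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct approximate weights from codes and per-row scales.
    return q.astype(np.float32) * scale
```

With `scope: all`, every weight matrix would pass through this path; the reconstruction error is bounded by half a quantization step per element.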
Architecture
MLP3x
3x expansion MLP with LeakyReLU(0.5)^2 activation in the base model
parameters: {"expansion":3}
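One plausible reading of "LeakyReLU(0.5)^2" is a LeakyReLU with negative slope 0.5 followed by squaring, analogous to the ReLU^2 activation used in some speedrun baselines; a sketch of the block under that assumption (function names are hypothetical, biases omitted):

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    # LeakyReLU with negative slope 0.5, then square -- one reading of
    # "LeakyReLU(0.5)^2"; note squaring makes the output non-negative.
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp3x(x: np.ndarray, w_fc: np.ndarray, w_proj: np.ndarray) -> np.ndarray:
    # 3x-expansion MLP: d_model -> 3*d_model -> d_model ("expansion": 3).
    return leaky_relu_sq(x @ w_fc) @ w_proj
```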
BigramHash
BigramHash component used in the base architecture
parameters: {"size":2048}
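The record doesn't specify the hash itself; a common construction is to hash each adjacent token pair into a fixed-size embedding table, here matching `size: 2048` (the multiplier and the position-0 sentinel are assumptions of this sketch):

```python
import numpy as np

TABLE_SIZE = 2048  # matches "size": 2048; the multiplier below is arbitrary

def bigram_hash_ids(tokens: np.ndarray) -> np.ndarray:
    # Map each (previous, current) token pair to a bucket in a 2048-slot
    # table; position 0 pairs with a sentinel "no previous token" id of 0.
    prev = np.concatenate(([0], tokens[:-1]))
    return (prev * 1000003 + tokens) % TABLE_SIZE
```

Each position's bucket id would then index an auxiliary embedding added to the token embedding, giving the model cheap access to bigram statistics.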
XSA
XSA applied in the last 4 layers
parameters: {"layers":4}
RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16,"total_dimensions":64}
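A sketch of the partial rotary scheme: only the first 16 of the 64 head dimensions are rotated, the rest pass through untouched (the 10000 frequency base is a standard default, not stated in the record):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16) -> np.ndarray:
    """Apply RoPE to the first `rot_dims` of each head dimension
    ("dimensions": 16 of "total_dimensions": 64). x: (seq_len, head_dim)."""
    seq, _ = x.shape
    half = rot_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq), freqs)          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

The rotation is norm-preserving on the rotated slice, and position 0 is left unchanged.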
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
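The layerwise LayerNorm scale is straightforward: initialize (or fix) each layer's LN gain so that deeper layers contribute progressively less to the residual stream. A minimal sketch of the stated rule:

```python
import math

def ln_gain_init(layer_index: int) -> float:
    # Layerwise LayerNorm scale ("scale": 1/sqrt(layer+1)): layer 0 keeps
    # gain 1.0, deeper layers start damped, curbing residual-stream growth.
    return 1.0 / math.sqrt(layer_index + 1)
```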
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50}
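Both averaging schemes are standard; a minimal sketch with the record's hyperparameters (whether EMA and SWA are combined or evaluated separately is not stated):

```python
import numpy as np

def ema_update(ema: np.ndarray, w: np.ndarray, decay: float = 0.997):
    # Exponential moving average of the weights ("decay": 0.997),
    # applied after every optimizer step.
    return decay * ema + (1.0 - decay) * w

class SWA:
    """Stochastic weight averaging: fold a snapshot into a running mean
    every 50 steps ("frequency": 50)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def maybe_update(self, step: int, w: np.ndarray, frequency: int = 50):
        if step % frequency == 0:
            self.n += 1
            if self.avg is None:
                self.avg = w.copy()
            else:
                self.avg += (w - self.avg) / self.n  # incremental mean
```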
Compression
zstd
parameters: {"level":22}
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"per_layer_lr":{"mlp.proj":0.0015,"mlp.fc":0.00025,"other":0.0005}}
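The per-layer rates in `other_params` can be realized as optimizer parameter groups keyed by parameter name; a sketch in the style of `torch.optim.AdamW`'s `params=[{"params": ..., "lr": ...}]` (the substring-matching rule is an assumption of this sketch):

```python
def per_layer_lr(param_name: str) -> float:
    # Per-layer learning rates from the record: boost the MLP output
    # projection, shrink the MLP input projection, default elsewhere.
    if "mlp.proj" in param_name:
        return 0.0015
    if "mlp.fc" in param_name:
        return 0.00025
    return 0.0005

def build_param_groups(named_params):
    # Bucket (name, param) pairs into one optimizer group per learning rate.
    groups = {}
    for name, p in named_params:
        groups.setdefault(per_layer_lr(name), []).append(p)
    return [{"params": ps, "lr": lr} for lr, ps in groups.items()]
```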
LR Schedule
cosine decay
parameters: {"epochs":30,"final_lr":0}
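The schedule itself is the standard cosine decay to zero ("final_lr": 0); a minimal sketch:

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 0.0005, final_lr: float = 0.0) -> float:
    # Cosine decay from base_lr to final_lr over the run; base_lr 5e-4
    # matches the TTT block below, final_lr 0 matches this schedule.
    t = min(step, total_steps) / total_steps
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * t))
```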
Test-Time Training
full TTT
parameters: {"learning_rate":0.0005,"epochs":30,"cosine":true,"per_layer_lr":true,"freeze_blocks":0,"batch_seqs":64,"max_steps":300}
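Putting the TTT hyperparameters together: batches of 64 sequences, 30 planned epochs, and a hard cap of 300 steps. A pure-Python sketch of the resulting step/LR plan (whether the cosine horizon is the planned run or the capped run is not stated; this sketch decays over the capped run so the LR reaches 0, and with "freeze_blocks": 0 no parameters would be excluded from updates):

```python
import math

def ttt_schedule(n_seqs: int, batch_seqs: int = 64, epochs: int = 30,
                 max_steps: int = 300, base_lr: float = 0.0005):
    # One learning rate per TTT optimizer step: cosine decay from base_lr
    # to 0, truncated at max_steps ("max_steps": 300).
    steps_per_epoch = math.ceil(n_seqs / batch_seqs)
    total = min(epochs * steps_per_epoch, max_steps)
    horizon = max(total - 1, 1)
    return [0.5 * base_lr * (1.0 + math.cos(math.pi * s / horizon))
            for s in range(total)]
```

Each step would run one AdamW update on the full (unfrozen) model with the corresponding rate, scaled per layer as in the Optimizer block.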
Evaluation
sliding window eval
parameters: {"stride":64}
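Sliding-window evaluation with stride 64 means each window scores only its last 64 tokens, so every token is predicted with close-to-full left context. A sketch of the window bookkeeping (the 512-token context length is an assumption; only the stride comes from the record):

```python
def sliding_windows(n_tokens: int, ctx: int = 512, stride: int = 64):
    """Return (start, end, score_from) triples: evaluate tokens
    [start, end) but count loss only over [score_from, end)."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - ctx)   # window ends at pos + stride
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans
```

The scored regions tile the sequence exactly once, so summing their losses and dividing by total bytes yields the reported bits-per-byte.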
Novel Contributions
- Replaced weak 3-epoch SGD test-time training with AdamW-based TTT
- Used 30 epochs of cosine-decayed learning rate during TTT
- Applied per-layer learning rates, boosting mlp.proj (1.5e-3) and reducing mlp.fc (2.5e-4) relative to the 5e-4 default
- Unfroze all blocks during TTT
- Achieved a new record val_bpb of 1.0705 on the PR #549 base