PR #672 (open)

Record: 30ep Cosine TTT on LeakyReLU² stack (3-seed mean val_bpb=1.0781)

by andrewbaggio1
val_bpb: 1.0781
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.62 MB

Training Techniques

Architecture
LeakyReLU² stack
11-layer Transformer stack using LeakyReLU(0.5) squared MLPs with several custom architectural components.
parameters: {"layers":11,"d_model":512,"gqa_heads":"8/4","mlp_multiplier":3,"bigram_hash":2048,"partial_rope_dims":16}
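A minimal sketch of the squared-LeakyReLU MLP named above, with widths taken from the listed parameters (d_model=512, mlp_multiplier=3); the weight scales and batch size are illustrative, not from the PR:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) followed by squaring, per the architecture description.
    a = np.where(x >= 0, x, slope * x)
    return a * a

def mlp(x, W_fc, W_proj):
    # d_model=512, mlp_multiplier=3 -> hidden width 1536 (from the listed parameters)
    return leaky_relu_sq(x @ W_fc) @ W_proj

rng = np.random.default_rng(0)
d, h = 512, 512 * 3
x = rng.standard_normal((4, d))
W_fc = rng.standard_normal((d, h)) * 0.02    # illustrative init scale
W_proj = rng.standard_normal((h, d)) * 0.02
y = mlp(x, W_fc, W_proj)
print(y.shape)  # (4, 512)
```

Note that squaring makes the activation nonnegative; negative inputs still carry gradient signal through the 0.5 slope.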
BigramHash
Bigram hashing component used in the model.
parameters: {"size":2048}
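One plausible reading of a size-2048 bigram hash: map each adjacent token pair into a 2048-entry learned embedding table. The hash function, mixing constant, and how the lookup is combined with token embeddings are all assumptions; the PR only specifies the table size:

```python
import numpy as np

def bigram_hash(prev_ids, cur_ids, table_size=2048):
    # Hash each (prev, cur) token pair into one of `table_size` buckets.
    # The mixing constant 1000003 is illustrative; the PR does not give the hash.
    return (prev_ids * 1000003 + cur_ids) % table_size

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((2048, 512)) * 0.02  # learned embeddings (sketch)

tokens = np.array([5, 17, 17, 99])
prev = np.concatenate([[0], tokens[:-1]])        # shift right; bucket 0 stands in for BOS
extra = bigram_table[bigram_hash(prev, tokens)]  # (seq, d_model), e.g. added to token embeddings
print(extra.shape)
```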
SmearGate
Custom gating mechanism included in the architecture.
parameters: null
XSA4
Custom attention-like architectural component.
parameters: null
Partial RoPE
Rotary positional embeddings applied only to part of the representation.
parameters: {"dimensions":16}
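A sketch of partial RoPE as described: rotate only the first 16 dimensions of each head and leave the rest position-independent. The head dimension (64) and frequency base are assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (seq, head_dim). Apply rotary embeddings to the first `rot_dims`
    # dimensions only; the remaining dimensions pass through unchanged.
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.ones((8, 64))
y = partial_rope(x)
print(y.shape)  # (8, 64)
```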
KV GQA
Grouped-query attention with reduced KV heads.
parameters: {"heads":"8/4"}
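With 8 query heads and 4 KV heads, each KV head serves two query heads, halving the KV cache. A minimal score computation (sequence length and head dimension are illustrative):

```python
import numpy as np

def gqa_scores(q, k, q_heads=8, kv_heads=4):
    # q: (q_heads, seq, hd); k: (kv_heads, seq, hd).
    # Each KV head is shared by q_heads // kv_heads query heads (here 2).
    k_rep = np.repeat(k, q_heads // kv_heads, axis=0)  # (q_heads, seq, hd)
    hd = q.shape[-1]
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(hd)  # (q_heads, seq, seq)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 64))
k = rng.standard_normal((4, 16, 64))
print(gqa_scores(q, k).shape)  # (8, 16, 16)
```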
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
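The EMA update with the listed decay of 0.997 is straightforward; a sketch over a flat parameter dict (the evaluated/quantized artifact would use the EMA copy, which is an assumption about this pipeline):

```python
def ema_update(ema_params, params, decay=0.997):
    # Exponential moving average of model weights (decay from the PR).
    return {k: decay * ema_params[k] + (1 - decay) * params[k] for k in params}

ema = {"w": 1.0}
for step in range(3):
    ema = ema_update(ema, {"w": 0.0})
print(ema["w"])  # 0.997 ** 3
```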
Quantization
int6
bits: 6
scope: model weights
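A sketch of symmetric 6-bit weight quantization. Per-tensor scaling is an assumption; the PR specifies only the bit width and scope:

```python
import numpy as np

def quant_int6(w):
    # Symmetric per-tensor 6-bit quantization: integer levels in [-31, 31].
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # 6-bit values in int8 storage
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quant_int6(w)
err = np.abs(dequant(q, s) - w).max()
print(q.min(), q.max(), err <= s / 2 + 1e-6)
```

Packed at 6 bits per weight plus zstd on top, this is consistent with the sub-16 MB artifact size.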
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
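Strided sliding-window evaluation scores each token with near-full left context by sliding the window 64 tokens at a time and scoring only the newly covered tokens. The context length of 512 is an assumption; the PR specifies only the stride:

```python
def sliding_windows(n_tokens, ctx=512, stride=64):
    # Returns (begin, end, n_scored) triples: each window covers [begin, end)
    # and scores only the tokens not covered by the previous window.
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

ws = sliding_windows(1000)
print(ws[0], ws[-1])  # every token is scored exactly once
```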
Test-Time Training
full TTT
parameters: {"epochs":30,"optimizer":"AdamW","learning_rate":0.0005,"lr_schedule":"cosine decay","per_layer_lr_groups":{"mlp.proj":3,"mlp.fc":0.5}}
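The listed TTT schedule combines cosine decay from lr=5e-4 over 30 epochs with per-layer multipliers (3x for `mlp.proj`, 0.5x for `mlp.fc`). A sketch of the schedule logic, independent of any training framework:

```python
import math

def ttt_lr(step, total_steps, base_lr=5e-4):
    # Cosine decay from base_lr to 0 over the TTT run (30 epochs in the PR).
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

def group_lr(param_name, lr, groups={"mlp.proj": 3.0, "mlp.fc": 0.5}):
    # Per-layer multipliers from the PR's per_layer_lr_groups; matching by
    # name substring is an assumption about how groups are assigned.
    for prefix, mult in groups.items():
        if prefix in param_name:
            return lr * mult
    return lr

lr = ttt_lr(step=0, total_steps=30)
print(group_lr("block3.mlp.proj.weight", lr))  # 3x the base lr at step 0
```

In a real run these values would feed AdamW parameter groups, one group per multiplier.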
Initialization
OrthoInit
Orthogonal initialization.
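Orthogonal initialization is commonly done via QR decomposition of a Gaussian matrix; a sketch for a square weight (gain and seed are illustrative, and the PR gives no parameters for OrthoInit):

```python
import numpy as np

def ortho_init(shape, gain=1.0, seed=0):
    # Orthogonal init: QR of a Gaussian matrix; the sign fix on R's diagonal
    # makes the distribution uniform over orthogonal matrices.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # column-wise sign correction
    return gain * q

W = ortho_init((512, 512))
print(W.shape)  # columns are orthonormal
```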
LR Schedule
cosine decay
parameters: {"phase":"TTT","epochs":30}
Regularization
layerwise LN scale
parameters: null

Novel Contributions

  • Increased TTT epochs to 30 while keeping the architecture identical to PR #518
  • Achieved a 3-seed mean validation BPB of 1.0781
  • Used cosine-decayed test-time training with per-layer learning-rate groups
  • Maintained artifact size under 16 MB