PR #672
Record: 30ep Cosine TTT on LeakyReLU² stack (3-seed mean val_bpb=1.0781)
by andrewbaggio1
val_bpb: 1.0781
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.62 MB
Training Techniques
Architecture
LeakyReLU² stack
11-layer Transformer stack using LeakyReLU(0.5) squared MLPs with several custom architectural components.
parameters: {"layers":11,"d_model":512,"gqa_heads":"8/4","mlp_multiplier":3,"bigram_hash":2048,"partial_rope_dims":16}
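The squared-LeakyReLU MLP can be sketched as below. The record does not specify biases or normalization placement, so this is a minimal plain-Python illustration of the activation and the 3× hidden expansion only:

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) followed by squaring; squaring folds the negative
    # branch back to positive, giving a smooth, U-shaped nonlinearity.
    y = x if x > 0 else slope * x
    return y * y

def mlp_block(x, w_fc, w_proj):
    # fc expands d_model -> 3*d_model (mlp_multiplier=3), activation is
    # applied elementwise, then proj contracts back to d_model.
    # Biases and normalization are omitted for brevity.
    h = [leaky_relu_sq(sum(w * xi for w, xi in zip(row, x))) for row in w_fc]
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_proj]
```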
BigramHash
Bigram hashing component used in the model.
parameters: {"size":2048}
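A bigram-hash component typically maps each consecutive token pair into a fixed number of buckets that index an auxiliary embedding table. The actual hash function used in this PR is not specified; the mixing constants below are illustrative:

```python
BIGRAM_BUCKETS = 2048  # matches the "size": 2048 parameter

def bigram_bucket(prev_tok, cur_tok, buckets=BIGRAM_BUCKETS):
    # Hypothetical mixing hash: combine the token pair, scramble the
    # high bits down, then reduce modulo the bucket count.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```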
SmearGate
Custom gating mechanism included in the architecture.
parameters: null
XSA4
Custom attention-like architectural component.
parameters: null
Partial RoPE
Rotary positional embeddings applied only to part of the representation.
parameters: {"dimensions":16}
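Partial RoPE rotates only the first 16 dimensions of each head and leaves the remainder position-agnostic. A sketch, assuming the standard RoPE frequency schedule with base 10000 (the base is not stated in this record):

```python
import math

def partial_rope(vec, pos, rope_dims=16, base=10000.0):
    # Rotate only the first `rope_dims` dimensions in (even, odd) pairs;
    # dimensions beyond rope_dims are returned untouched.
    out = list(vec)
    for i in range(0, rope_dims, 2):
        theta = pos / (base ** (i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```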
KV GQA
Grouped-query attention with reduced KV heads.
parameters: {"heads":"8/4"}
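With an 8/4 head layout, consecutive query heads share a KV head, which halves the KV cache relative to full multi-head attention. The head-to-group mapping is the standard GQA scheme:

```python
N_Q_HEADS, N_KV_HEADS = 8, 4  # "8/4" from the record

def kv_head_for(q_head, n_q=N_Q_HEADS, n_kv=N_KV_HEADS):
    # Each block of n_q // n_kv consecutive query heads reads the same
    # K/V projection, so only n_kv K/V head states are cached.
    return q_head // (n_q // n_kv)
```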
Weight Averaging
EMA (exponential moving average of model weights)
parameters: {"decay":0.997}
SWA (stochastic weight averaging)
parameters: null
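The EMA update with decay 0.997 is the usual shadow-weight rule applied after each optimizer step (the SWA variant's averaging schedule is not given in this record):

```python
def ema_update(ema_params, params, decay=0.997):
    # ema <- decay * ema + (1 - decay) * current, applied elementwise
    # after every optimizer step; the EMA copy is used for evaluation.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```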
Quantization
int6
bits: 6
scope: model weights
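A plausible reading of int6 weight quantization is symmetric rounding into the signed 6-bit range [-32, 31] with a per-tensor scale; the record does not state whether scales are per-tensor or per-channel, so this sketch assumes per-tensor:

```python
def quantize_int6(weights):
    # Symmetric int6: pick the scale from the max magnitude so the
    # largest weight maps to +/-31, then round and clamp.
    scale = max(abs(w) for w in weights) / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    return [qi * scale for qi in q]
```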
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
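Sliding-window evaluation with stride 64 scores only the last 64 tokens of each context window, so every token is evaluated with near-maximal left context. The window length is not given in this record; 512 below is an assumption for illustration:

```python
def sliding_windows(n_tokens, window=512, stride=64):
    # Yield (ctx_start, score_start, score_end) spans: each window sees
    # up to `window` tokens of context but only the final `stride`
    # tokens contribute to the loss, so tokens are scored exactly once.
    spans = []
    pos = 0
    while pos < n_tokens:
        ctx_start = max(0, pos + stride - window)
        spans.append((ctx_start, pos, min(pos + stride, n_tokens)))
        pos += stride
    return spans
```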
Test-Time Training
full TTT
parameters: {"epochs":30,"optimizer":"AdamW","learning_rate":0.0005,"lr_schedule":"cosine decay","per_layer_lr_groups":{"mlp.proj":3,"mlp.fc":0.5}}
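The per-layer learning-rate groups combine with the cosine decay as a per-parameter schedule: each group's base LR is the global 5e-4 scaled by its multiplier, then decayed over the 30 TTT epochs. A sketch of the resulting LR lookup (decay to zero is assumed; a floor LR is not stated):

```python
import math

BASE_LR = 5e-4
GROUP_MULT = {"mlp.proj": 3.0, "mlp.fc": 0.5}  # per_layer_lr_groups
EPOCHS = 30

def lr_at(epoch, param_name):
    # Cosine decay from BASE_LR * mult down to 0 over the TTT epochs;
    # parameters outside the named groups use multiplier 1.0.
    mult = next((m for k, m in GROUP_MULT.items()
                 if param_name.startswith(k)), 1.0)
    return BASE_LR * mult * 0.5 * (1.0 + math.cos(math.pi * epoch / EPOCHS))
```

With an optimizer like AdamW this would be realized as one parameter group per multiplier, each with its own cosine schedule.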
Initialization
OrthoInit
Orthogonal initialization.
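Orthogonal initialization draws a random Gaussian matrix and orthonormalizes it, so the initial linear maps preserve norms. A small pure-Python sketch via Gram-Schmidt (library implementations typically use a QR decomposition instead):

```python
import random

def orthogonal_init(n, seed=0):
    # Gram-Schmidt on random Gaussian vectors -> orthonormal rows.
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:  # subtract projections onto earlier unit rows
            d = sum(a * b for a, b in zip(v, u))
            v = [a - d * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-6:  # skip (rare) near-degenerate draws
            rows.append([a / norm for a in v])
    return rows
```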
LR Schedule
cosine decay
parameters: {"phase":"TTT","epochs":30}
Regularization
layerwise LN scale
parameters: null
Novel Contributions
- Increased TTT epochs to 30 while keeping the architecture identical to PR #518
- Achieved a 3-seed mean validation BPB of 1.0781
- Used cosine-decayed test-time training with per-layer learning-rate groups
- Maintained artifact size under 16 MB