PR #518
Status: closed
Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622)
by sofiabod
val_bpb: 1.0622
Architecture: Transformer
Optimizer: AdamW
Artifact Size: —
Training Techniques
Architecture
XSA
Cross/self-attention variant applied to the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dimensions":16,"total_dimensions":64}
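A minimal NumPy sketch of partial RoPE as recorded above: rotation is applied to the first 16 of the 64 head dimensions and the rest pass through unchanged. The pairing convention (first-half/second-half pairs) and the base frequency of 10000 are assumptions, not taken from the PR.

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of the head dim.

    x: (seq, head_dim) array; positions: (seq,) integer positions.
    Dimensions beyond `rot_dims` are returned untouched.
    """
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * inv_freq[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]           # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

At position 0 the rotation is the identity, and the upper 48 dimensions are always passed through, which is easy to check directly.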
MLP3x
Transformer MLP widened to 3x
parameters: null
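A sketch of the 3x-wide MLP. The model width (`d_model = 64`, consistent with the RoPE entry's `total_dimensions`) and the init scale are illustrative assumptions; a plain ReLU stands in for the PR's activation.

```python
import numpy as np

d_model = 64
hidden = 3 * d_model          # widened to 3x d_model instead of the usual 4x
rng = np.random.default_rng(0)
W_fc = rng.normal(0, 0.02, size=(d_model, hidden))
W_proj = rng.normal(0, 0.02, size=(hidden, d_model))

def mlp(x):
    h = np.maximum(x @ W_fc, 0.0)   # placeholder activation for the sketch
    return h @ W_proj
```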
tied embeddings
Input and output embeddings are tied
parameters: {"vocab_size":1024}
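Tied embeddings mean one shared matrix serves as both the input lookup table and the output projection. A sketch with the recorded `vocab_size` of 1024 (the embedding width of 64 is an assumption):

```python
import numpy as np

vocab_size, d_model = 1024, 64
rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(vocab_size, d_model))  # single shared matrix

def embed(token_ids):
    return W[token_ids]          # input embedding: row lookup

def logits(hidden):
    return hidden @ W.T          # output head reuses the same weights
```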
BigramHash
Bigram hashing feature/module used in the model
parameters: {"hash_size":2048,"dimension":128}
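A sketch of how a bigram-hash feature with the recorded `hash_size` of 2048 and `dimension` of 128 could work: each (previous token, current token) pair is hashed into a bucket, and the bucket indexes a learned embedding table. The hash function and the sentinel for position 0 are assumptions.

```python
import numpy as np

HASH_SIZE, DIM = 2048, 128
rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, size=(HASH_SIZE, DIM))

def bigram_bucket(prev_tok, tok):
    # Simple multiplicative hash of the (prev, current) token pair.
    return (prev_tok * 0x9E3779B1 + tok) % HASH_SIZE

def bigram_features(tokens):
    # Position 0 has no predecessor; pair it with a sentinel id 0.
    prev = [0] + list(tokens[:-1])
    buckets = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[buckets]   # (seq, 128) features for the model
```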
SmearGate
Gating mechanism used in the architecture
parameters: null
OrthoInit
Orthogonal initialization used for some layers
parameters: null
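Orthogonal initialization is typically done by QR-decomposing a Gaussian matrix; a self-contained NumPy sketch (which layers the PR applies it to is not specified here):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal init via QR decomposition of a Gaussian matrix."""
    if rng is None:
        rng = np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))      # sign-fix makes the factorization unique
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]
```

The rows (or columns, whichever is shorter) of the result are orthonormal by construction.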
VE128
VE128 module applied to layers 9 and 10
parameters: {"layers":[9,10]}
U-Net skip connections
Skip connections added in a U-Net style
parameters: null
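One common reading of U-Net-style skips in a transformer stack: activations entering the first half of the layers are saved and added back at mirrored positions in the second half. A scalar toy sketch of that wiring (the exact pairing in the PR is not specified):

```python
def unet_forward(x, layers):
    """Run a layer stack with U-Net-style skips: inputs to the first
    half are stashed and added back at mirrored decoder positions."""
    half = len(layers) // 2
    saved = []
    for i, layer in enumerate(layers):
        if i < half:
            saved.append(x)        # encoder half: stash this layer's input
        elif saved:
            x = x + saved.pop()    # decoder half: add the mirrored activation
        x = layer(x)
    return x
```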
LeakyReLU(0.5)²
Squared LeakyReLU activation replacing ReLU², so gradients still flow for negative pre-activations
parameters: {"negative_slope":0.5}
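Taking the name at face value, the activation is LeakyReLU with `negative_slope=0.5` followed by squaring, by analogy with ReLU² = relu(x)². A scalar sketch of that reading:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """Squared LeakyReLU: ReLU²'s hard zero for negative inputs is
    replaced by a scaled linear branch, so the gradient is nonzero
    when x < 0."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

For x = -2 this gives (0.5 · -2)² = 1, with a nonzero derivative, whereas ReLU² would output 0 with a zero gradient.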
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"learning_rate":0.0005}
LR Schedule
cosine decay
parameters: {"epochs":50,"formula":"lr *= 0.5 * (1 + cos(pi * progress))"}
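Reading the recorded formula as a schedule over training progress (i.e. lr = base_lr · 0.5 · (1 + cos(π · progress)), rather than a compounding per-step update), with the optimizer's learning rate of 5e-4 as the base:

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-4):
    """Cosine decay from base_lr at step 0 to 0 at the final step:
    lr = base_lr * 0.5 * (1 + cos(pi * progress))."""
    progress = step / total_steps
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The schedule starts at 5e-4, passes through half the base LR at the midpoint, and reaches 0 at the end of the 50 epochs.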
Test-Time Training
full TTT
parameters: {"epochs":50,"learning_rate":0.0005,"weight_decay":0,"all_parameters_unfrozen":true,"per_layer_lr":{"mlp.proj":3,"mlp.fc":0.5},"grad_clip":1,"ddp_gradient_sync":true}
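A sketch of the per-layer learning-rate groups from the TTT config, reading the `per_layer_lr` values as multipliers on the base LR (an assumption; they could also be absolute rates). Parameters whose names match a configured substring get the scaled LR; everything else keeps the base.

```python
BASE_LR = 5e-4
PER_LAYER_LR = {"mlp.proj": 3, "mlp.fc": 0.5}   # multipliers from the TTT config

def lr_for(param_name, base_lr=BASE_LR, multipliers=PER_LAYER_LR):
    """Resolve the LR for one parameter by substring match on its name."""
    for key, mult in multipliers.items():
        if key in param_name:
            return base_lr * mult
    return base_lr

def param_groups(param_names):
    # Group parameter names by resolved LR (optimizer-style param groups).
    groups = {}
    for name in param_names:
        groups.setdefault(lr_for(name), []).append(name)
    return groups
```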
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"tight"}
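The EMA half of the weight averaging is the standard exponential moving average of parameters with the recorded decay of 0.997; a minimal sketch over a parameter dict (what "tight" SWA means here is not specified, so it is not sketched):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a dict of weights: ema = decay*ema + (1-decay)*new."""
    return {k: decay * ema_params[k] + (1 - decay) * params[k]
            for k in ema_params}
```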
Quantization
GPTQ-lite
bits: 6
scope: all
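"GPTQ-lite" is not specified here; as a stand-in, a plain round-to-nearest symmetric 6-bit quantizer shows what 6-bit weight storage looks like (GPTQ proper additionally corrects rounding error layer by layer, which this sketch does not do):

```python
import numpy as np

def quantize_6bit(w):
    """Symmetric round-to-nearest 6-bit quantization of a weight tensor."""
    qmax = 2 ** (6 - 1) - 1                  # 31
    qmin = -(2 ** (6 - 1))                   # -32
    scale = float(np.abs(w).max()) / qmax if np.any(w) else 1.0
    q = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the reconstruction error of each weight by half the scale.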
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
Initialization
OrthoInit
Orthogonal initialization
Novel Contributions
- LeakyReLU(0.5)² activation replacing ReLU²
- 50-epoch cosine test-time training with per-layer learning-rate groups
- Improved validation BPB to 1.0622, beating the prior best validated score
- Combination of full #414 frontier stack with the new activation and TTT recipe