PR #1057
open11L MLP2x + LeakyReLU² + Legal TTT (val_bpb=1.2201, 3-seed mean, std=0.0015)
by Programmerryoki
val_bpb: 1.2201
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.0 MB
Training Techniques
Architecture
LeakyReLU
The 2x-expansion MLP block uses a squared LeakyReLU(0.5) activation.
parameters: {"mlp_mult":2,"negative_slope":0.5,"squared":true}
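A minimal sketch of the activation, assuming "LeakyReLU(0.5) squared" means squaring the LeakyReLU output (so negatives keep gradient signal before squaring); the function name and scalar formulation are illustrative, not the PR's actual code:

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Squared LeakyReLU: apply LeakyReLU, then square.

    Output is always non-negative; negative inputs are scaled by the
    slope first, so they still carry gradient (unlike plain ReLU^2).
    In the PR this would sit inside the 2x MLP (d_model -> 2*d_model -> d_model).
    """
    y = x if x > 0 else negative_slope * x
    return y * y
```

Compared with the ReLU² activation used elsewhere, the leaky slope keeps the pre-square branch nonzero for negative inputs.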
BigramHash
Bigram hash embedding with 4096 buckets.
parameters: {"buckets":4096}
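A hedged sketch of the bucket lookup, assuming the bigram hash maps each (previous token, current token) pair to one of 4096 embedding rows that augment the token embedding; the mixing constants and function name are illustrative, not the PR's actual hash:

```python
def bigram_bucket(prev_token: int, token: int, num_buckets: int = 4096) -> int:
    """Hash a (prev, current) token pair into a fixed number of buckets.

    The resulting index selects one of `num_buckets` learned embedding
    rows (a cheap stand-in for a full V*V bigram table).
    """
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF  # illustrative mixing
    h ^= h >> 16
    return h % num_buckets
```

Collisions are expected and acceptable: 4096 buckets is far smaller than the number of distinct bigrams, and the table adds only num_buckets × d_model parameters.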
SmearGate
SmearGate enabled in the architecture.
parameters: null
U-Net skip connections
U-Net-style skip connections enabled.
parameters: null
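A minimal sketch of U-Net-style pairing in a block stack, assuming activations from the first half of the layers are merged (here by plain addition) into the mirrored layers of the second half; the merge rule and function name are assumptions, not the PR's implementation:

```python
def unet_forward(x, blocks):
    """Run `blocks` with U-Net pairing: the output of block i in the
    first half is skip-connected into block (n-1-i) in the second half.
    `x` is a list of floats standing in for an activation tensor."""
    n = len(blocks)
    saved = []
    for i, block in enumerate(blocks):
        if i >= n // 2 and saved:
            skip = saved.pop()                      # matching early activation
            x = [a + b for a, b in zip(x, skip)]    # assumed merge: plain add
        x = block(x)
        if i < n // 2:
            saved.append(x)
    return x
```

With the PR's 11 layers, the first 5 activations would be saved and consumed by layers 6-10, leaving the middle layer unpaired.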
XSA
XSA applied in the last 4 layers.
parameters: {"layers":4}
weight tying
Input and output embeddings are tied.
parameters: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
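The card only gives the formula; a direct reading, with where the scale is applied being an assumption:

```python
import math

def ln_scale(layer_index: int) -> float:
    """Per-layer LayerNorm scale 1/sqrt(layer+1): layer 0 -> 1.0,
    layer 3 -> 0.5, damping the contribution of deeper layers."""
    return 1.0 / math.sqrt(layer_index + 1)
```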
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Weight Averaging
EMA
parameters: {"decay":0.997}
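A minimal EMA step over a parameter dict, assuming the standard formulation avg ← decay·avg + (1−decay)·current with the PR's decay of 0.997; the dict representation is illustrative:

```python
def ema_update(avg: dict, current: dict, decay: float = 0.997) -> dict:
    """One exponential-moving-average step over named parameters.

    The averaged weights, not the live training weights, are what get
    exported for evaluation.
    """
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}
```

With decay 0.997 the average has an effective horizon of roughly 1/(1−0.997) ≈ 333 steps.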
Quantization
STE QAT
bits: 6
scope: all
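A sketch of the forward half of 6-bit STE QAT, assuming symmetric per-tensor fake quantization; the clipping range and scalar form are illustrative. The straight-through part lives in the backward pass (gradients skip the rounding), which this forward-only sketch can only note in a comment:

```python
def fake_quantize(x: float, bits: int = 6, max_abs: float = 1.0) -> float:
    """Symmetric fake quantization to a (2^(bits-1) - 1)-level grid.

    During QAT the forward pass uses these snapped values, while the
    backward pass treats round() as identity (the straight-through
    estimator), so the weights learn to live on the 6-bit grid.
    """
    levels = (1 << (bits - 1)) - 1          # 31 positive levels for 6 bits
    step = max_abs / levels
    clipped = max(-max_abs, min(max_abs, x))
    return round(clipped / step) * step
```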
GPTQ-lite
bits: 6
scope: all
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
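A sketch of how stride-64 sliding-window evaluation could enumerate its spans, assuming each window re-reads up to 2048 tokens of context but only the final `stride` tokens of each non-initial window are newly scored; the tuple layout and function name are assumptions:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (start, end, score_from): tokens in [score_from, end) are
    scored by this window; tokens before score_from are context only."""
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else end - stride
        yield start, end, score_from
        if end == n_tokens:
            break
        start += stride
```

The small stride buys each scored token close to a full window of left context, at the cost of roughly window/stride forward passes per token position.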
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":7,"chunk_size":32768,"all_blocks_unfrozen":true}
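A toy scalar sketch of the score-first ordering that makes TTT "legal": each chunk is scored with the current weights before any gradient steps are taken on it, so no chunk's score ever depends on its own data. The one-parameter model and `loss_and_grad` callback are illustrative; only the SGD(momentum=0.9), lr=0.002, 7-epochs-per-chunk recipe comes from the card:

```python
def score_first_ttt(chunks, loss_and_grad, w, lr=0.002, mu=0.9, epochs=7):
    """Score-first test-time training over a stream of chunks.

    For each chunk: record its loss under the *current* weights, then
    run `epochs` SGD-with-momentum passes on that chunk before moving on.
    """
    scores, v = [], 0.0
    for chunk in chunks:
        loss, _ = loss_and_grad(w, chunk)
        scores.append(loss)                 # score BEFORE adapting
        for _ in range(epochs):             # then adapt on the same chunk
            _, g = loss_and_grad(w, chunk)
            v = mu * v + g
            w = w - lr * v
    return scores, w
```

In the PR this loop would run over 32768-token chunks with all blocks unfrozen; later chunks benefit from adaptation to earlier ones without any chunk seeing its own data before scoring.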
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"adamw":true,"lr":0.025}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
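A sketch of the schedule, assuming "warmdown" means a constant LR followed by a linear decay to zero over the final 3500 steps (the common modded-nanogpt convention); the base LR of 0.025 is taken from the optimizer section above, and the function name is illustrative:

```python
def warmdown_lr(step: int, total_steps: int,
                base_lr: float = 0.025, warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```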
Novel Contributions
- LeakyReLU(0.5) squared MLP activation
- Legal score-first TTT with 7 epochs per chunk
- Combination of BigramHash, SmearGate, U-Net skips, and XSA in a compact 11-layer model
- Int6 QAT plus GPTQ-lite compression to fit under the 16 MB artifact limit