PR #722

open

parameter golf submission - Julius

by magicjulio
val_bpb: 0.5588
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,302,060 bytes

Training Techniques

Architecture
LeakyReLU
Replaced ReLU(x)^2 with LeakyReLU(x, 0.5)^2 in all MLP blocks: the nonzero negative slope avoids dead neurons, while squaring keeps the outputs non-negative as in the original activation.
parameters: {"negative_slope":0.5}
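A minimal sketch of the replacement activation (scalar pure Python for illustration; the model presumably applies it elementwise as a tensor op):

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU(x, slope)^2.

    For x >= 0 this equals x^2, identical to ReLU(x)^2; for x < 0 it is
    (slope * x)^2, which is still non-negative but has a nonzero gradient
    almost everywhere, so negative-input neurons never go fully dead.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```

For example, `leaky_relu_sq(-2.0)` gives `(0.5 * -2.0)^2 = 1.0`, where plain `ReLU(x)^2` would give `0.0` and a zero gradient.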
BigramHash
Expanded BigramHashEmbedding capacity to reduce hash collisions.
parameters: {"buckets":3072}
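Raising the bucket count (2048 to 3072 per the contributions list below) lowers the chance that two distinct bigrams share an embedding row. A hypothetical sketch of the bucketing step; the hash constants and function name are assumptions, not the submission's actual code:

```python
def bigram_bucket(prev_token: int, token: int, buckets: int = 3072) -> int:
    """Map a (previous, current) token pair to one of `buckets` rows.

    Distinct bigrams that land in the same bucket share an embedding
    (a collision); more buckets means fewer collisions at the cost of
    a larger embedding table.
    """
    # Mix the pair with a multiplicative hash (constants illustrative).
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    return h % buckets
```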
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
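A common warmdown shape holds the learning rate constant and then decays it linearly to zero over the final `warmdown_steps`; the sketch below assumes that shape, though the submission may use a different one:

```python
def lr_at_step(step: int, total_steps: int, base_lr: float,
               warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay to 0 over the last `warmdown_steps`."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac_left = (total_steps - step) / warmdown_steps
    return base_lr * frac_left
```

Extending the warmdown (3000 to 3500 iterations here) starts the decay earlier, giving the model a longer stretch of small, fine-tuning-like updates before the end of training.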
Test-Time Training
LoRA TTT
parameters: {"epochs":8}
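Per the contributions list, the LoRA adapters are trained on the K-projection at test time. The core mechanism is a low-rank additive update to a frozen weight; a pure-Python sketch (shapes, scaling, and names are illustrative assumptions):

```python
def lora_apply(W, A, B, alpha: float = 1.0):
    """Effective weight W + alpha * (A @ B) for a LoRA adapter.

    W: d_out x d_in frozen base weight; A: d_out x r and B: r x d_in
    are the low-rank factors trained during test-time training, so
    only r * (d_out + d_in) parameters are updated per layer.
    """
    d_out, d_in, r = len(W), len(W[0]), len(B)
    out = [row[:] for row in W]  # copy the frozen base
    for i in range(d_out):
        for j in range(d_in):
            out[i][j] += alpha * sum(A[i][k] * B[k][j] for k in range(r))
    return out
```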
Weight Averaging
SWA
parameters: null
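SWA keeps a running per-parameter average of checkpoints sampled along the training trajectory. A minimal sketch with scalar parameters in a dict (real implementations average tensors the same way):

```python
def swa_update(avg_weights: dict, new_weights: dict, n_averaged: int) -> dict:
    """Fold a new checkpoint into the running SWA average.

    Per parameter: avg <- (avg * n + new) / (n + 1), so after k updates
    each checkpoint contributes equally to the average.
    """
    return {
        name: (avg_weights[name] * n_averaged + w) / (n_averaged + 1)
        for name, w in new_weights.items()
    }
```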
Quantization
int6
bits: 6
scope: all
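One common convention for 6-bit quantization is symmetric per-tensor rounding into the range [-31, 31] with a single float scale; a sketch under that assumption (the submission's exact scheme is not specified):

```python
def quantize_int6(values):
    """Symmetric int6 quantization: codes in [-31, 31] plus one scale.

    Dequantize with code * scale; the reconstruction error is bounded
    by half a quantization step, scale / 2.
    """
    qmax = 2 ** (6 - 1) - 1  # 31
    amax = max(abs(v) for v in values) or 1.0  # guard all-zero input
    scale = amax / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]
```

At 6 bits per weight (before zlib), this is what brings the artifact under the reported 15,302,060 bytes.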
Compression
zlib
level: null
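The metadata leaves the zlib level unspecified, so the sketch below just shows the lossless round-trip on serialized weight bytes, with the level left as an explicit argument:

```python
import zlib

def compress_artifact(raw: bytes, level: int = -1) -> bytes:
    """Compress serialized weights with zlib (level -1 = library default).

    zlib is lossless, so zlib.decompress(compressed) recovers the exact
    quantized weight bytes; only the quantization step loses precision.
    """
    return zlib.compress(raw, level)
```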

Novel Contributions

  • LeakyReLU(0.5)^2 activation replacement in MLP blocks
  • Increased training and TTT context length from 1024 to 2048
  • Expanded BigramHashEmbedding capacity from 2048 to 3072 buckets
  • Extended warmdown schedule from 3000 to 3500 iterations
  • LoRA-based test-time training with K-projection and Min-NLL epoch selection
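Min-NLL epoch selection in the last bullet presumably means scoring the adapted model after each of the 8 TTT epochs and keeping the epoch with the lowest negative log-likelihood. A hypothetical sketch of that selection step:

```python
def select_min_nll_epoch(nll_per_epoch):
    """Return (best_epoch, best_nll) over per-epoch NLL scores.

    Epochs are 1-indexed to match the epochs: 8 setting above; ties
    go to the earliest epoch.
    """
    best = min(range(len(nll_per_epoch)), key=nll_per_epoch.__getitem__)
    return best + 1, nll_per_epoch[best]
```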