PR #1052 (closed)
Merge: Autoresearch/mar28 experiments on 4xH20

val_bpb: 1.1978
Architecture: Transformer
Optimizer: Muon
Artifact Size:
Training Techniques

Optimizer: Muon
  weight_decay: 0.04
  momentum: 0.99
  other_params: {"warmdown_schedule": true}
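The Muon settings above can be sketched as orthogonalized momentum SGD: accumulate a momentum buffer, approximately orthogonalize it with a Newton-Schulz iteration, and apply decoupled weight decay. This is a minimal numpy sketch; weight_decay=0.04 and momentum=0.99 come from the PR metadata, while the learning rate, the five-iteration count, and the quintic coefficients are assumptions (the coefficients follow the public Muon reference implementation).

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    Coefficients follow the public Muon reference implementation (assumed,
    not stated in this PR).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # Frobenius-normalize so spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X               # acts as f(s) = a*s + b*s^3 + c*s^5 on singular values
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon-style update: momentum, orthogonalize, decoupled weight decay."""
    momentum_buf = momentum * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    param = param * (1.0 - lr * weight_decay) - lr * update
    return param, momentum_buf
```

The lr value here is a placeholder; only the weight decay and momentum are recorded in the PR.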
Quantization: mixed int6
  bits: 6
  scope: artifact
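A plausible reading of "mixed int6" is symmetric per-tensor quantization to the 6-bit signed range, with some sensitive tensors left at higher precision; the sketch below covers only the 6-bit path, and storing codes in int8 (rather than bit-packing 6 bits per weight) is a simplification.

```python
import numpy as np

INT6_MAX = 31  # 6-bit signed two's-complement range is [-32, 31]

def quantize_int6(w):
    """Symmetric per-tensor quantization of float weights to int6 codes."""
    scale = np.abs(w).max() / INT6_MAX if w.size else 1.0
    q = np.clip(np.round(w / scale), -INT6_MAX - 1, INT6_MAX).astype(np.int8)
    return q, float(scale)

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes and a scale."""
    return q.astype(np.float32) * scale
```

With symmetric rounding the reconstruction error per weight is at most half a quantization step (scale / 2).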
Evaluation: sliding window eval
  parameters: {"stride": 64}
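Sliding-window evaluation with a stride of 64 typically scores each token exactly once while giving it up to window - stride tokens of left context. A sketch of the span bookkeeping; the window size is an assumption, since the PR records only the stride.

```python
def sliding_window_spans(n_tokens, window, stride=64):
    """Return (start, end, n_new) spans for sliding-window evaluation.

    Each window covers tokens [start, end); only the n_new trailing tokens
    are newly scored, so every token is scored exactly once.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        n_new = end - prev_end          # tokens not yet scored by earlier windows
        spans.append((start, end, n_new))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

For example, with 300 tokens and a (hypothetical) window of 128, the spans are (0, 128, 128), (64, 192, 64), (128, 256, 64), (192, 300, 44).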
Weight Averaging: EMA
  parameters: {"decay": [0.995, 0.997]}
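The EMA entry lists two decay values, which suggests two averaged copies of the weights are tracked in parallel; which copy feeds the final artifact is not stated here, so this sketch simply keeps both.

```python
import numpy as np

class EMAWeights:
    """Maintain one exponential-moving-average copy of the weights per decay."""

    def __init__(self, params, decays=(0.995, 0.997)):
        self.decays = decays
        self.shadows = [{k: v.copy() for k, v in params.items()} for _ in decays]

    def update(self, params):
        """Blend current weights into each shadow copy: s = d*s + (1-d)*w."""
        for d, shadow in zip(self.decays, self.shadows):
            for k, v in params.items():
                shadow[k] = d * shadow[k] + (1.0 - d) * v
```

With decay 0.995 the averaging horizon is roughly 1 / (1 - 0.995) = 200 update steps.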
Architecture
MLP width: expanded the MLP width from 3x to 3.5x
  parameters: {"from": 3, "to": 3.5}
LeakyReLU: LeakyReLU squared activation
  parameters: {"power": 2, "slope": 0.5}
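A sketch of the "LeakyReLU squared" activation with the PR's parameters (power=2, slope=0.5). Reading it as squaring the leaky output is an assumption; a sign-preserving variant (y * abs(y)) is another plausible interpretation.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """f(x) = leaky_relu(x; slope) ** 2.

    Note: plain squaring makes the negative branch non-negative; a
    sign-preserving variant (y * np.abs(y)) is another plausible reading.
    """
    y = np.where(x >= 0, x, slope * x)
    return y * y
```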
BigramHash: character bigram hash embeddings
  parameters: {"dimensions": 4096}
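A sketch of hashing character bigrams into a fixed-size embedding table. Reading "dimensions": 4096 as the number of hash buckets is an assumption (it could also mean the embedding width), and the mixing function below is a placeholder; the PR does not record the actual hash.

```python
N_BUCKETS = 4096  # "dimensions" from the PR, read here as the hash-table size

def bigram_bucket(prev_byte, cur_byte, n_buckets=N_BUCKETS):
    """Hash one character (byte) bigram to a bucket index in [0, n_buckets)."""
    h = (prev_byte * 31 + cur_byte) * 2654435761  # placeholder multiplicative mix
    return h % n_buckets

def bigram_indices(text, n_buckets=N_BUCKETS):
    """Map a string to one embedding-table index per consecutive byte pair."""
    data = text.encode("utf-8")
    return [bigram_bucket(data[i - 1], data[i], n_buckets)
            for i in range(1, len(data))]
```

Each index would select a row of a learned (n_buckets, embed_dim) table that is summed into the token embedding.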
MLP4x: removed the bigram embeddings and used a larger MLP
  parameters: null
MHA: added full multi-head attention
  parameters: {"kv_heads": 8}
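"Full multi-head attention" with kv_heads=8 is read here as every query head having its own key/value head (no grouped-query sharing). A shape-level numpy sketch, omitting causal masking, biases, and positional encoding to stay short.

```python
import numpy as np

def multi_head_attention(x, wq, wk, wv, wo, n_heads=8):
    """Full MHA: kv_heads == n_heads, so no KV sharing across query heads.

    x: (seq, d_model); wq/wk/wv/wo: (d_model, d_model).
    """
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax over keys
    out = (probs @ v).transpose(1, 0, 2).reshape(seq, d_model)
    return out @ wo
```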
Compression: zstd
  level: 22
Sequence Length
  train_length: 8192
  eval_length: null
LR Schedule: warmdown
  parameters: {"warmdown_steps": 4000}
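A sketch of the warmdown schedule: hold the base learning rate, then decay linearly to zero over the final 4,000 steps. Only warmdown_steps comes from the PR; the base rate and the absence of a warmup phase are assumptions.

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_steps=4000):
    """Hold base_lr, then decay linearly to zero over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1.0 at decay_start, 0.0 at end
    return base_lr * max(frac, 0.0)
```

For a hypothetical 10,000-step run, the rate stays at base_lr until step 6,000 and reaches zero at step 10,000.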
Test-Time Training: full TTT
  parameters: {"chunk": 65536}

Novel Contributions

  • Muon optimizer tuning with weight decay, momentum, and warmdown schedule
  • Mixed-precision int6 quantization to fit the artifact under 16MB
  • Sliding window evaluation with stride 64
  • EMA weight averaging
  • BigramHash character embeddings
  • Sequence packing to 8192 tokens
  • MLP width expansion and LeakyReLU squared activation
  • Full multi-head attention with 8 KV heads
  • Test-time training with a 65,536-token chunk size