PR #1626 (open)
Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT — val_bpb 1.07193 (3-seed mean)
by dexhunter
val_bpb
1.0719
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.93 MB
Training Techniques
Architecture
LeakyReLU
Fused MLP uses LeakyReLU(0.5)^2 activation.
parameters: {"slope":0.5}
VarLen Attention
Variable-length causal attention using flash_attn_varlen_func.
parameters: {"causal":true}
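The varlen interface packs documents into one sequence with no padding and describes the boundaries via cumulative sequence lengths. A minimal pure-Python sketch of that bookkeeping (the `cu_seqlens` format is what `flash_attn_varlen_func` consumes; the helper names here are illustrative, not from the PR):

```python
from itertools import accumulate

def build_cu_seqlens(doc_lens):
    # Cumulative boundaries [0, l0, l0+l1, ...] for documents packed
    # back-to-back into a single sequence, as flash-attn's varlen
    # interface expects (one flat tensor, no padding tokens).
    return [0] + list(accumulate(doc_lens))

def varlen_causal_allowed(i, j, cu_seqlens):
    # Reference semantics of varlen causal attention: token i may
    # attend to token j iff j <= i and both fall inside the same
    # document segment [cu[k], cu[k+1]).
    if j > i:
        return False
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        if start <= i < end:
            return start <= j < end
    return False
```

The fused kernel never materializes this block-diagonal causal mask; the function above only documents which pairs it allows.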
MLP3x
Fused MLP with 4x hidden expansion and Triton fusion.
parameters: {"hidden_multiplier":4}
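A numerically equivalent, unfused sketch of the MLP forward pass, assuming "LeakyReLU(0.5)^2" means the leaky rectifier (negative slope 0.5) followed by an elementwise square; the record fuses the two matmuls and the activation into one Triton kernel:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # One reading of LeakyReLU(0.5)^2: leaky rectifier, then square.
    # Note the square makes the output non-negative.
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp(x, w_in, w_out):
    # Two-matmul MLP with 4x hidden expansion (hidden_multiplier=4).
    return leaky_relu_sq(x @ w_in) @ w_out

d = 8
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d, 4 * d)) * 0.1
w_out = rng.standard_normal((4 * d, d)) * 0.1
y = mlp(rng.standard_normal((2, d)), w_in, w_out)
```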
depth recurrence
3-layer recurrence loop in the middle layers.
parameters: {"layers":3}
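The recurrence can be sketched as re-running a middle span of layers; which layers form the loop and how many passes are taken are assumptions here, since the record only states a 3-layer recurrence loop in the middle:

```python
def apply_with_recurrence(layers, x, loop_start, loop_end, n_loops=3):
    # Run the leading layers once, iterate the middle span n_loops
    # times with shared weights, then run the trailing layers once.
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x
```

Weight sharing across loop iterations adds effective depth without adding parameters, which helps at a fixed artifact size.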
U-Net skip connections
Encoder-decoder skip connections.
parameters: null
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"base_dimensions":64}
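Partial RoPE rotates only the first 16 of the 64 per-head dimensions and leaves the rest untouched. A sketch for a single head vector (the frequency `base` is an assumed conventional default, not stated in the record):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate the first rot_dims dimensions as 8 (cos, sin) pairs;
    # pass dimensions rot_dims..64 through unchanged.
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rot, x[rot_dims:]])
```

Because each pair undergoes a pure rotation, the transform is norm-preserving, and the unrotated dimensions carry position-independent content.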
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"adamw":true}
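Muon applies momentum and then orthogonalizes the matrix update with a quintic Newton-Schulz iteration before the weight step; scalar and embedding parameters fall back to AdamW (`adamw: true`). A minimal sketch, with coefficients following the reference Muon implementation:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration: drives the singular values of
    # G toward 1 without an explicit SVD. Coefficients are the tuned
    # constants from the reference Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr, momentum=0.97):
    # Momentum buffer (0.97 per the record) + orthogonalized update.
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf
```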
Weight Averaging
EMA
parameters: {"decay":0.9965}
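The EMA maintains a shadow copy of the weights updated after each training step; evaluation uses the averaged copy. A one-line sketch with the record's decay:

```python
def ema_update(avg, params, decay=0.9965):
    # Shadow weights drift toward the live weights by (1 - decay)
    # per step; decay 0.9965 per the record.
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]
```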
Quantization
GPTQ
bits: 6
scope: full model (embeddings quantized separately as int7)
int7
bits: 7
scope: embeddings
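GPTQ handles the matrix weights at 6 bits; the embedding table is quantized separately at 7 bits. A sketch of the simplest symmetric per-tensor int7 scheme (the record does not state the granularity; per-row scales would be an equally plausible reading):

```python
import numpy as np

def quantize_int7(w):
    # Symmetric 7-bit quantization: integer levels in [-63, 63]
    # around a single scale (one level of the signed range unused).
    scale = np.abs(w).max() / 63.0
    q = np.clip(np.round(w / scale), -63, 63).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At 7 bits the embedding table costs roughly 22% of its fp32 size before entropy coding, which the Brotli stage below compresses further.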
Compression
Brotli
level: 11
Test-Time Training
score-first TTT
parameters: {"phased":true,"num_phases":3,"prefix_docs":2000}
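One plausible reading of the phased score-first protocol: in each of the 3 phases, every prefix document is scored with the current weights before that phase's global SGD update is applied, so no document's loss ever reflects an update computed from that same document within the phase. The `score` and `update` callbacks below are hypothetical stand-ins:

```python
def phased_ttt_eval(score, update, docs, num_phases=3, prefix_docs=2000):
    # Score-first TTT sketch: score the entire prefix with frozen
    # weights, then take one global SGD pass over the prefix between
    # phases (3 phases over 2000 prefix docs per the record).
    prefix = docs[:prefix_docs]
    all_scores = []
    for _ in range(num_phases):
        all_scores.append([score(d) for d in prefix])  # before updating
        update(prefix)                                 # global SGD step
    return all_scores
```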
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
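With `warmdown_frac: 0.75`, the learning rate holds constant for the first quarter of training and then decays linearly to zero over the final three quarters:

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.75):
    # Constant LR for the first (1 - warmdown_frac) of training,
    # then linear warmdown to 0 over the final warmdown_frac.
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```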
Regularization
adaptive clip
parameters: {"mlp_sigmas":12,"attn_sigmas":13,"embed_sigmas":15}
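The sigma parameters suggest per-layer-group gradient clipping at a multiple of a running deviation estimate (12 sigmas for MLP, 13 for attention, 15 for embedding layers). A sketch under that assumption; the exact statistic tracked and the smoothing constant `beta` are not stated in the record:

```python
import math

def adaptive_clip(grad_norm, state, sigmas, beta=0.99):
    # Clip the gradient norm at running_mean + sigmas * running_std,
    # computed from stats accumulated BEFORE seeing this step's norm,
    # then update the running mean / second moment.
    var = max(state['sq'] - state['mean'] ** 2, 0.0)
    limit = state['mean'] + sigmas * math.sqrt(var)
    clipped = min(grad_norm, limit)
    state['mean'] = beta * state['mean'] + (1 - beta) * grad_norm
    state['sq'] = beta * state['sq'] + (1 - beta) * grad_norm ** 2
    return clipped, state
```

Clipping at 12-15 sigmas only suppresses rare outlier spikes while leaving ordinary gradient noise untouched.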
Novel Contributions
- Multi-phase global SGD during phased TTT evaluation
- Three-phase score-before-update adaptation over prefix documents
- Combination of VarLen attention, fused MLP, and phased TTT
- Trimmed GPTQ and int7 embeddings for improved size/performance tradeoff
- Per-layer adaptive clipping and tuned MATRIX_LR