PR #1693

open

Record: Casefold V4 + AttnOutGate + Multi-Phase Global SGD TTT — val_bpb 1.05733 (3-seed mean)

by dexhunter

val_bpb: 1.0573
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.21 MB

Training Techniques

Architecture
Attention Output Gate
Per-head multiplicative gate on attention outputs, zero-initialized so that every head passes through at scale 1.0 at initialization.
parameters: {"layers":11,"heads":8,"width":12}
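One way a zero-initialized gate can still pass heads through at scale 1.0 is to compute it as 2 * sigmoid(w . x), which is exactly 1.0 when w is all zeros. The sketch below assumes that form, and assumes the gate reads the first `width` input features. Neither detail is confirmed by the PR summary, and the names `head_gates` and `apply_head_gates` are hypothetical.

```python
import math

def head_gates(x, gate_weights):
    """Return one gate value per head for a single token.

    x: the first `width` input features of the token (assumed gate input).
    gate_weights: one weight row of length `width` per head.
    Assumed form: gate = 2 * sigmoid(w . x), so all-zero weights give 1.0.
    """
    gates = []
    for row in gate_weights:
        z = sum(w * xi for w, xi in zip(row, x))
        gates.append(2.0 / (1.0 + math.exp(-z)))
    return gates

def apply_head_gates(head_outputs, gates):
    """Scale each head's output vector by that head's gate."""
    return [[g * v for v in head] for head, g in zip(head_outputs, gates)]

heads, width = 8, 12  # values from the parameters line above
gate_weights = [[0.0] * width for _ in range(heads)]  # zero initialization
x = [0.1 * i for i in range(width)]
head_outputs = [[float(h), float(h) + 0.5] for h in range(heads)]
gated = apply_head_gates(head_outputs, head_gates(x, gate_weights))
```

With zero weights, `gated` equals `head_outputs`, so the gated model starts out identical to the ungated one and only diverges as the gate weights are trained.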
SmearGate
Input-dependent residual mixer that blends the current token with the previous token in a causal, backward-looking way.
parameters: {"width":12}
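A minimal sketch of a causal previous-token mixer follows. It assumes the blend is a convex combination controlled by a sigmoid gate computed from the first `width` features of the current token. The exact formula used in the PR is not given here, so `smear_gate` and `gate_w` are illustrative names only.

```python
import math

def smear_gate(tokens, gate_w, width=12):
    """Blend each token vector with the one before it.

    tokens: list of equal-length feature vectors, in sequence order.
    gate_w: gate weights of length `width` (hypothetical parameter).
    Assumed form: out_t = (1 - g_t) * x_t + g_t * x_{t-1},
    with g_t = sigmoid(gate_w . x_t[:width]).
    """
    out = []
    for t, cur in enumerate(tokens):
        if t == 0:
            out.append(list(cur))  # first token has no predecessor
            continue
        prev = tokens[t - 1]  # only looks backward, so the mixer stays causal
        z = sum(w * c for w, c in zip(gate_w, cur[:width]))
        g = 1.0 / (1.0 + math.exp(-z))
        out.append([(1.0 - g) * c + g * p for c, p in zip(cur, prev)])
    return out
```

Because position t only reads positions t and t - 1, changing a later token never alters an earlier output.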
depth recurrence
Layers are reused recurrently, with loops over the encoder and decoder layers.
parameters: {"layers":11}
U-Net skip connections
U-Net-style skip connections linking earlier layers to later ones.
parameters: null
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
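To illustrate rotating only 16 of 64 head dimensions, the sketch below applies a standard rotary transform to the first `rot_dims` features and leaves the rest untouched. The adjacent-pair layout and the frequency base of 10000 are assumptions; the PR summary specifies neither.

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` features of `vec`; copy the rest unchanged.

    Features are rotated in adjacent pairs (2i, 2i+1), each pair by an
    angle of position * base ** (-2i / rot_dims).
    """
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = position * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

head = [1.0] * 64  # total_dimensions = 64
rotated = partial_rope(head, position=3)
```

The last 48 features carry no positional signal in this sketch, which is the point of the partial variant.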
LeakyReLU
MLP activation: a squared LeakyReLU variant.
parameters: {"negative_slope":0.5}
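Assuming "squared" means the LeakyReLU output is squared elementwise (a reading not spelled out in the summary), the activation can be sketched as:

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU with the given slope, then square the result."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Under this reading, negative inputs still contribute a positive activation at a quarter of the magnitude of the corresponding positive input, since 0.5 squared is 0.25.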
weight tying
Tied input and output embeddings.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
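Logit softcapping is commonly implemented as a scaled tanh, and the sketch below assumes that form with the cap value of 30 listed above.

```python
import math

def softcap(logit, cap=30.0):
    """Squash a logit smoothly into the range [-cap, cap]."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while large ones saturate near the cap instead of growing without bound.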
layerwise LN scale
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.9965}
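An exponential moving average of weights with this decay can be sketched as follows; the function name and the flat-list weight layout are illustrative only.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Return the updated EMA: decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With a decay of 0.9965, each step moves the average 0.35% of the way toward the current weights.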
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"row_normalized":true,"newton_schulz_steps":5}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 7
scope: embeddings
Compression
brotli
level: 11
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000,"learning_rate":0.001,"momentum":0.9,"gradient_clip":1}
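The core ordering of score-first test-time training is that each chunk is scored with the current parameters before the model adapts on it, so no chunk is evaluated with weights that have already seen it. The toy sketch below shows that ordering with momentum SGD and gradient clipping on a one-parameter model. The model, the squared-error loss, and the function names are hypothetical, and the multi-phase prefix scheme from the PR is not modeled.

```python
def sgd_step(param, grad, velocity, lr=0.001, momentum=0.9, clip=1.0):
    """One momentum-SGD step with the gradient clipped to [-clip, clip]."""
    grad = max(-clip, min(clip, grad))
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity

def score_first_ttt(chunks, param=0.0):
    """Score each chunk first, then adapt the parameter on that chunk."""
    velocity = 0.0
    scores = []
    for chunk in chunks:
        # 1) Score with the parameters as they stand before seeing this chunk.
        loss = sum((x - param) ** 2 for x in chunk) / len(chunk)
        scores.append(loss)
        # 2) Only afterwards take a gradient step on the same chunk.
        grad = sum(2.0 * (param - x) for x in chunk) / len(chunk)
        param, velocity = sgd_step(param, grad, velocity)
    return scores, param

scores, final_param = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
```

The first recorded score reflects the untouched starting parameter, and later scores benefit only from updates made on earlier chunks.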
LoRA TTT
parameters: {"rank":96,"learning_rate":0.0001,"chunk":48,"batch_size":64}
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.75}
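A warmdown fraction of 0.75 suggests the learning rate holds steady for the first quarter of training and decays over the remaining three quarters. The sketch below assumes a linear decay to zero, which the summary does not state.

```python
def lr_scale(step, total_steps, warmdown_fraction=0.75):
    """Return the learning-rate multiplier at `step` (1.0 before warmdown)."""
    warmdown_start = total_steps * (1.0 - warmdown_fraction)
    if step < warmdown_start:
        return 1.0
    # Linear ramp from 1.0 at warmdown_start down to 0.0 at total_steps.
    return (total_steps - step) / (total_steps - warmdown_start)
```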
Evaluation
sliding window eval
parameters: {"causal":true}
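In a causal sliding-window evaluation, each token is scored exactly once, using only the tokens to its left as context. The sketch below plans such windows; the `window` and `stride` values are hypothetical, since the summary gives neither.

```python
def sliding_windows(n_tokens, window=8, stride=4):
    """Plan evaluation spans as (context_start, score_start, score_end).

    Tokens in [score_start, score_end) are scored; tokens in
    [context_start, score_start) are fed only as left context.
    """
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        context_start = max(0, score_end - window)
        spans.append((context_start, score_start, score_end))
        score_start = score_end
    return spans

spans = sliding_windows(10)
```

A smaller stride gives each scored token more left context on average, at the cost of more forward passes.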

Novel Contributions

  • Attention Output Gate applied per head across all attention paths with zero initialization
  • SmearGate residual mixer combined with the casefold V4 baseline
  • Multi-phase global SGD score-first TTT with phased prefix scoring
  • Casefold V4 tokenizer retrained from scratch on lowercased data
  • Record val_bpb of 1.05733 (3-seed mean)
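For the casefold tokenizer contribution, the case normalization applied before tokenizer training might look like the sketch below. Whether the PR uses Python's `str.casefold` or plain lowercasing is not stated here, so `normalize_case` is an illustrative stand-in.

```python
def normalize_case(text):
    """Fold case so that differently cased spellings share one form."""
    return text.casefold()

sample = normalize_case("Hello World")
```

In Python, `casefold` is slightly more aggressive than `lower`: for example, the German "ß" becomes "ss" under `casefold` but is left unchanged by `lower`.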