PR #1693

open

Record: Casefold V4 + AttnOutGate + Multi-Phase Global SGD TTT — val_bpb 1.05733 (3-seed mean)

by dexhunter

val_bpb

1.0573

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.21 MB

Training Techniques

Architecture

Attention Output Gate

Per-head multiplicative gate on attention outputs. The gate weights are zero-initialized so that every head passes through at scale 1.0 at initialization.

parameters: {"layers":11,"heads":8,"width":12}
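
A minimal sketch of how such a gate could look, not the PR's code. It assumes the gate is 2*sigmoid of a linear map of the first `width` input channels (which gives exactly 1.0 for zero weights) and a model dimension of 512; both are illustrative assumptions, as are all names.

```python
import numpy as np

def attn_out_gate(head_out, x, w_gate):
    """head_out: (T, H, D) per-head attention outputs.
    x: (T, C) layer input. w_gate: (width, H) gate weights (hypothetical layout)."""
    width = w_gate.shape[0]
    logits = x[:, :width] @ w_gate           # (T, H), one gate logit per head
    gate = 2.0 / (1.0 + np.exp(-logits))     # assumed 2*sigmoid form: 1.0 at zero logits
    return head_out * gate[:, :, None]       # scale every channel of each head

rng = np.random.default_rng(0)
T, H, D, C, width = 5, 8, 64, 512, 12        # C = 512 is an assumption
head_out = rng.standard_normal((T, H, D))
x = rng.standard_normal((T, C))
w_gate = np.zeros((width, H))                # zero init
out = attn_out_gate(head_out, x, w_gate)     # identical to head_out at init
```

With zero weights the layer behaves like ungated attention, so training starts from the baseline and only departs from it as the gate weights move.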

SmearGate

Input-dependent residual mixer that blends the current token with the previous token in a causal, backward-looking way.

parameters: {"width":12}
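
One way to picture the mixer, as a sketch under assumptions: the gate is a sigmoid of a linear map of the first `width` channels of the current token, and the output is the current token plus the gated previous token. The exact form in the PR may differ; every name here is hypothetical.

```python
import numpy as np

def smear_gate(x, w):
    """x: (T, C) token activations. w: (width,) hypothetical gate weights."""
    width = w.shape[0]
    g = 1.0 / (1.0 + np.exp(-(x[:, :width] @ w)))  # (T,) gate computed from the current token
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]                              # previous token; position 0 sees zeros
    return x + g[:, None] * prev                   # only looks backward, so it stays causal

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 32))
w = rng.standard_normal(12)
y = smear_gate(x, w)
```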

depth recurrence

Recurrent layer reuse with encoder/decoder layer loops.

parameters: {"layers":11}

U-Net skip connections

Skip connections that feed encoder-side layer outputs into the matching decoder-side layers.

parameters: null

Partial RoPE

Rotary position embeddings applied to a subset of dimensions.

parameters: {"dimensions":16,"total_dimensions":64}
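
The sketch below rotates the first 16 of 64 head channels and passes the remaining 48 through unchanged. The half-split pairing of channels and the frequency base of 10000 are assumptions, not details taken from the PR.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """x: (T, D) queries or keys of one head. Rotates only the first rot_dims channels."""
    T, D = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)         # (half,) rotation frequencies
    angles = np.arange(T)[:, None] * inv_freq[None, :]   # (T, half) position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)  # tail is position-free

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 64))
q_rot = partial_rope(q)
```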

LeakyReLU

LeakyReLU squared MLP activation variant.

parameters: {"negative_slope":0.5}
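
Read literally, the activation is a LeakyReLU with slope 0.5 followed by squaring. A scalar sketch under that assumption:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU first, then square (assumed order of operations).
    y = x if x >= 0 else negative_slope * x
    return y * y

positive = leaky_relu_sq(2.0)    # 2.0 squared
negative = leaky_relu_sq(-2.0)   # (-1.0) squared
```

Unlike ReLU squared, negative inputs still produce a nonzero output and gradient.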

weight tying

Tied input and output embeddings.

parameters: null

Regularization

logit softcap

parameters: {"value":30}
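
Softcapping is commonly implemented as a scaled tanh. A sketch assuming that form, with the cap of 30 from the parameters above:

```python
import math

def softcap(logit, cap=30.0):
    # Smoothly bounds the logit to (-cap, cap); close to identity for small values.
    return cap * math.tanh(logit / cap)

small = softcap(1.0)      # nearly unchanged
large = softcap(1000.0)   # saturates at the cap
```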

layerwise LN scale

parameters: null

Weight Averaging

EMA

parameters: {"decay":0.9965}
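
A minimal sketch of the standard EMA update with the listed decay. The dictionary of scalar weights is only a stand-in for model parameters.

```python
def ema_update(ema, weights, decay=0.9965):
    # New average = decay * old average + (1 - decay) * current weights.
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in weights}

ema = {"w": 0.0}
for _ in range(1000):
    ema = ema_update(ema, {"w": 1.0})   # average drifts toward the current value
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 steps.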

Optimizer

Muon

weight_decay: null

momentum: 0.97

other_params: {"row_normalized":true,"newton_schulz_steps":5}
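
For orientation, here is a sketch of the five-step Newton-Schulz orthogonalization used in the public Muon reference implementation, with its quintic coefficients. The `row_normalized` option listed above is not shown, since its exact definition is not given here, and momentum is omitted.

```python
import numpy as np

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D gradient matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315          # coefficients from the Muon reference code
    x = g / (np.linalg.norm(g) + eps)          # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:                             # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
grad = rng.standard_normal((16, 32))
update = newton_schulz(grad)
singular_values = np.linalg.svd(update, compute_uv=False)  # pushed toward 1
```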

AdamW

weight_decay: null

momentum: null

other_params: {"used_for":"embeddings/scalars"}

Quantization

GPTQ

bits: 6

scope: attention/MLP matrices

GPTQ

bits: 7

scope: embeddings
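
To illustrate what the bit widths mean, the sketch below uses plain round-to-nearest quantization onto a symmetric per-row grid. This is not GPTQ: GPTQ additionally compensates rounding error using second-order statistics, which is omitted here. The per-row scaling is an assumption.

```python
import numpy as np

def quantize_rtn(w, bits):
    """Round-to-nearest onto a symmetric integer grid with one scale per row."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits, 63 for 7 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # avoid dividing by zero on empty rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))
q6, s6 = quantize_rtn(w, bits=6)
q7, s7 = quantize_rtn(w, bits=7)
err6 = np.abs(dequantize(q6, s6) - w).max()
err7 = np.abs(dequantize(q7, s7) - w).max()
```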

Compression

brotli

level: 11

Test-Time Training

score-first TTT

parameters: {"phases":3,"prefix_docs":2000,"learning_rate":0.001,"momentum":0.9,"gradient_clip":1}
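
The key property of score-first TTT is ordering: each chunk is scored with the current weights before it is used for an update, so the reported loss never benefits from training on the tokens being scored. The sketch below shows that ordering on a toy squared-error model with SGD momentum and gradient clipping. The toy model and all names are assumptions, and the multi-phase schedule is not shown.

```python
import numpy as np

def score_first_ttt(chunks, theta, lr=0.001, momentum=0.9, clip=1.0):
    """Toy model: theta predicts every row of a chunk; loss is mean squared error."""
    velocity = np.zeros_like(theta)
    scores = []
    for chunk in chunks:                               # chunk: (N, D)
        resid = theta - chunk
        scores.append(float((resid ** 2).mean()))      # 1) score with current weights
        grad = 2.0 * resid.mean(axis=0) / theta.size   # 2) gradient of that same loss
        norm = np.linalg.norm(grad)
        if norm > clip:                                # clip the gradient norm
            grad = grad * (clip / norm)
        velocity = momentum * velocity + grad          # SGD with momentum
        theta = theta - lr * velocity                  # 3) update only after scoring
    return scores, theta

rng = np.random.default_rng(0)
chunks = [rng.standard_normal((16, 3)) + 1.0 for _ in range(5)]
theta0 = np.zeros(3)
scores, theta = score_first_ttt(chunks, theta0)
```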

LoRA TTT

parameters: {"rank":96,"learning_rate":0.0001,"chunk":48,"batch_size":64}
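
A sketch of a standard LoRA adapter on one linear layer, using the common convention of a zero-initialized up-projection so the adapter starts as a no-op. The layer sizes are illustrative, and the chunking and batching from the parameters above are not shown.

```python
import numpy as np

class LoRALinear:
    def __init__(self, weight, rank, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                        # frozen base weight
        self.A = rng.standard_normal((rank, in_dim)) / np.sqrt(in_dim)
        self.B = np.zeros((out_dim, rank))                          # zero init: no-op at start

    def __call__(self, x):
        # Base projection plus the low-rank update B @ A applied to x.
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(0)
weight = rng.standard_normal((512, 512))      # 512 is an assumed model width
layer = LoRALinear(weight, rank=96, rng=rng)
x = rng.standard_normal((4, 512))
y = layer(x)
trainable = layer.A.size + layer.B.size       # rank * (in_dim + out_dim)
```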

LR Schedule

warmdown

parameters: {"warmdown_fraction":0.75}
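
A sketch of a warmdown schedule with the listed fraction, assuming the learning rate is held constant and then decays linearly to zero over the final 75% of steps. The linear shape and the function name are assumptions.

```python
def lr_scale(step, total_steps, warmdown_fraction=0.75):
    """Multiplier applied to the base learning rate at a given step."""
    warmdown_steps = int(total_steps * warmdown_fraction)
    start = total_steps - warmdown_steps
    if warmdown_steps == 0 or step < start:
        return 1.0                                   # constant phase
    return (total_steps - step) / warmdown_steps     # linear decay to zero

early = lr_scale(0, 1000)
middle = lr_scale(625, 1000)
final = lr_scale(1000, 1000)
```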

Evaluation

sliding window eval

parameters: {"causal":true}
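
A sketch of how sliding-window evaluation can be arranged so that every token is scored exactly once while still seeing preceding context. The window and stride values below are hypothetical.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (ctx_start, end, score_start): the model reads tokens[ctx_start:end]
    and only positions score_start .. end-1 count toward the loss."""
    score_start = 0
    while score_start < n_tokens:
        if score_start == 0:
            end = min(window, n_tokens)              # first window scores everything it sees
        else:
            end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)             # earlier tokens serve as context only
        yield ctx_start, end, score_start
        score_start = end

windows = list(sliding_windows(n_tokens=10, window=4, stride=2))
scored = [i for _, end, start in windows for i in range(start, end)]
```

After the first window, each scored token has at least `window - stride` tokens of left context.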

Novel Contributions

Attention Output Gate applied per head across all attention paths with zero initialization
SmearGate residual mixer combined with the casefold V4 baseline
Multi-phase global SGD score-first TTT with phased prefix scoring
Casefold V4 tokenizer retrained from scratch on lowercased data
Record-setting val_bpb of 1.05733 (3-seed mean)
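
For reference, a sketch of how a bits-per-byte figure and a seed mean are computed, assuming val_bpb denotes bits per byte: the summed negative log-likelihood in nats, converted to bits and divided by the byte count of the evaluated text. The function names are hypothetical, and no per-seed values are implied.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    # Convert nats to bits (divide by ln 2), then normalize by bytes rather than tokens.
    return total_nll_nats / (math.log(2) * total_bytes)

def seed_mean(values):
    return sum(values) / len(values)

example = bits_per_byte(total_nll_nats=math.log(2) * 100, total_bytes=100)
```

Normalizing by bytes rather than tokens keeps scores comparable across tokenizers.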