PR #1539
openRecord: SP8192 + Pre-Quant AdamW TTT + Compiled TTT — val_bpb 1.0587 (3-seed mean)
by translatingthename
val_bpb: 1.0587
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.5 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
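The mixed int6/int8 split can be illustrated with plain round-to-nearest quantization. This is only the grid arithmetic: GPTQ proper additionally compensates rounding error column-by-column using second-order (Hessian) information, which is omitted here, so treat this as a sketch of the bit-width budget, not of the GPTQ algorithm.

```python
# Minimal symmetric round-to-nearest quantizer illustrating the int6 grid
# (attention/MLP matrices) vs. the int8 grid (embeddings). GPTQ's
# error-compensation step is deliberately omitted.
def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6, 127 for int8
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.9, -0.31, 0.05, -1.2]
q6, s6 = quantize(w, 6)    # 6-bit: coarser grid, smaller artifact
q8, s8 = quantize(w, 8)    # 8-bit: finer grid for the embeddings
```

The finer 8-bit grid gives a strictly smaller worst-case rounding error on the same weights, which is why the embeddings get the extra bits.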
Architecture
depth recurrence
3-layer recurrence with repeated layers to create virtual depth
parameters: {"layers":3,"virtual_layers":14,"physical_layers":11}
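The 11-physical / 14-virtual split means a 3-layer block is run twice per forward pass. Which three layers repeat is not stated in the record; the sketch below assumes the block starting at layer 4, purely for illustration.

```python
# Depth-recurrence schedule sketch: 11 physical layers unrolled into 14
# virtual layers by running one 3-layer block twice. The repeated block's
# position (layers 4-6 here) is an assumption, not stated in the record.
def recurrence_schedule(physical=11, repeat_start=4, repeat_len=3):
    layers = list(range(physical))
    block = layers[repeat_start:repeat_start + repeat_len]
    # e.g. 0,1,2,3,[4,5,6],[4,5,6],7,8,9,10 -> virtual depth 14
    return layers[:repeat_start + repeat_len] + block + layers[repeat_start + repeat_len:]

schedule = recurrence_schedule()
```

Only 11 layers' worth of parameters are stored, so the artifact size reflects physical depth while the compute graph gets the extra virtual depth.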
Parallel Residuals
GPT-J-style two-lane residual in which attention and MLP both read the same normalized input in parallel and their outputs are summed back into the residual stream
parameters: {"start_layer":7}
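The two-lane layout can be contrasted with a standard sequential block in a few lines. Toy scalar stand-ins replace the real attention/MLP/norm modules; per the parameters, the parallel layout kicks in at layer 7.

```python
# GPT-J-style parallel residual: attention and MLP read the same normalized
# input and both add into the residual stream, vs. the usual attn-then-MLP.
START_LAYER = 7

def parallel_block(x, attn, mlp, norm):
    h = norm(x)
    return x + attn(h) + mlp(h)          # two lanes, merged by addition

def sequential_block(x, attn, mlp, norm):
    x = x + attn(norm(x))
    return x + mlp(norm(x))

def block(layer_idx, x, attn, mlp, norm):
    if layer_idx >= START_LAYER:
        return parallel_block(x, attn, mlp, norm)
    return sequential_block(x, attn, mlp, norm)
```

In the parallel form the MLP no longer sees the attention output, which trades a little expressivity for the two lanes being computable concurrently.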
XSA
XSA applied across all layers for efficient GQA-aware attention
parameters: {"layers":11}
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions
parameters: {"dimensions":16,"total_dimensions":64}
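With 16 of 64 head dimensions rotated, a sketch of partial RoPE looks like the following; the frequency convention (base 10000, pairwise rotation over the first 16 dims) is the common one and assumed here.

```python
import math

# Partial RoPE sketch: rotary embeddings applied to only the first 16 of the
# 64 head dimensions; the remaining 48 pass through untouched.
ROT, HEAD = 16, 64

def partial_rope(vec, pos, base=10000.0):
    out = list(vec)
    for i in range(0, ROT, 2):                 # rotate (i, i+1) pairs
        theta = pos * base ** (-i / ROT)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

Because rotation is norm-preserving, only the relative phase of the first 16 dims carries position; the unrotated 48 dims stay position-agnostic.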
weight tying
Tied input and output embeddings
parameters: null
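Weight tying amounts to reusing the input embedding matrix as the output head, which is also part of why the artifact stays small. A minimal sketch:

```python
# Weight-tying sketch: logits come from dotting the hidden state with each
# row of the input embedding matrix, so no separate unembedding is stored.
def tied_logits(hidden, embedding):            # embedding: vocab x d
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
```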
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"negative_slope":0.5}
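Read literally, "LeakyReLU squared" with negative_slope 0.5 is the square of a leaky ReLU. Note that squaring makes the negative branch positive (0.25·x²); whether the record uses this or a sign-preserving variant is not specified, so treat this as one plausible reading.

```python
# One reading of the "LeakyReLU squared" MLP activation: square the output
# of a leaky ReLU with negative_slope 0.5. The squaring flips the sign of
# the negative branch; a sign-preserving variant is also conceivable.
NEGATIVE_SLOPE = 0.5

def leaky_relu(x, slope=NEGATIVE_SLOPE):
    return x if x >= 0 else slope * x

def leaky_relu_squared(x):
    return leaky_relu(x) ** 2
```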
SmearGate
SmearGate mechanism included in the architecture
parameters: null
U-Net skip connections
Sigmoid-gated U-Net style skip connections
parameters: null
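A sigmoid-gated skip adds an earlier layer's activations into a later layer's stream, scaled by a learned gate. The scalar gate and its zero initialization (giving a gate value of 0.5) are assumptions for illustration.

```python
import math

# Sigmoid-gated U-Net skip sketch: the late stream receives the early
# stream scaled by sigmoid(g), where g is a learned scalar (zero init is
# an assumption; at g=0 the gate passes half of the skip through).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_skip(late, early, g=0.0):
    gate = sigmoid(g)
    return [l + gate * e for l, e in zip(late, early)]
```

Driving g strongly negative lets the model learn to shut a skip off entirely.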
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
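With 8 query heads over 4 KV heads, each KV head serves a group of two consecutive query heads, halving KV-cache size relative to full multi-head attention:

```python
# GQA head mapping: 8 query heads share 4 KV heads, so each KV head serves
# a group of 8 // 4 = 2 consecutive query heads.
HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS      # 2 query heads per KV head

def kv_head_for(query_head):
    return query_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```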
Value Embeddings
44-dimensional value embeddings added at layers 9 and 10
parameters: {"dimension":44,"layers":[9,10]}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":4}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"used_for":"embeddings and scalars","embedding_lr":0.03,"scalar_lr":0.02}
Weight Averaging
EMA
parameters: {"decay":0.9965}
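The EMA update with decay 0.9965 moves a shadow copy of the weights a fraction (1 − decay) toward the current parameters each step; the averaged weights, not the raw ones, go into the final artifact.

```python
# EMA of weights with decay 0.9965. Effective averaging horizon is roughly
# 1 / (1 - decay) ~ 286 steps.
DECAY = 0.9965

def ema_update(shadow, params, decay=DECAY):
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```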
Compression
Brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":64}
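Sliding-window evaluation with stride 64 hops the context window over the sequence in 64-token steps and scores only the final 64 tokens of each window, so every token is scored exactly once with close to full left context. The window length (2048 in the sketch's default) is an assumption for illustration; only the stride is given in the record.

```python
# Sliding-window eval index sketch: each span scores tokens
# [score_start, score_end) using context from context_start onward.
def eval_windows(seq_len, window=2048, stride=64):
    spans = []
    for start in range(0, seq_len, stride):
        lo = max(0, start + stride - window)
        spans.append((lo, start, start + stride))  # (context_start, score_start, score_end)
    return spans
```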
Test-Time Training
full TTT
parameters: {"epochs":6,"learning_rate":0.0005,"freeze_blocks":2,"compiled":true}
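The TTT recipe fine-tunes the model on the evaluation data itself before scoring: 6 epochs at lr 5e-4 with the first 2 blocks frozen (and the loop wrapped in torch.compile in the actual record). Below is a pure-Python stand-in with per-"block" scalar weights and a squared loss, just to show where the three hyperparameters act:

```python
# Toy test-time-training loop: 6 epochs, lr 5e-4, first 2 "blocks" frozen.
# Scalar weights and squared loss stand in for the real model; the record's
# loop is additionally compiled with torch.compile.
EPOCHS, LR, FREEZE_BLOCKS = 6, 5e-4, 2

def ttt(weights, data, targets):
    w = list(weights)
    for _ in range(EPOCHS):
        for x, y in zip(data, targets):
            pred = sum(wi * x for wi in w)
            grad = 2.0 * (pred - y) * x
            for i in range(FREEZE_BLOCKS, len(w)):   # frozen blocks skipped
                w[i] -= LR * grad
    return w
```

"Pre-quant" in the title indicates this adaptation happens before GPTQ quantization, so the adapted weights are what get baked into the artifact.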
LR Schedule
cosine decay
parameters: {"final_lr_factor":0.1}
warmdown
parameters: {"warmdown_fraction":0.72}
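One way to compose the two schedule entries: hold the base LR for the first 28% of steps, then cosine-decay over the final 72% (warmdown_fraction 0.72) down to 0.1× the base LR (final_lr_factor 0.1). How the record actually combines them is not spelled out, so this composition is an assumption.

```python
import math

# LR schedule sketch: flat, then a cosine warmdown over the last 72% of
# training decaying to 0.1x the base LR. The composition of the two listed
# entries (cosine decay + warmdown) is an assumption.
WARMDOWN_FRAC, FINAL_FACTOR = 0.72, 0.1

def lr_at(step, total_steps, base_lr):
    warmdown_start = (1.0 - WARMDOWN_FRAC) * total_steps
    if step < warmdown_start:
        return base_lr
    t = (step - warmdown_start) / (total_steps - warmdown_start)  # 0 -> 1
    cos = 0.5 * (1.0 + math.cos(math.pi * t))                     # 1 -> 0
    return base_lr * (FINAL_FACTOR + (1.0 - FINAL_FACTOR) * cos)
```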
Regularization
weight decay
parameters: {"value":0.095}
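The 0.095 value matches the weight_decay listed under both optimizers, i.e. decoupled weight decay in the AdamW style: weights shrink multiplicatively toward zero each step, separately from the gradient update.

```python
# Decoupled weight decay at 0.095 (AdamW-style): each step shrinks weights
# by lr * wd, independent of the gradient term.
WD = 0.095

def apply_weight_decay(weights, lr):
    return [w * (1.0 - lr * WD) for w in weights]
```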
logit softcap
parameters: {"value":30}
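Logit softcapping at 30 squashes logits through cap · tanh(x / cap), which is near-identity for small logits but bounds every logit to (−30, 30):

```python
import math

# Logit softcap: ~identity near zero, saturates at +/-30 for large logits.
CAP = 30.0

def softcap(x, cap=CAP):
    return cap * math.tanh(x / cap)
```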
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(layer+1)"}
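The stated rule scales each layer's normalization output by 1/sqrt(layer + 1), progressively damping deeper layers' contributions to the residual stream (indices 0..10 for the 11 physical layers):

```python
import math

# Layerwise LN scale per the stated rule: 1/sqrt(layer + 1), monotonically
# decreasing with depth.
def ln_scale(layer):
    return 1.0 / math.sqrt(layer + 1)

scales = [ln_scale(l) for l in range(11)]
```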
Novel Contributions
- Pre-quant AdamW test-time training baked into the final artifact
- Compiled TTT with torch.compile for faster validation fine-tuning
- SP8192 with GPTQ SDClip quantization using mixed int6/int8 precision
- 3-layer depth recurrence producing 14 virtual layers from 11 physical layers
- Parallel residual architecture with GPT-J style two-lane merging
- Combined MuonEq-R training with EMA, warmdown, and tuned QK gain