PR #462 (closed)
Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)
by JoeProAI
val_bpb: 1.0672
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
SwiGLU FFN
Feed-forward network uses SwiGLU with Star-ReLU activation.
parameters: null
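The PR does not include code; a minimal NumPy sketch of a SwiGLU-style FFN with StarReLU on the gate branch (StarReLU constants taken from the MetaFormer paper; all weight shapes here are illustrative) might look like:

```python
import numpy as np

def star_relu(x, s=0.8944, b=-0.4472):
    # StarReLU: s * relu(x)**2 + b (constants from the MetaFormer paper)
    return s * np.maximum(x, 0.0) ** 2 + b

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU-style FFN: the activated gate branch elementwise-multiplies
    # the up-projection before the down-projection.
    return (star_relu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, h = 128, 512  # illustrative model and hidden sizes
x = rng.standard_normal((4, d))
w_gate = rng.standard_normal((d, h)) / np.sqrt(d)
w_up = rng.standard_normal((d, h)) / np.sqrt(d)
w_down = rng.standard_normal((h, d)) / np.sqrt(h)
y = swiglu_ffn(x, w_gate, w_up, w_down)  # shape (4, 128)
```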
U-Net
U-Net-style skip connections with learned gating.
parameters: null
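One plausible form of a gated U-Net-style long skip (the PR does not specify the gate's shape; a scalar sigmoid gate is assumed here):

```python
import numpy as np

def gated_skip(x_decoder, x_encoder, gate_logit):
    # U-Net-style long skip: add the matching encoder-side activation,
    # scaled by a learned gate (sigmoid keeps it in (0, 1)).
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return x_decoder + gate * x_encoder

x_dec = np.ones((4, 8))
x_enc = 2 * np.ones((4, 8))
out = gated_skip(x_dec, x_enc, 0.0)  # gate_logit 0 -> gate = 0.5
```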
BigramHash
BigramHash embeddings for token representation.
parameters: {"buckets":8192,"dimension":128}
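A sketch of hashed bigram embeddings with the record's bucket count and dimension; the hash multiplier and zero-padding of the first position are illustrative assumptions, not the record's actual choices:

```python
import numpy as np

BUCKETS, DIM = 8192, 128  # from the record's parameters

def bigram_hash_embed(tokens, table):
    # Hash each (previous token, current token) pair into one of BUCKETS
    # buckets and look up its embedding, which would be added to the usual
    # token embedding. The multiplier below is an illustrative choice.
    prev = np.concatenate([[0], tokens[:-1]])
    buckets = (prev * 1000003 + tokens) % BUCKETS
    return table[buckets]

table = np.random.default_rng(0).standard_normal((BUCKETS, DIM)).astype(np.float32) * 0.02
emb = bigram_hash_embed(np.array([5, 17, 17, 42]), table)  # shape (4, 128)
```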
SmearGate
SmearGate applied on embeddings.
parameters: null
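The PR gives no definition of SmearGate; one plausible reading is that each token's embedding is "smeared" with the previous position's through a learned sigmoid gate, sketched here under that assumption:

```python
import numpy as np

def smear_gate(emb, gate_logit):
    # Assumed interpretation: mix in the previous position's embedding
    # through a learned sigmoid gate (position 0 reuses itself).
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    prev = np.concatenate([emb[:1], emb[:-1]], axis=0)
    return emb + gate * prev

emb = np.arange(6.0).reshape(3, 2)
smeared = smear_gate(emb, 0.0)  # gate_logit 0 -> gate = 0.5
```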
Partial RoPE
Rotary positional embeddings applied only partially.
parameters: {"dimensions":16}
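Partial RoPE rotates only the first 16 dimensions of each head and leaves the rest untouched; a NumPy sketch for a single head (head size 64 is an illustrative assumption):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # Apply rotary embeddings to only the first rot_dims dimensions of each
    # position's head vector; the remaining dimensions pass through unchanged.
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), inv_freq)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

q = np.random.default_rng(0).standard_normal((8, 64))  # (seq, head_dim)
q_rot = partial_rope(q)
```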
KV head count
Uses 8 attention heads and 8 KV heads (one KV head per query head, i.e. standard multi-head attention rather than GQA/MQA).
parameters: {"heads":8,"kv_heads":8}
weight tying
Input embedding and output (LM head) matrices are tied.
parameters: null
XSA
Cross-sequence attention on the last 4 layers.
parameters: {"layers":4}
Weight Averaging
EMA
Exponential moving average of model weights, evaluated in place of the raw weights.
parameters: {"decay":0.9985}
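The EMA update with the record's decay, sketched over a parameter dict:

```python
def ema_update(avg, current, decay=0.9985):
    # Exponential moving average of weights; the averaged copy is the one
    # that gets evaluated. decay=0.9985 is the record's parameter.
    return {k: decay * avg[k] + (1.0 - decay) * v for k, v in current.items()}

avg = {"w": 1.0}
avg = ema_update(avg, {"w": 0.0})  # avg["w"] -> 0.9985
```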
Test-Time Training
AdamW TTT
Test-time training: the model is briefly fine-tuned with AdamW at evaluation time.
parameters: {"learning_rate":0.0005,"epochs":10,"weight_decay":0}
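A sketch of the optimizer side of this: one AdamW step (with weight_decay=0, as in the record's parameters, it reduces to plain Adam), driven here by a stand-in quadratic loss since the actual TTT objective is not given in the PR:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.0):
    # One AdamW update with bias correction; wd=0 matches the record.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

# Test-time training loop sketch: 10 "epochs" of updates, as in the record.
p = np.array([1.0])
m = v = np.zeros(1)
for t in range(1, 11):
    g = 2.0 * p  # gradient of a stand-in loss p**2
    p, m, v = adamw_step(p, g, m, v, t)
```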
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
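The stated scale rule is directly computable; deeper layers get a smaller LayerNorm gain, which damps residual-stream growth with depth:

```python
import math

def ln_scale(layer_idx):
    # Depth-dependent LayerNorm gain 1/sqrt(layer_idx + 1) from the record:
    # layer 0 keeps scale 1.0, layer 3 gets 0.5, and so on.
    return 1.0 / math.sqrt(layer_idx + 1)
```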
Quantization
int6
bits: 6
scope: all
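A minimal sketch of symmetric per-tensor int6 quantization (the record does not state its quantization scheme; per-tensor symmetric round-to-nearest is assumed here):

```python
import numpy as np

def quantize_int6(w):
    # Symmetric 6-bit quantization: integer levels in [-31, 31],
    # one scale per tensor (an assumed granularity).
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)  # max error bounded by scale / 2
```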
Late QAT
bits: null
scope: all
LR Schedule
warmdown
parameters: {"steps":6000}
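A common shape for a warmdown schedule is constant LR followed by a linear decay to zero; only the 6000-step warmdown length comes from the record, the rest is an assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=6000):
    # Constant LR, then linear decay ("warmdown") to zero over the final
    # warmdown_steps. total_steps and base_lr are caller-supplied assumptions.
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```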
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
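Muon's core step approximately orthogonalizes each matrix gradient with a Newton-Schulz iteration before applying it; a NumPy sketch using the quintic coefficients from the public Muon implementation (momentum and the record's matrix_lr would wrap around this):

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    # Quintic Newton-Schulz iteration that maps the gradient toward U V^T
    # from its SVD; coefficients follow the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # keep the Gram matrix small
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x

g = np.random.default_rng(0).standard_normal((64, 128))
o = newton_schulz_orth(g)
sv = np.linalg.svd(o, compute_uv=False)  # singular values pushed toward 1
```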
Novel Contributions
- SwiGLU FFN with Star-ReLU activation
- U-Net skip connections with learned gating
- BigramHash embeddings
- SmearGate on embeddings
- GEPA-discovered architecture search result
- Combination of XSA4, EMA, AdamW TTT, Partial RoPE, LN Scale, and Late QAT
- Int6 quantization with zstd-22 compression