PR #430

open

Value Residual + Gated Attention + XSA + EMA + AdamW TTT — val_bpb pending H100

by sahiee-dev
val_bpb: 1.1428
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.9 MB

Training Techniques

Quantization
int5
bits: 5
scope: all
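The PR only states 5-bit quantization over all weights; the exact scheme (zero-point, grouping) is not given. A minimal sketch assuming per-tensor symmetric codes in [-15, 15]:

```python
import numpy as np

def quantize_int5(w):
    """Symmetric per-tensor 5-bit quantization (codes in -15..15).
    Returns integer codes plus the scale needed to dequantize."""
    max_abs = np.abs(w).max()
    scale = max_abs / 15.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale):
    """Recover approximate float weights from int5 codes."""
    return q.astype(np.float32) * scale
```

With this layout the worst-case per-weight reconstruction error is half a quantization step (scale / 2).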
Architecture
SwiGLU
Replaced ReLU² MLP activation with SwiGLU using iso-parameter 2/3 hidden scaling.
parameters: {"hidden":938}
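A sketch of the SwiGLU MLP described above. The hidden width 938 matches iso-parameter 2/3 scaling of a 4x MLP at d_model = 352 (938 ≈ (2/3)·4·352), but the model width is an assumption here, not stated in the PR:

```python
import numpy as np

def silu(z):
    """SiLU / swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, W_gate, W_up, W_down):
    """SwiGLU MLP: silu(x @ W_gate) gates (x @ W_up), then project down.
    Three weight matrices instead of two, so the hidden width is scaled
    by ~2/3 to keep parameter count equal to the ReLU^2 MLP it replaces."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_hidden = 352, 938  # assumed width; 938 ~ (2/3) * 4 * 352
```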
Value Residual
Adds a learned scalar multiple of the raw token embedding to each block output.
parameters: {"layers":10}
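The value-residual connection can be sketched as follows; `block_fn` and the per-layer `scale` are illustrative names, with `scale` standing in for the learned scalar:

```python
def block_with_value_residual(block_fn, x, raw_embed, scale):
    """Run a transformer block, then add a learned scalar multiple of the
    raw (pre-block-stack) token embedding to the block output. One learned
    scale per layer (10 layers in this PR)."""
    return block_fn(x) + scale * raw_embed
```

This keeps a direct path from the original token embeddings into every layer's output.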
Gated Attention
Adds a learned per-layer scalar gate on attention output.
parameters: {"layers":10}
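A sketch of the per-layer attention gate; the residual wiring and a gate initialized near 1 are assumptions, only the learned scalar gate itself is stated in the PR:

```python
def gated_attention_block(x, attn_fn, gate=1.0):
    """Residual add with a learned per-layer scalar gate scaling the
    attention output (one gate per layer, 10 layers here). gate=1.0
    recovers a plain residual attention block."""
    return x + gate * attn_fn(x)
```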
XSA
Exclusive Self Attention removes self-value bias from attention output via orthogonal projection in the last layers.
parameters: {"layers":4}
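One plausible reading of "removes self-value bias via orthogonal projection" is to subtract, from each token's attention output, its component along that token's own value vector. This is an interpretation, not a confirmed implementation:

```python
import numpy as np

def exclusive_self_attention_output(attn_out, v_self, eps=1e-8):
    """Project out the component of each token's attention output that lies
    along its own value vector, removing the self-value contribution.
    attn_out, v_self: (T, d) arrays. Applied only in the last 4 layers
    per the PR parameters."""
    coef = np.sum(attn_out * v_self, axis=-1, keepdims=True)
    norm = np.sum(v_self * v_self, axis=-1, keepdims=True) + eps
    return attn_out - (coef / norm) * v_self
```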
BigramHash
Uses hashed bigram token features as part of the input representation.
parameters: {"buckets":10240}
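A sketch of hashed bigram features with 10240 buckets; the specific hash-mixing constants are illustrative, only the bucket count comes from the PR:

```python
def bigram_hash_bucket(prev_tok, tok, num_buckets=10240):
    """Hash the (previous, current) token-id pair into one of num_buckets
    bigram feature buckets. The bucket id would index an embedding table
    whose rows are added to the input representation."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF   # combine the pair
    h ^= h >> 13                                  # mix high bits down
    h = (h * 2654435761) & 0xFFFFFFFF             # Knuth multiplicative mix
    return h % num_buckets
```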
Weight Averaging
SWA
parameters: {"decay":0.4}
EMA
parameters: {"decay":0.9999}
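The EMA weight average is the standard update, shown here with the PR's decay of 0.9999:

```python
def ema_update(ema_params, params, decay=0.9999):
    """Exponential moving average of weights, updated once per training
    step: ema <- decay * ema + (1 - decay) * current."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```

At evaluation time the EMA copy of the weights would be used in place of the raw training weights.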
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: {"ttt":true,"learning_rate":0.001,"betas":[0.9,0.999]}
Test-Time Training
AdamW TTT
parameters: {"epochs":3,"learning_rate":0.001,"betas":[0.9,0.999],"frozen_layers":6}
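The TTT pass uses standard AdamW (lr 0.001, betas 0.9/0.999) for 3 epochs with the first 6 layers frozen, per the parameters above. A single AdamW update can be sketched as (weight decay defaults to 0 here, since the PR lists it as null):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.0):
    """One AdamW update on parameter p with gradient g.
    m, v: first/second moment estimates; t: 1-based step count.
    During test-time training this would only touch the unfrozen layers."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```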
Initialization
OrthoInit
Orthogonal initialization.
Evaluation
sliding window eval
parameters: {"stride":64}
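A sketch of the window schedule for sliding-window evaluation with stride 64; the window length is a free parameter here, and the "score only the new tokens per window" convention is an assumption about the implementation:

```python
def sliding_windows(n_tokens, window, stride=64):
    """Yield (start, end, score_from) triples. Each window is scored only
    on tokens not covered by the previous window, so every scored token
    gets the longest context the window length allows."""
    prev_end = 0
    start = 0
    while prev_end < n_tokens:
        end = min(start + window, n_tokens)
        yield start, end, prev_end
        prev_end = end
        start += stride
```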
Compression
zstd
level: 22

Novel Contributions

  • SwiGLU MLP replacing ReLU² with iso-parameter hidden scaling
  • Value Residual connections from raw token embeddings into each transformer block
  • Per-layer gated attention output scaling
  • Exclusive Self Attention (XSA) in the last 4 layers
  • Exponential Moving Average (EMA) of weights during training
  • AdamW-based test-time training over validation tokens
  • Restoring full-size BigramHash(10240) and dropping TrigramHash