PR #430

open

Value Residual + Gated Attention + XSA + EMA + AdamW TTT — val_bpb pending H100

by sahiee-dev
val_bpb: 1.1428
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.9 MB

Training Techniques

Quantization
int5
bits: 5
scope: all
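The PR only states 5-bit quantization over all weights; the exact scheme (zero-point, grouping) is not given. A minimal sketch assuming per-tensor symmetric codes in [-15, 15]:

```python
import numpy as np

def quantize_int5(w):
    """Symmetric per-tensor 5-bit quantization (codes in -15..15).
    Returns integer codes plus the scale needed to dequantize."""
    max_abs = np.abs(w).max()
    scale = max_abs / 15.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale):
    """Recover approximate float weights from int5 codes."""
    return q.astype(np.float32) * scale
```

With this layout the worst-case per-weight reconstruction error is half a quantization step (scale / 2).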
Architecture
SwiGLU
Replaced ReLU² MLP activation with SwiGLU using iso-parameter 2/3 hidden scaling.
parameters: {"hidden":938}
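A sketch of the SwiGLU MLP described above. The hidden width 938 matches iso-parameter 2/3 scaling of a 4x MLP at d_model = 352 (938 ≈ (2/3)·4·352), but the model width is an assumption here, not stated in the PR:

```python
import numpy as np

def silu(z):
    """SiLU / swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, W_gate, W_up, W_down):
    """SwiGLU MLP: silu(x @ W_gate) gates (x @ W_up), then project down.
    Three weight matrices instead of two, so the hidden width is scaled
    by ~2/3 to keep parameter count equal to the ReLU^2 MLP it replaces."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_hidden = 352, 938  # assumed width; 938 ~ (2/3) * 4 * 352
```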
Value Residual
Adds a learned scalar multiple of the raw token embedding to each block output.
parameters: {"layers":10}
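The value-residual connection can be sketched as follows; `block_fn` and the per-layer `scale` are illustrative names, with `scale` standing in for the learned scalar:

```python
def block_with_value_residual(block_fn, x, raw_embed, scale):
    """Run a transformer block, then add a learned scalar multiple of the
    raw (pre-block-stack) token embedding to the block output. One learned
    scale per layer (10 layers in this PR)."""
    return block_fn(x) + scale * raw_embed
```

This keeps a direct path from the original token embeddings into every layer's output.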
Gated Attention
Adds a learned per-layer scalar gate on attention output.
parameters: {"layers":10}
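A sketch of the per-layer attention gate; the residual wiring and a gate initialized near 1 are assumptions, only the learned scalar gate itself is stated in the PR:

```python
def gated_attention_block(x, attn_fn, gate=1.0):
    """Residual add with a learned per-layer scalar gate scaling the
    attention output (one gate per layer, 10 layers here). gate=1.0
    recovers a plain residual attention block."""
    return x + gate * attn_fn(x)
```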
XSA
Exclusive Self Attention removes self-value bias from attention output via orthogonal projection in the last layers.
parameters: {"layers":4}
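One plausible reading of "removes self-value bias via orthogonal projection" is to subtract, from each token's attention output, its component along that token's own value vector. This is an interpretation, not a confirmed implementation:

```python
import numpy as np

def exclusive_self_attention_output(attn_out, v_self, eps=1e-8):
    """Project out the component of each token's attention output that lies
    along its own value vector, removing the self-value contribution.
    attn_out, v_self: (T, d) arrays. Applied only in the last 4 layers
    per the PR parameters."""
    coef = np.sum(attn_out * v_self, axis=-1, keepdims=True)
    norm = np.sum(v_self * v_self, axis=-1, keepdims=True) + eps
    return attn_out - (coef / norm) * v_self
```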
BigramHash
Uses hashed bigram token features as part of the input representation.
parameters: {"buckets":10240}
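A sketch of hashed bigram features with 10240 buckets; the specific hash-mixing constants are illustrative, only the bucket count comes from the PR:

```python
def bigram_hash_bucket(prev_tok, tok, num_buckets=10240):
    """Hash the (previous, current) token-id pair into one of num_buckets
    bigram feature buckets. The bucket id would index an embedding table
    whose rows are added to the input representation."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF   # combine the pair
    h ^= h >> 13                                  # mix high bits down
    h = (h * 2654435761) & 0xFFFFFFFF             # Knuth multiplicative mix
    return h % num_buckets
```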
Weight Averaging
SWA
parameters: {"decay":0.4}
EMA
parameters: {"decay":0.9999}
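The EMA weight average is the standard update, shown here with the PR's decay of 0.9999:

```python
def ema_update(ema_params, params, decay=0.9999):
    """Exponential moving average of weights, updated once per training
    step: ema <- decay * ema + (1 - decay) * current."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```

At evaluation time the EMA copy of the weights would be used in place of the raw training weights.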
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: {"ttt":true,"learning_rate":0.001,"betas":[0.9,0.999]}
Test-Time Training
AdamW TTT
parameters: {"epochs":3,"learning_rate":0.001,"betas":[0.9,0.999],"frozen_layers":6}
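The TTT pass uses standard AdamW (lr 0.001, betas 0.9/0.999) for 3 epochs with the first 6 layers frozen, per the parameters above. A single AdamW update can be sketched as (weight decay defaults to 0 here, since the PR lists it as null):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.0):
    """One AdamW update on parameter p with gradient g.
    m, v: first/second moment estimates; t: 1-based step count.
    During test-time training this would only touch the unfrozen layers."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```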
Initialization
OrthoInit
Orthogonal initialization.
Evaluation
sliding window eval
parameters: {"stride":64}
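A sketch of the window schedule for sliding-window evaluation with stride 64; the window length is a free parameter here, and the "score only the new tokens per window" convention is an assumption about the implementation:

```python
def sliding_windows(n_tokens, window, stride=64):
    """Yield (start, end, score_from) triples. Each window is scored only
    on tokens not covered by the previous window, so every scored token
    gets the longest context the window length allows."""
    prev_end = 0
    start = 0
    while prev_end < n_tokens:
        end = min(start + window, n_tokens)
        yield start, end, prev_end
        prev_end = end
        start += stride
```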
Compression
zstd
level: 22

Novel Contributions

  • SwiGLU MLP replacing ReLU² with iso-parameter hidden scaling
  • Value Residual connections from raw token embeddings into each transformer block
  • Per-layer gated attention output scaling
  • Exclusive Self Attention (XSA) in the last 4 layers
  • Exponential Moving Average (EMA) of weights during training
  • AdamW-based test-time training over validation tokens
  • Restoring full-size BigramHash(10240) and dropping TrigramHash