PR #430
Open
Value Residual + Gated Attention + XSA + EMA + AdamW TTT — val_bpb pending H100
by sahiee-dev
val_bpb
1.1428
Architecture
Transformer
Optimizer
Muon
Artifact Size
11.9MB
Training Techniques
Quantization
int5
bits: 5
scope: all
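A minimal sketch of symmetric 5-bit quantization as described above (bits: 5, applied to all weights). The per-tensor symmetric scaling is an assumption; the PR may use per-channel scales or a different rounding scheme.

```python
import numpy as np

def int5_quantize(w: np.ndarray):
    """Symmetric per-tensor 5-bit quantization (signed range -16..15)."""
    scale = np.abs(w).max() / 15.0          # map the largest magnitude to +/-15
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    return q, scale

def int5_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = int5_quantize(w)
w_hat = int5_dequantize(q, s)               # reconstruction error is at most scale/2
```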
Architecture
SwiGLU
Replaced ReLU² MLP activation with SwiGLU using iso-parameter 2/3 hidden scaling.
parameters: {"hidden":938}
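A sketch of the SwiGLU MLP with the 2/3 hidden scaling: SwiGLU uses three weight matrices instead of two, so the hidden width is scaled by 2/3 to keep parameter count constant. hidden=938 is from the PR; d_model=352 is a hypothetical model width consistent with 938 ≈ (2/3)·4·352.

```python
import numpy as np

def silu(x):
    return x * (1.0 / (1.0 + np.exp(-x)))   # x * sigmoid(x)

class SwiGLU:
    """SwiGLU MLP: (silu(x @ W_gate) * (x @ W_up)) @ W_down.
    hidden=938 from the PR; d_model=352 is an assumed model width."""
    def __init__(self, d_model=352, hidden=938, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w_gate = rng.standard_normal((d_model, hidden)) * d_model ** -0.5
        self.w_up   = rng.standard_normal((d_model, hidden)) * d_model ** -0.5
        self.w_down = rng.standard_normal((hidden, d_model)) * hidden ** -0.5

    def __call__(self, x):
        return (silu(x @ self.w_gate) * (x @ self.w_up)) @ self.w_down

mlp = SwiGLU()
y = mlp(np.zeros((4, 352)))
```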
Value Residual
Adds a learned scalar multiple of the raw token embedding to each block output.
parameters: {"layers":10}
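The value-residual connection as described can be sketched in one line: each block's output gets a learned per-layer scalar times the raw token embedding added back in. The scalar parameterization here is an assumption.

```python
import numpy as np

def block_with_value_residual(block_fn, x, token_emb, alpha):
    """Add a learned scalar multiple (alpha, one per layer) of the raw
    token embedding to the block output."""
    return block_fn(x) + alpha * token_emb

emb = np.ones((4, 8))
out = block_with_value_residual(lambda h: h * 2.0, emb, emb, alpha=0.5)
```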
Gated Attention
Adds a learned per-layer scalar gate on attention output.
parameters: {"layers":10}
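A sketch of the per-layer attention gate. Whether the PR squashes the learned scalar through a sigmoid (keeping the gate in (0, 1)) or uses it raw is an assumption.

```python
import numpy as np

def gated_attention_output(attn_out, gate):
    """Scale attention output by a learned per-layer scalar gate.
    The sigmoid squashing is an assumption, not confirmed by the PR."""
    g = 1.0 / (1.0 + np.exp(-gate))
    return g * attn_out
```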
XSA
Exclusive Self Attention removes self-value bias from attention output via orthogonal projection in the last layers.
parameters: {"layers":4}
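One reading of the description above, sketched below: for each token, remove the component of its attention output that lies along its own value vector, so the self-value direction is projected out. This is an interpretation of the one-line description, not the PR's exact implementation.

```python
import numpy as np

def exclusive_self_attention_fixup(attn_out, values, eps=1e-8):
    """Remove each token's own value direction from its attention output
    via orthogonal projection. attn_out, values: (seq, d). A sketch of
    the stated idea, not the PR's code."""
    v_norm = values / (np.linalg.norm(values, axis=-1, keepdims=True) + eps)
    coef = np.sum(attn_out * v_norm, axis=-1, keepdims=True)
    return attn_out - coef * v_norm
```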
BigramHash
Uses hashed bigram token features as part of the input representation.
parameters: {"buckets":10240}
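A sketch of the bucketing step: each (previous, current) token pair is hashed into one of 10240 buckets, which would index a learned embedding added to the input representation. The mixing constants and the padding of position 0 are illustrative, not the PR's exact hash.

```python
import numpy as np

def bigram_hash_features(tokens, buckets=10240):
    """Hash each (prev, cur) token pair into one of `buckets` bins.
    Constants are illustrative; position 0 is assumed padded with token 0."""
    prev = np.concatenate(([0], tokens[:-1]))
    return ((prev * 1000003 + tokens) * 2654435761) % buckets

idx = bigram_hash_features(np.array([5, 17, 17, 9]))
```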
Weight Averaging
SWA (Stochastic Weight Averaging)
parameters: {"decay":0.4}
EMA
parameters: {"decay":0.9999}
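The EMA update with decay 0.9999 is the standard shadow-weight recurrence, sketched here over a dict of scalar parameters for brevity:

```python
class EMA:
    """Exponential moving average of weights: shadow <- d*shadow + (1-d)*w.
    The PR uses decay=0.9999; scalars stand in for weight tensors here."""
    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * float(v)
```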
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: {"ttt":true,"learning_rate":0.001,"betas":[0.9,0.999]}
Test-Time Training
AdamW TTT
parameters: {"epochs":3,"learning_rate":0.001,"betas":[0.9,0.999],"frozen_layers":6}
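A minimal AdamW with decoupled weight decay, using the PR's TTT hyperparameters (lr=1e-3, betas=(0.9, 0.999)); per the parameters above, the first 6 layers would stay frozen while the rest train for 3 epochs. A single scalar on a toy quadratic loss stands in for the trainable tail here.

```python
import numpy as np

class AdamW:
    """Minimal scalar AdamW (decoupled weight decay), matching the PR's
    TTT hyperparameters: lr=1e-3, betas=(0.9, 0.999)."""
    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
        self.lr, (self.b1, self.b2) = lr, betas
        self.eps, self.wd = eps, weight_decay
        self.m = self.v = 0.0
        self.t = 0

    def step(self, w, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.wd * w)

# TTT loop sketch: 3 "epochs" of 100 hypothetical steps on a toy loss w^2.
w = 5.0
opt = AdamW(lr=1e-3)
for _ in range(3 * 100):
    grad = 2.0 * w                 # d/dw of the toy loss w^2
    w = opt.step(w, grad)
```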
Initialization
OrthoInit
Orthogonal initialization.
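Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix (Saxe et al. style); the gain and sign-fix details below are standard but the PR's exact variant is not stated.

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal init via QR of a Gaussian matrix. gain/seed are
    illustrative defaults, not the PR's settings."""
    rows, cols = shape
    a = np.random.default_rng(seed).standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))    # fix signs so the distribution is uniform
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]

W = orthogonal_init((4, 4))
```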
Evaluation
sliding window eval
parameters: {"stride":64}
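A sketch of how a stride-64 sliding-window eval would pick its scoring spans: slide the context window by the stride and score only the last `stride` tokens of each window, so every token is scored exactly once with (near-)full left context. `window=1024` is a hypothetical context length; only stride=64 comes from the PR.

```python
def sliding_window_positions(n_tokens, window=1024, stride=64):
    """Return (context_start, score_start, score_end) triples so that each
    token is scored once with up to `window` tokens of left context.
    window is an assumed context length; stride=64 is from the PR."""
    spans = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start + stride - window)
        spans.append((ctx_start, start, min(start + stride, n_tokens)))
        start += stride
    return spans
```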
Compression
zstd
level: 22
Novel Contributions
- SwiGLU MLP replacing ReLU² with iso-parameter hidden scaling
- Value Residual connections from raw token embeddings into each transformer block
- Per-layer gated attention output scaling
- Exclusive Self Attention (XSA) in the last 4 layers
- Exponential Moving Average (EMA) of weights during training
- AdamW-based test-time training over validation tokens
- Restoring full-size BigramHash(10240) and dropping TrigramHash