PR #849
openRecord: 11L Int5 + 6-Expert HedgeMixer + LeakyReLU(0.9)^2 + TTT (val_bpb=1.1105)
by dttdrv
val_bpb: 1.1105
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.95 MB
Training Techniques
Quantization
GPTQ
bits: 5
scope: all blocks
Architecture
MLP3x
3.5x MLP with LeakyReLU(0.9)^2 activation
parameters: {"hidden":1792}
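One plausible reading of the LeakyReLU(0.9)^2 activation (the exact composition — leak first, then square — and the 0.9 negative slope placement are assumptions, not confirmed by the record):

```python
def leaky_relu(x, slope=0.9):
    # LeakyReLU with negative slope 0.9
    return x if x >= 0.0 else slope * x

def act(x):
    # squared LeakyReLU: apply the leak, then square
    y = leaky_relu(x)
    return y * y
```

Squaring a leaky ReLU keeps the activation smooth near zero while still passing (attenuated) signal for negative inputs.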
tied embeddings
Input and output embeddings are tied
parameters: null
GatedAttention
Per-head learned scalar gate in attention
parameters: null
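A minimal sketch of the per-head scalar gate described above (the sigmoid parameterization is an assumption; the record only states that each head has a learned scalar gate):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_head_gates(head_outputs, gate_logits):
    # head_outputs: one output vector per attention head
    # gate_logits: one learned scalar per head; sigmoid keeps gates in (0, 1)
    return [[sigmoid(g) * v for v in h]
            for h, g in zip(head_outputs, gate_logits)]
```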
ValueResidual
Per-block learned x0 injection / residual value path
parameters: null
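The x0 injection above can be sketched as a learned convex mix between a block's own value projection and the first block's (the per-block scalar `lam` and the exact mixing form are assumptions):

```python
def value_residual(v, v0, lam):
    # lam: learned per-block scalar; v0: value projection from the first block (x0 path)
    return [lam * a + (1.0 - lam) * b for a, b in zip(v, v0)]
```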
XSA
XSA applied on all 11 layers
parameters: {"layers":11}
SmearGate
Additional gating mechanism used in the model
parameters: null
BigramHash
Hashed bigram feature embedding
parameters: {"size":8192,"dim":128}
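A hashed bigram lookup with the stated table size might look like the following (the hash mix itself is purely illustrative — the record does not specify it):

```python
def bigram_bucket(prev_byte, cur_byte, table_size=8192):
    # mix the byte pair into one of table_size buckets
    # (2654435761 is Knuth's multiplicative-hash constant; the mix is an assumption)
    h = ((prev_byte << 8) | cur_byte) * 2654435761 & 0xFFFFFFFF
    return h % table_size

def bigram_feature(prev_byte, cur_byte, table):
    # table: table_size rows of dim-128 embeddings, added to the token embedding
    return table[bigram_bucket(prev_byte, cur_byte)]
```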
ValueEmbedding
Value embedding used on later layers
parameters: {"dim":128,"layers":[9,10]}
Partial RoPE
Rotary positional embeddings applied partially
parameters: {"dimensions":"16/64"}
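With "16/64" dimensions, only the first 16 of each 64-dim head are rotated and the rest pass through unchanged. A sketch (frequency base 10000 is an assumption):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # rotate only the first rot_dims of the 64 head dims; pass the rest through
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```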
LN Scale
LayerNorm scaling modification
parameters: null
U-Net encoder-decoder
U-Net style encoder-decoder with learned skip weights
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"adamw_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_schedule":"0.92->0.99","warmup_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"warmup_steps":1500}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA","scale_threshold":0.2,"every_n_steps":50}
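The EMA half of the weight averaging is a standard exponential moving average of parameters, sketched below with the record's decay of 0.997:

```python
def ema_update(avg, weights, decay=0.997):
    # exponential moving average of model parameters, decay 0.997 as in the record
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]
```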
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
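One way to realize stride-64 sliding-window evaluation at context 2048 (the convention that each window scores only its final `stride` tokens is an assumption, but it is the usual one, ensuring every token is scored exactly once with near-full context):

```python
def sliding_window_spans(n_tokens, context=2048, stride=64):
    # each window sees up to `context` tokens but scores only its last `stride`
    # (the first window scores everything it covers)
    spans = []
    pos = 0  # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + (context if pos == 0 else stride), n_tokens)
        start = max(0, end - context)
        spans.append((start, end, pos))  # score tokens [pos, end)
        pos = end
    return spans
```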
Test-Time Training
score-first TTT
parameters: {"epochs":4,"optimizer":"AdamW","learning_rate":0.0005,"freeze_blocks":2,"byte_weighted":true,"polyak_averaging":0.998,"adaptive_cosine_lr":true}
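The adaptation loop of score-first TTT can be sketched as follows; `grad_fn` stands in for the byte-weighted loss gradient, and the plain-SGD update is a simplification (the record uses AdamW with an adaptive cosine LR):

```python
def ttt_adapt(params, grad_fn, lr=5e-4, epochs=4, polyak=0.998):
    # score-first: the test stream is scored with frozen weights before this
    # adaptation runs; only the weight update + Polyak averaging is shown
    avg = list(params)
    for _ in range(epochs):
        g = grad_fn(params)
        params = [p - lr * gi for p, gi in zip(params, g)]
        # Polyak-averaged copy of the adapted weights (decay 0.998 per the record)
        avg = [polyak * a + (1.0 - polyak) * p for a, p in zip(avg, params)]
    return params, avg
```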
Initialization
OrthoInit
Used with muP initialization scheme
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmdown_iters":3500}
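The warmup/warmdown parameters suggest a trapezoidal schedule; a sketch (`total_steps` and the linear ramps are assumptions — only the 1500/3500 step counts come from the record):

```python
def lr_at(step, peak_lr, total_steps, warmup_steps=1500, warmdown_iters=3500):
    # trapezoid: linear warmup, constant plateau, linear warmdown to zero
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step > total_steps - warmdown_iters:
        return peak_lr * (total_steps - step) / warmdown_iters
    return peak_lr
```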
Regularization
weight decay
parameters: {"value":0.04}
CROWN-Q
parameters: null
Late QAT soft-round STE
parameters: {"quantization_bits":5,"scale_threshold":0.5}
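The forward path of QAT fake-quantization to a signed 5-bit grid looks roughly like this (a hard round is shown for simplicity — the record's "soft-round" relaxes it during training, and gradients bypass the rounding via the straight-through estimator; the scale value is illustrative):

```python
def fake_quant(w, bits=5, scale=0.1):
    # round to the signed 5-bit grid and clip; during training, gradients
    # pass straight through the rounding (STE)
    q = round(w / scale)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -16 .. 15
    q = max(lo, min(hi, q))
    return q * scale
```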
Other
other
6-expert Hedge context mixer combining neural, unigram, bigram, trigram, 4-gram, and entropy experts
parameters: {"experts":6}
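The 6-expert mixer presumably follows the classic Hedge (multiplicative-weights) scheme: mix the experts' next-byte distributions by weight, then upweight experts that assigned the observed byte high probability. A sketch (the update rule and `eta` are assumptions):

```python
import math

def hedge_mix(expert_probs, weights):
    # combine per-expert next-byte distributions with normalized weights
    total = sum(weights)
    k = len(expert_probs[0])
    return [sum((w / total) * p[i] for w, p in zip(weights, expert_probs))
            for i in range(k)]

def hedge_update(weights, expert_probs, observed, eta=1.0):
    # multiplicative-weights update: experts that gave the observed byte
    # higher probability gain weight; eta is an assumed learning rate
    return [w * max(p[observed], 1e-9) ** eta
            for w, p in zip(weights, expert_probs)]
```

Here the six experts would be the neural model plus unigram, bigram, trigram, 4-gram, and entropy predictors named above.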
Novel Contributions
- 6-expert HedgeMixer context mixer
- LeakyReLU(0.9)^2 activation
- GatedAttention with per-head learned scalar gates
- ValueResidual and XSA across all 11 layers
- Partial RoPE with 16/64 dimensions
- BigramHash and ValueEmbedding features
- Late QAT soft-round with CROWN-Q regularization
- GPTQ int5 quantization with 3% pruning and zstd-22 compression
- Score-first TTT with byte-weighted loss and Polyak averaging