PR #267 (open)

Record: val_bpb 1.14020 [tested 3x on 8xH100]

by andrewgcodes
val_bpb: 1.1374
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,516,237 bytes

Training Techniques

Quantization
  • int5 (bits: 5; scope: all weights)
  • fp16 (bits: 16; scope: tied embeddings and last-layer key projections)
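A minimal sketch of 5-bit quantization, assuming a symmetric per-tensor scheme with levels in [-15, 15]; the PR does not state whether scales are per-tensor or per-channel.

```python
import numpy as np

def quantize_int5(w):
    """Symmetric 5-bit quantization: map floats to integers in [-15, 15]
    with a single per-tensor scale (scheme is an assumption)."""
    scale = max(float(np.abs(w).max()) / 15.0, 1e-12)
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale):
    """Recover approximate float weights from int5 codes."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half the scale, which is what makes 5 bits workable for most weight tensors while the sensitive ones stay in fp16.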
Architecture
  • XSA: exclusive self-attention applied to the last 3 layers by subtracting the self-value projection from the attention output; parameters: {"layers":3}
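One plausible reading of XSA, sketched under the assumption that "subtracting the self-value projection" means removing each token's own value contribution (the diagonal attention term) from the softmax output; the PR does not spell out the exact formulation.

```python
import numpy as np

def exclusive_self_attention(q, k, v):
    """Causal attention whose output drops each token's own value:
    out_i = sum_j a_ij v_j - a_ii v_i  (interpretation, not confirmed)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v - np.diag(a)[:, None] * v  # remove self-value contribution
```

Under this reading the first token, which can only attend to itself, produces a zero output, so each position is forced to aggregate information from its context rather than its own value.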
  • SmearGate: uses SmearGate in the architecture
  • tied embeddings: input and output embeddings are tied
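Weight tying in a nutshell: one matrix serves as both the input embedding table and the output projection, halving the embedding parameter count (which matters under an artifact-size limit). Shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 256, 64                      # illustrative sizes
E = rng.standard_normal((vocab, d_model)) * 0.02  # single shared matrix

def embed(token_ids):
    return E[token_ids]        # input lookup uses E

def lm_logits(hidden):
    return hidden @ E.T        # output head reuses the same E (tied)
```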
  • KV head count: grouped-query attention with 4 KV heads; parameters: {"kv_heads":4,"heads":8}
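With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, shrinking the KV projection weights. A minimal sketch of the head expansion (tensor sizes are illustrative):

```python
import numpy as np

heads, kv_heads, d_head, T = 8, 4, 16, 10   # T, d_head are illustrative
group = heads // kv_heads                   # 2 query heads per KV head

rng = np.random.default_rng(0)
k = rng.standard_normal((kv_heads, T, d_head))
# Expand 4 KV heads to serve 8 query heads by repeating each one `group` times
k_full = np.repeat(k, group, axis=0)        # shape (heads, T, d_head)
```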
  • MLP3x: MLP uses 3x expansion; parameters: {"hidden_size":1536}
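A sketch of the 3x-expansion MLP. The hidden size of 1536 implies a model width of 512 (an inference, not stated in the PR), and the ReLU² activation is an assumption borrowed from nanoGPT-style speedruns.

```python
import numpy as np

d_model = 512                 # assumed: 1536 / 3
hidden = 3 * d_model          # 1536, matching the recorded parameter
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, hidden)) * d_model ** -0.5
W2 = rng.standard_normal((hidden, d_model)) * hidden ** -0.5

def mlp(x):
    h = np.maximum(x @ W1, 0.0) ** 2   # ReLU^2; actual activation unspecified
    return h @ W2
```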
Optimizer
  • Muon (weight_decay: 0.08; momentum: 0.99; matrix_lr: 0.02)
  • AdamW (used for embeddings/scalars)
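Muon applies momentum and then approximately orthogonalizes each 2-D weight's update via a Newton-Schulz iteration, while AdamW handles embeddings and scalars. A sketch using the quintic coefficients from the public Muon implementation; the PR's exact variant may differ.

```python
import numpy as np

def newton_schulz(g, steps=5):
    """Approximately orthogonalize g (quintic Newton-Schulz; coefficients
    from the public Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    if g.shape[0] > g.shape[1]:
        return newton_schulz(g.T, steps).T   # work on the wide orientation
    x = g / (np.linalg.norm(g) + 1e-7)       # normalize spectral-ish scale
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

def muon_step(w, g, buf, lr=0.02, momentum=0.99, weight_decay=0.08):
    """One Muon update for a 2-D weight matrix, with decoupled weight decay."""
    buf = momentum * buf + g
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz(buf)
    return w, buf
```

After a few iterations the update's singular values cluster near 1, which is the point: every direction of the gradient gets a similarly sized step.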
Weight Averaging
  • EMA (decay: 0.997)
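The EMA update itself is one line per parameter: after each optimizer step, the averaged copy moves 0.3% of the way toward the current weights, and evaluation uses the averaged copy.

```python
def ema_update(avg, params, decay=0.997):
    """Exponential moving average of weights, applied after each step."""
    return {name: decay * avg[name] + (1.0 - decay) * params[name]
            for name in avg}
```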
Evaluation
  • sliding window eval (stride: 64)
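Sliding-window evaluation scores only the last `stride` tokens of each window, so every token is scored with a long left context instead of a context that resets at chunk boundaries. A sketch of the window generator, assuming the 2048-token train length as the context size:

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Yield (window_start, window_end, score_start): tokens in
    [score_start, window_end) are scored with up to `context` tokens
    of left context. stride=64 matches the recorded parameter."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        yield max(0, end - context), end, pos
        pos = end
```

The cost is roughly context/stride forward passes per token's worth of text, which is the usual trade for the lower bpb.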
Test-Time Training
  • score-first TTT (epochs_per_chunk: 12; chunks: 64; learning_rate: 0.004; momentum: 0.9)
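The control flow that keeps this causal: each chunk is scored with the current weights before any gradient step touches it, so the model only ever adapts on tokens that have already contributed to the reported loss. A sketch with `score` and `train` left abstract:

```python
def score_first_ttt(chunks, score, train, epochs_per_chunk=12):
    """Causal test-time training: evaluate each chunk first, then run
    `epochs_per_chunk` training passes on it before the next chunk."""
    total = 0.0
    for chunk in chunks:
        total += score(chunk)           # score BEFORE any update on chunk
        for _ in range(epochs_per_chunk):
            train(chunk)                # adapt only on already-scored tokens
    return total
```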
Initialization
  • OrthoInit: orthogonal initialization with scaled output projections
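A sketch of orthogonal init via QR; the output-projection gain of 1/sqrt(2·n_layers) is a common residual-scaling choice, not a rule the PR states.

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    """Orthogonal initialization: QR of a Gaussian matrix, sign-fixed so the
    decomposition is unique, then scaled by `gain`."""
    rng = np.random.default_rng(rng)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))        # fix QR's sign ambiguity
    return gain * q

n_layers = 12                        # hypothetical depth for the example
W_out = ortho_init((512, 512), gain=(2 * n_layers) ** -0.5)
```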
Sequence Length
  • train_length: 2048 (eval_length not specified)
LR Schedule
  • warmdown (warmdown_iters: 3000; warmup_steps: 20)
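The schedule is trapezoidal: 20 steps of linear warmup, a flat plateau, then a linear "warmdown" to zero over the final 3000 iterations. A sketch (the total step count is an assumption, not given in the PR):

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Multiplier on the base learning rate at `step` (0-indexed)."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # linear warmup
    if step >= total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)  # warmdown
    return 1.0                                      # flat plateau
```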
Regularization
  • weight decay (value: 0.08)
  • magnitude pruning (sparsity: 3%)
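Magnitude pruning at 3% zeros the smallest 3% of weights by absolute value; the zeros then compress very well under zstd. A minimal global (per-tensor) version; whether the PR prunes per-tensor or globally is not stated.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    """Zero the smallest-magnitude `sparsity` fraction of entries in w."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0   # ties may prune slightly more than k
    return out
```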
Compression
  • zstd (level not specified)
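Before zstd can act on the int5 weights, the 5-bit codes have to be packed into bytes. A sketch of little-endian 5-bit packing (the PR's actual serialization format is not described); the packed stream would then go through a zstd compressor such as the `zstandard` package.

```python
def pack_int5(values):
    """Pack signed 5-bit ints (range [-16, 15]) into bytes, LSB-first."""
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        bits |= (v & 0x1F) << nbits
        nbits += 5
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)      # flush remaining bits
    return bytes(out)

def unpack_int5(data, n):
    """Recover the first n signed 5-bit ints from a packed stream."""
    vals, bits, nbits = [], 0, 0
    for b in data:
        bits |= b << nbits
        nbits += 8
        while nbits >= 5 and len(vals) < n:
            u = bits & 0x1F
            vals.append(u - 32 if u >= 16 else u)  # sign-extend 5 bits
            bits >>= 5
            nbits -= 5
    return vals
```

Packing alone gives 5/16 of the fp16 size; zstd then exploits the pruning-induced zeros on top of that.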

Novel Contributions

  • Causal test-time training that evaluates each chunk first and trains only on already-scored tokens
  • Int5 quantization applied to all weight categories to fit the model under the artifact size limit
  • EMA weight averaging (decay 0.997) for an improved final model
  • Exclusive self-attention applied to the last 3 layers
  • Orthogonal initialization with scaled output projections
  • Sliding-window evaluation with stride 64
  • Post-quantization roundtrip using int5 + zstd