PR #484 (open)

Non-record: Empirical Bayes Adaptive TTT (val_bpb=1.1185)

by Robby955
val_bpb: 1.1185
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.81 MB

Training Techniques

Architecture
  • GEPA: attention mechanism used in the frontier architecture (parameters: null)
  • VE128: architecture component included in the base model (parameters: null)
  • XSA: cross/self-attention style modification applied to the last 4 layers (parameters: {"layers": 4})
  • SWA: sliding window attention used in the architecture (parameters: null)
  • Late Soft-Round QAT: late-stage quantization-aware training with soft rounding (parameters: null)
  • BigramHash: bigram hashing module for token representation (parameters: null)
  • SmearGate: gating mechanism used in the model (parameters: null)

Test-Time Training
  • score-first TTT with EB-adaptive per-layer scaling (parameters: {"freeze_embeddings": true, "burst_epochs": 2, "burst_lr_multiplier": 0.1, "layer_scale_formula": "clip(|E[grad_i]| / std(grad_i), 0.3, 3.0)"})

Weight Averaging
  • EMA (parameters: {"decay": 0.9985})

Compression
  • zstd (level: null)

Optimizer
  • SGD (weight_decay: null, momentum: null, other_params: null)

LR Schedule
  • warmdown (parameters: {"burst_then_sliding_window_ttt": true})
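The layer_scale_formula above, clip(|E[grad_i]| / std(grad_i), 0.3, 3.0), scales each layer's test-time update by a clipped gradient signal-to-noise ratio: layers whose gradient elements agree in direction (signal dominates spread) get amplified updates, noisy layers get damped. A minimal NumPy sketch; the function name and the zero-variance fallback are illustrative assumptions, not from the submission:

```python
import numpy as np

def eb_layer_scale(grad, lo=0.3, hi=3.0):
    """Per-layer TTT scale: clip(|E[grad_i]| / std(grad_i), lo, hi).

    A high mean-to-std ratio means the layer's gradient elements are
    consistent (signal), so its test-time update is amplified; a noisy
    layer is damped toward the lower clip bound.
    """
    g = np.asarray(grad, dtype=np.float64).ravel()
    std = g.std()
    if std == 0.0:
        # Illustrative fallback: a constant gradient has no spread,
        # so treat its SNR as maximal.
        return hi
    return float(np.clip(abs(g.mean()) / std, lo, hi))
```

For example, a gradient with zero mean (pure noise) would clip to the 0.3 floor, while a near-constant gradient saturates at the 3.0 ceiling.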

Novel Contributions

  • Empirical Bayes Adaptive Test-Time Training (EB-TTT): layerwise adaptive TTT scaling using the clipped per-layer gradient signal-to-noise ratio
  • Embedding freeze during TTT to prevent vocabulary embedding distortion
  • TTT burst with EMA before sliding-window TTT
  • Diagnostic for distinguishing genuine TTT adaptation from memorization
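The EMA weight-averaging entry (decay 0.9985, from the parameters above) is the standard exponential moving average kept alongside the raw weights during the TTT burst. A minimal sketch; the dict-of-parameters representation is an assumption for illustration:

```python
def ema_update(ema_params, params, decay=0.9985):
    """One EMA weight-averaging step: ema <- decay * ema + (1 - decay) * current.

    With decay 0.9985, the average moves ~0.15% toward the current
    weights per step, smoothing out noise from the short TTT burst.
    """
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in ema_params}
```

In practice the EMA copy would be updated after every burst optimizer step and used as the weights evaluated (and carried into the sliding-window TTT phase).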