PR #484 (open)

Non-record: Empirical Bayes Adaptive TTT (val_bpb=1.1185)

by Robby955
val_bpb: 1.1185
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.81 MB

Training Techniques

Architecture
  • GEPA: attention mechanism used in the frontier architecture (parameters: null)
  • VE128: architecture component included in the base model (parameters: null)
  • XSA: cross/self-attention style modification applied to the last 4 layers (parameters: {"layers": 4})
  • SWA: sliding window attention used in the architecture (parameters: null)
  • Late Soft-Round QAT: late-stage quantization-aware training with soft rounding (parameters: null)
  • BigramHash: bigram hashing module for token representation (parameters: null)
  • SmearGate: gating mechanism used in the model (parameters: null)

Test-Time Training
  • score-first TTT with EB-adaptive per-layer scaling (parameters: {"freeze_embeddings": true, "burst_epochs": 2, "burst_lr_multiplier": 0.1, "layer_scale_formula": "clip(|E[grad_i]| / std(grad_i), 0.3, 3.0)"})

Weight Averaging
  • EMA (parameters: {"decay": 0.9985})

Compression
  • zstd (level: null)

Optimizer
  • SGD (weight_decay: null, momentum: null, other_params: null)

LR Schedule
  • warmdown (parameters: {"burst_then_sliding_window_ttt": true})
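The layer_scale_formula above, clip(|E[grad_i]| / std(grad_i), 0.3, 3.0), scales each layer's test-time update by a clipped gradient signal-to-noise ratio: layers whose gradient elements agree in direction (signal dominates spread) get amplified updates, noisy layers get damped. A minimal NumPy sketch; the function name and the zero-variance fallback are illustrative assumptions, not from the submission:

```python
import numpy as np

def eb_layer_scale(grad, lo=0.3, hi=3.0):
    """Per-layer TTT scale: clip(|E[grad_i]| / std(grad_i), lo, hi).

    A high mean-to-std ratio means the layer's gradient elements are
    consistent (signal), so its test-time update is amplified; a noisy
    layer is damped toward the lower clip bound.
    """
    g = np.asarray(grad, dtype=np.float64).ravel()
    std = g.std()
    if std == 0.0:
        # Illustrative fallback: a constant gradient has no spread,
        # so treat its SNR as maximal.
        return hi
    return float(np.clip(abs(g.mean()) / std, lo, hi))
```

For example, a gradient with zero mean (pure noise) would clip to the 0.3 floor, while a near-constant gradient saturates at the 3.0 ceiling.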

Novel Contributions

  • Empirical Bayes Adaptive Test-Time Training (EB-TTT): layerwise adaptive TTT scaling using the clipped per-layer gradient signal-to-noise ratio
  • Embedding freeze during TTT to prevent vocabulary embedding distortion
  • TTT burst with EMA before sliding-window TTT
  • Diagnostic for distinguishing genuine TTT adaptation from memorization
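The EMA weight-averaging entry (decay 0.9985, from the parameters above) is the standard exponential moving average kept alongside the raw weights during the TTT burst. A minimal sketch; the dict-of-parameters representation is an assumption for illustration:

```python
def ema_update(ema_params, params, decay=0.9985):
    """One EMA weight-averaging step: ema <- decay * ema + (1 - decay) * current.

    With decay 0.9985, the average moves ~0.15% toward the current
    weights per step, smoothing out noise from the short TTT burst.
    """
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in ema_params}
```

In practice the EMA copy would be updated after every burst optimizer step and used as the weights evaluated (and carried into the sliding-window TTT phase).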