val_bpb: 1.1185
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.81 MB
Training Techniques
Architecture
- GEPA: Attention mechanism used in the frontier architecture. (parameters: null)
- VE128: Architecture component included in the base model. (parameters: null)
- XSA: Cross/self-attention style modification applied to the last 4 layers. (parameters: {"layers": 4})
- SWA: Sliding window attention used in the architecture. (parameters: null)
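The entry lists SWA without parameters, so the window size is unspecified. As a minimal sketch of what sliding window attention restricts, the function below builds a causal mask in which position i may attend only to the most recent `window` positions (the window size here is an illustrative assumption):

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window attention mask.

    mask[i][j] is True iff position i may attend to position j,
    i.e. j <= i (causal) and i - j < window (within the window).
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]
```

In a transformer this mask would be applied to the attention logits before the softmax, replacing the full causal mask in the designated layers.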
- Late Soft-Round QAT: Late-stage quantization-aware training with soft rounding. (parameters: null)
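The entry does not specify which soft-rounding function is used. One common differentiable surrogate for hard rounding in quantization-aware training (due to Agustsson and Theis) sharpens toward `round()` as a temperature parameter alpha grows; the sketch below assumes that formulation:

```python
import math

def soft_round(x, alpha):
    """Differentiable soft rounding: approaches round(x) as alpha -> inf.

    s(x) = m + 0.5 * tanh(alpha * r) / tanh(alpha / 2),
    where m is the nearest half-integer below x and r = x - m.
    Integers and half-integers are exact fixed points.
    """
    m = math.floor(x) + 0.5
    r = x - m
    return m + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0)
```

During late-stage QAT, weights passed through such a function can be trained with ordinary backprop while converging toward the quantized grid.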
- BigramHash: Bigram hashing module for token representation. (parameters: null)
- SmearGate: Gating mechanism used in the model. (parameters: null)
- Test-Time Training: Score-first TTT with EB-adaptive per-layer scaling. (parameters: {"freeze_embeddings": true, "burst_epochs": 2, "burst_lr_multiplier": 0.1, "layer_scale_formula": "clip(|E[grad_i]| / std(grad_i), 0.3, 3.0)"})
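The per-layer scale formula given in the parameters, clip(|E[grad_i]| / std(grad_i), 0.3, 3.0), can be sketched directly: each layer's TTT update is scaled by its gradient signal-to-noise ratio, clipped to [0.3, 3.0]. The epsilon guard is an implementation assumption, added to handle a zero-variance gradient:

```python
import numpy as np

def layer_scale(grads, lo=0.3, hi=3.0, eps=1e-12):
    """Per-layer TTT scale: clip(|E[grad_i]| / std(grad_i), lo, hi).

    A high mean relative to the spread (high SNR) suggests a consistent
    adaptation signal, so that layer's update is scaled up; noisy layers
    are scaled down. eps avoids division by zero for constant gradients.
    """
    g = np.asarray(grads, dtype=np.float64)
    snr = abs(g.mean()) / (g.std() + eps)
    return float(np.clip(snr, lo, hi))
```

In the TTT loop this scale would multiply the burst learning rate (itself 0.1x the base rate per `burst_lr_multiplier`) independently per layer, with embeddings held frozen per `freeze_embeddings`.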
- Weight Averaging: EMA. (parameters: {"decay": 0.9985})
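EMA weight averaging with decay 0.9985 keeps a shadow copy of the parameters that tracks a smoothed trajectory of training. A minimal per-parameter sketch:

```python
def ema_update(avg_params, new_params, decay=0.9985):
    """One EMA step: avg <- decay * avg + (1 - decay) * new.

    With decay = 0.9985 the average has an effective horizon of roughly
    1 / (1 - decay) ~ 667 recent steps.
    """
    return [decay * a + (1.0 - decay) * n
            for a, n in zip(avg_params, new_params)]
```

The averaged weights, not the raw training weights, would typically be the ones evaluated and exported.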
- Compression: zstd. (level: null)
- Optimizer: SGD. (weight_decay: null, momentum: null, other_params: null)
- LR Schedule: warmdown. (parameters: {"burst_then_sliding_window_ttt": true})
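The entry names the schedule "warmdown" without giving its shape. A common reading is a constant learning rate followed by a linear decay to zero over the final fraction of training; the sketch below assumes that shape, and the 30% warmdown fraction is an illustrative assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.3):
    """Constant LR, then linear warmdown to zero.

    The LR stays at base_lr until (1 - warmdown_frac) of training has
    elapsed, then decays linearly to zero at total_steps.
    """
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```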
Novel Contributions
- Empirical Bayes Adaptive Test-Time Training (EB-TTT): layerwise adaptive scaling using a clipped per-layer gradient signal-to-noise ratio
- Embedding freeze during TTT to prevent vocabulary embedding distortion
- TTT burst with EMA before sliding-window TTT
- Diagnostic for distinguishing genuine TTT adaptation from memorization