PR #825 (open)

Record: Order-Adaptive BackoffMixer (mean val_bpb=0.5440)

by hypery11
val_bpb: 0.5440
Architecture: 11-layer Transformer
Optimizer:
Artifact Size: 16.0 MB

Training Techniques

Architecture
  • XSA-all: uses the XSA-all attention mechanism in the transformer. parameters: null
  • MLP3.5x: MLP hidden width expanded to 3.5x. parameters: {"mlp_multiplier":3.5}
  • LeakyReLU: uses a squared LeakyReLU(0.5) activation, i.e. LeakyReLU(0.5)^2. parameters: {"negative_slope":0.5,"power":2}
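The activation is LeakyReLU with negative slope 0.5, raised to the power 2, per the recorded parameters. A minimal sketch (the function name is mine, not from the PR):

```python
def leaky_relu_squared(x, negative_slope=0.5, power=2):
    # LeakyReLU: identity for x >= 0, negative_slope * x otherwise...
    y = x if x >= 0 else negative_slope * x
    # ...then raised to `power`, matching {"negative_slope":0.5,"power":2}
    return y ** power
```

Note that with power=2 the output is non-negative everywhere; the negative slope only changes how quickly negative inputs grow after squaring.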
Quantization
  • int5: bits: 5, scope: all
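"int5, scope: all" suggests 5-bit signed quantization of every weight tensor. The PR does not specify the scheme, so this is a hedged sketch assuming symmetric per-tensor quantization into the signed 5-bit range [-16, 15]:

```python
def quantize_int5(weights, qmax=15):
    """Symmetric per-tensor quantization to signed 5-bit codes in [-16, 15]."""
    absmax = max((abs(w) for w in weights), default=0.0)
    scale = absmax / qmax if absmax > 0 else 1.0
    # round to the nearest code, then clamp to the int5 range
    return [max(-16, min(15, round(w / scale))) for w in weights], scale

def dequantize_int5(codes, scale):
    """Recover approximate float weights from the int5 codes."""
    return [c * scale for c in codes]
```

At 5 bits per weight plus a scale per tensor, this is consistent with the small 16.0 MB artifact size.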
Weight Averaging
  • EMA: parameters: null
  • SWA: parameters: {"type":"Tight SWA"}
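EMA and SWA both maintain a running average of the model weights. What "Tight SWA" means here is not specified, so the sketch below covers only the generic per-parameter update rules:

```python
def ema_update(avg, weights, decay=0.999):
    """One EMA step per parameter: avg <- decay * avg + (1 - decay) * w."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

def swa_update(avg, weights, n_averaged):
    """One SWA step: running mean over the n_averaged checkpoints seen so far."""
    return [(a * n_averaged + w) / (n_averaged + 1) for a, w in zip(avg, weights)]
```

EMA weights recent checkpoints more heavily; SWA weights all averaged checkpoints equally.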
Evaluation
  • order-adaptive entropy-gated BackoffNgramMixer: parameters: {"orders":"2-7 gram","per_order_entropy_thresholds":true,"score_first":true,"backward_looking":true,"deterministic":true}
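The record lists the mixer's knobs but not its code. The following is a hypothetical reconstruction of an order-adaptive, entropy-gated backoff over 2-7-gram counts: start at the highest order with context support, and back off while the predicted distribution's entropy exceeds that order's threshold. The class name, threshold values, and the exact back-off rule are assumptions:

```python
import math
from collections import defaultdict

class BackoffNgramMixer:
    """Hypothetical sketch of an order-adaptive, entropy-gated n-gram backoff."""

    def __init__(self, orders=range(2, 8), thresholds=None):
        self.orders = sorted(orders, reverse=True)  # try highest order first
        # per-order entropy thresholds (nats); values here are placeholders
        self.thresholds = thresholds or {n: 1.5 for n in self.orders}
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, seq):
        # accumulate (context, next-token) counts for every order
        for n in self.orders:
            for i in range(len(seq) - n + 1):
                ctx, nxt = tuple(seq[i:i + n - 1]), seq[i + n - 1]
                self.counts[n][ctx][nxt] += 1

    @staticmethod
    def _entropy(dist):
        return -sum(p * math.log(p) for p in dist.values() if p > 0)

    def predict(self, context):
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            bucket = self.counts[n].get(ctx)
            if not bucket:
                continue  # no support at this order: back off
            total = sum(bucket.values())
            dist = {t: c / total for t, c in bucket.items()}
            if self._entropy(dist) <= self.thresholds[n]:
                return dist  # confident enough at this order
        return None  # defer entirely to the base model
```

Everything here is count-based and tie-free, consistent with the "deterministic":true flag; mixing the returned distribution with the transformer's output is left out of the sketch.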
Test-Time Training
  • score-first TTT: parameters: {"backward_looking":true}
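"Score-first, backward-looking" plausibly means each token is scored under the model's current state and only afterwards is the model adapted on it, so scoring never leaks information from the token being scored. A toy sketch under that assumption, with an add-one unigram model standing in for the transformer:

```python
import math

class AdaptiveUnigram:
    """Toy stand-in model: add-one-smoothed unigram over a small vocabulary."""
    def __init__(self, vocab=256):
        self.counts = [1] * vocab  # Laplace smoothing
        self.total = vocab

    def loss(self, context, tok):
        # negative log-likelihood of `tok` in bits under the current counts
        return -math.log2(self.counts[tok] / self.total)

    def update(self, context, tok):
        self.counts[tok] += 1
        self.total += 1

def score_first_ttt(model, tokens):
    """Score-first, backward-looking TTT loop (hypothetical reconstruction)."""
    bits = 0.0
    for i, tok in enumerate(tokens):
        bits += model.loss(tokens[:i], tok)  # score with the current state first...
        model.update(tokens[:i], tok)        # ...then adapt on the scored token
    return bits / max(len(tokens), 1)        # mean bits per token
```

The ordering matters for a fair val_bpb: updating before scoring would let the model peek at the token it is about to be evaluated on.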
Compression
  • custom: level: null

Novel Contributions

  • Order-adaptive entropy-gated BackoffNgramMixer
  • Per-order entropy thresholds for mixing weight selection
  • Score-first, backward-looking, deterministic evaluation strategy
  • 11-layer transformer with XSA-all and full MHA
  • int5 quantization with compression
  • EMA and Tight SWA training recipe