val_bpb: 0.5440
Architecture: 11-layer Transformer
Optimizer: —
Artifact Size: 16.0 MB
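val_bpb is validation bits-per-byte. As a reminder of how the metric is defined (a standard conversion, not code from this submission):

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    # Convert a summed negative log-likelihood (in nats) to bits,
    # then normalize by the number of raw input bytes.
    return total_nll_nats / (num_bytes * math.log(2))
```

Lower is better; 0.5440 bpb means the model needs about 0.544 bits on average to encode each byte of the validation set.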
Training Techniques
Architecture
- XSA-all: uses the XSA-all attention mechanism in the transformer (parameters: null)
- MLP3.5x: expands the MLP hidden width by a factor of 3.5 (parameters: {"mlp_multiplier": 3.5})
- LeakyReLU: uses a LeakyReLU(0.5)^2 activation, i.e. LeakyReLU with negative slope 0.5, then squared (parameters: {"negative_slope": 0.5, "power": 2})
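The LeakyReLU(0.5)^2 activation admits a direct reading from its parameters: LeakyReLU with negative slope 0.5, then raised to the power 2. A minimal sketch under that assumption:

```python
def squared_leaky_relu(x: float, negative_slope: float = 0.5, power: int = 2) -> float:
    # LeakyReLU: identity for x >= 0, negative_slope * x otherwise;
    # then raise the result to the given power (2 here).
    y = x if x >= 0 else negative_slope * x
    return y ** power
```

With power 2 the output is nonnegative and smooth at zero, similar in spirit to the squared-ReLU activations used in some recent transformer recipes.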
Quantization
- int5: 5-bit integer quantization (bits: 5, scope: all)
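The exact int5 scheme is not specified; a common symmetric per-tensor variant looks like the following (the scale choice and rounding rule are assumptions, not details from this submission):

```python
def quantize_int5(weights: list[float]) -> tuple[list[int], float]:
    # Symmetric 5-bit quantization: integers in [-15, 15], one scale per tensor.
    qmax = 15
    max_abs = max(abs(w) for w in weights)
    scale = (max_abs / qmax) or 1.0  # avoid division by zero for an all-zero tensor
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    # Reconstruct approximate float weights from the integer codes.
    return [v * scale for v in q]
```

Storing 5-bit codes plus a scale per tensor is what lets the whole artifact fit in 16.0 MB after compression.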
Weight Averaging
- EMA: exponential moving average of model weights (parameters: null)
- SWA: stochastic weight averaging (parameters: {"type": "Tight SWA"})
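EMA maintains a decayed running average of the weights alongside training and evaluates with the averaged copy. A minimal sketch (the decay value is an assumed hyperparameter, not reported here):

```python
class WeightEMA:
    def __init__(self, weights: list[float], decay: float = 0.999):
        self.decay = decay
        self.shadow = list(weights)  # averaged copy, used at evaluation time

    def update(self, weights: list[float]) -> None:
        # shadow <- decay * shadow + (1 - decay) * current weights
        self.shadow = [self.decay * s + (1.0 - self.decay) * w
                       for s, w in zip(self.shadow, weights)]
```

SWA differs in averaging checkpoints uniformly over a window rather than exponentially; "Tight SWA" presumably narrows that window, but its exact definition is not given here.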
Evaluation
- Order-adaptive entropy-gated BackoffNgramMixer (parameters: {"orders": "2-7 gram", "per_order_entropy_thresholds": true, "score_first": true, "backward_looking": true, "deterministic": true})
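The mixer itself is the novel contribution and its internals are not given; the parameters suggest roughly this shape: try the highest-order n-gram model first and back off to lower orders whenever the predicted distribution's entropy exceeds that order's threshold. A hypothetical sketch (the names and the exact backoff rule are assumptions):

```python
import math

def entropy(dist: dict) -> float:
    # Shannon entropy in bits of a {token: probability} distribution.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def backoff_predict(context, models, thresholds):
    # models: {order: {context_tuple: {token: prob}}}, e.g. orders 2..7.
    # Back off from the highest order; accept the first distribution whose
    # entropy is at or below that order's threshold (per-order gating).
    for order in sorted(models, reverse=True):
        ctx = tuple(context[-(order - 1):]) if order > 1 else ()
        dist = models[order].get(ctx)
        if dist is not None and entropy(dist) <= thresholds[order]:
            return dist, order
    return None, 0  # no order was confident enough for this context
```

Because the gate is a deterministic threshold on entropy, the whole procedure is reproducible, matching the "deterministic": true parameter.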
Test-Time Training
- Score-first TTT (parameters: {"backward_looking": true})
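"Score-first" together with "backward_looking" suggests an evaluation loop in which each token is scored before the model adapts on it, so no token ever influences its own score. A sketch under that reading (the interface is assumed, not taken from this submission):

```python
def score_first_ttt(model, tokens, adapt):
    # For each position: score with the current model first, then let the
    # model train on the now-observed prefix (backward-looking only).
    total = 0.0
    for i, tok in enumerate(tokens):
        total += model.score(tokens[:i], tok)  # score before any update on tok
        adapt(model, tokens[:i + 1])           # adapt only on already-scored data
    return total
```

This ordering keeps the evaluation causally valid: the update at step i can affect scores at steps i+1 onward, but never the score already assigned to token i.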
Compression
- custom (level: null)
Novel Contributions
- Order-adaptive entropy-gated BackoffNgramMixer
- Per-order entropy thresholds for mixing weight selection
- Score-first, backward-looking, deterministic evaluation strategy
- 11-layer transformer with XSA-all and full MHA
- int5 quantization with compression
- EMA and Tight SWA training recipe