PR #798

open

Record: Order-Adaptive Entropy Gating + BackoffNgramMixer (val_bpb=0.5466)

by travispchen
val_bpb
0.5466
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
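The record lists only the bit width (6) and scope (all weights). A minimal sketch of symmetric per-tensor quantization to a signed 6-bit grid; the rounding and clipping choices are assumptions:

```python
import numpy as np

def quantize_int6(w, bits=6):
    # symmetric per-tensor quantization: scale so the largest |w| maps to qmax
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.03, 0.99], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

The int6 values are stored in int8 containers here; a real artifact would bit-pack them to reach the ~16 MB size above.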
Architecture
XSA
Uses XSA in all 11 layers as part of the base model stack.
parameters: {"layers":11}
MLP3x
Three MLP blocks with a squared LeakyReLU activation (negative slope 0.5), i.e. LeakyReLU(0.5)^2.
parameters: {"mlp_blocks":3}
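The activation reads as the elementwise square of a LeakyReLU with negative slope 0.5. A sketch of one block; the expand/project wiring is an assumption, as the record stores only mlp_blocks=3:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) followed by an elementwise square
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp_block(x, w_in, w_out):
    # one of the three MLP blocks: expand, activate, project back
    return leaky_relu_sq(x @ w_in) @ w_out
```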
GQA
Grouped-query attention configured with kv_heads equal to query_heads, i.e. no grouping (equivalent to standard multi-head attention).
parameters: {"kv_heads":8,"query_heads":8}
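With 8 KV heads for 8 query heads the query-to-KV mapping is the identity; the sketch below shows the general grouping rule this setting degenerates from (the function name is illustrative):

```python
def kv_head_for(query_head: int, query_heads: int = 8, kv_heads: int = 8) -> int:
    """Map a query head to the KV head it shares. With the PR's setting
    (8 query heads, 8 KV heads) this is the identity, i.e. plain MHA."""
    assert query_heads % kv_heads == 0
    return query_head // (query_heads // kv_heads)
```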
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adamw_for":"embeddings/scalars"}
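Muon's defining step is an approximate orthogonalization of the 2-D momentum matrices via a Newton-Schulz iteration, with AdamW covering embeddings and scalars as recorded above. A sketch using the commonly published quintic coefficients, which are an assumption here:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # approximately orthogonalize G (drive its singular values toward 1);
    # the quintic coefficients below are the widely used published values
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)              # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                                     # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```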
Weight Averaging
EMA + SWA
parameters: null
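The two averaging schemes can be sketched as follows; the EMA decay is an assumption, since the record stores no parameters:

```python
def ema_update(avg, new, decay=0.999):
    # exponential moving average: recent weights dominate
    return [decay * a + (1 - decay) * n for a, n in zip(avg, new)]

def swa_update(avg, new, n_seen):
    # stochastic weight averaging: equal-weight mean over n_seen checkpoints
    return [a + (n - a) / (n_seen + 1) for a, n in zip(avg, new)]
```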
Compression
lzma
level: null
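A minimal sketch of packing the artifact with Python's stdlib lzma; the preset is an assumption, since the record leaves the level unspecified:

```python
import lzma
import pickle

def compress_artifact(state: dict, preset: int = 9) -> bytes:
    # serialize, then LZMA-compress; preset=9 (maximum) is an assumed level
    return lzma.compress(pickle.dumps(state), preset=preset)

def load_artifact(blob: bytes) -> dict:
    return pickle.loads(lzma.decompress(blob))

state = {"w": [0.0] * 10000}        # toy parameter dict, highly compressible
blob = compress_artifact(state)
```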
Evaluation
sliding window eval
parameters: null
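No parameters are recorded; a generic sketch of what sliding-window evaluation means, with the window size as an assumption and prob_fn a hypothetical stand-in for the model:

```python
import math

def sliding_window_bpb(prob_fn, tokens, window=512):
    # score each token given at most the previous window-1 tokens,
    # keeping evaluation cost bounded on arbitrarily long sequences
    total_bits = 0.0
    for i, tok in enumerate(tokens):
        ctx = tokens[max(0, i - window + 1):i]
        total_bits += -math.log2(prob_fn(ctx, tok))
    return total_bits / len(tokens)
```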
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.00003,"epochs":1,"chunk_tokens":1000000,"freeze_blocks":2,"polyak_decay":0.998}
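Score-first means each chunk is scored before the model adapts on it, so no chunk is evaluated with weights that have already seen it. A sketch wiring the recorded hyperparameters together; loss_fn and grad_fn are hypothetical stand-ins for the model, and parameter names are assumed to start with a block index (splitting the ~1M-token chunks happens upstream):

```python
def score_first_ttt(model_params, chunks, grad_fn, loss_fn,
                    lr=3e-5, freeze_blocks=2, polyak=0.998):
    live = dict(model_params)   # weights updated by TTT
    avg = dict(model_params)    # Polyak-averaged weights used for scoring
    total = 0.0
    for chunk in chunks:
        total += loss_fn(avg, chunk)                  # score FIRST ...
        for name, g in grad_fn(live, chunk).items():  # ... then adapt
            if int(name.split(".")[0]) < freeze_blocks:
                continue                              # early blocks stay frozen
            live[name] = live[name] - lr * g
            avg[name] = polyak * avg[name] + (1 - polyak) * live[name]
    return total / len(chunks), avg
```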
Regularization
pruning
parameters: {"sparsity":0.03,"type":"magnitude"}
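Magnitude pruning at 3% sparsity zeroes the smallest 3% of weights by absolute value; a minimal per-tensor sketch:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    # zero out the smallest-magnitude `sparsity` fraction of entries
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```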
Other
other
Backoff n-gram mixer with entropy-adaptive alpha mixing across orders 2-7.
parameters: {"orders":[2,3,4,5,6,7]}
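A sketch of the mixing rule: each order's n-gram distribution is blended into the running prediction with its gate weight alpha, ascending so that higher orders take precedence when their gates open; the PR's exact backoff scheme may differ:

```python
def backoff_mix(ngram_probs, neural_probs, alphas):
    # start from the neural distribution, then fold in each n-gram order;
    # higher orders are mixed in last, so they dominate when gated open
    p = list(neural_probs)
    for order in sorted(ngram_probs):
        a = alphas[order]
        p = [(1 - a) * pi + a * qi for pi, qi in zip(p, ngram_probs[order])]
    return p
```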
other
Order-adaptive entropy gating using per-order entropy centers for n-gram mixing.
parameters: {"entropy_centers":{"2":4.5,"3":4.2,"4":3.8,"5":3.5,"6":3.2,"7":3}}
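The record stores only the per-order entropy centers. One plausible reading, sketched below, gates each order's alpha by how close its predictive entropy is to that order's center; max_alpha and the Gaussian width are assumptions:

```python
import math

ENTROPY_CENTERS = {2: 4.5, 3: 4.2, 4: 3.8, 5: 3.5, 6: 3.2, 7: 3.0}

def entropy(p):
    # Shannon entropy in bits
    return -sum(x * math.log2(x) for x in p if x > 0)

def order_alpha(p_ngram, order, max_alpha=0.5, width=2.0):
    # gate opens (alpha -> max_alpha) when the order's predictive entropy
    # sits near that order's recorded center
    h = entropy(p_ngram)
    return max_alpha * math.exp(-((h - ENTROPY_CENTERS[order]) / width) ** 2)
```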
other
Drift-free test-time training with a logistic context mixer.
parameters: {"eta":0.1}
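A sketch of a logistic (log-odds) context mixer with the recorded eta=0.1; "drift-free" is read here as the base model staying frozen while only the mixer's weights adapt online. Binary-outcome form for brevity:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    # mix expert probabilities in log-odds space; only these weights adapt
    # at test time, so the base model's parameters never drift
    def __init__(self, n_experts, eta=0.1):
        self.w = [1.0 / n_experts] * n_experts
        self.eta = eta

    def predict(self, probs):
        eps = 1e-6
        z = sum(wi * math.log((pi + eps) / (1 - pi + eps))
                for wi, pi in zip(self.w, probs))
        return sigmoid(z)

    def update(self, probs, outcome):
        # online gradient step on log loss for the mixing weights
        p = self.predict(probs)
        eps = 1e-6
        for i, pi in enumerate(probs):
            self.w[i] += self.eta * (outcome - p) * math.log((pi + eps) / (1 - pi + eps))
        return p
```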

Novel Contributions

  • Order-adaptive entropy gating with per-n-gram-order entropy centers
  • BackoffNgramMixer combining n-gram predictions with neural predictions
  • Drift-free score-first test-time training
  • Entropy-adaptive alpha mixing across n-gram orders