PR #811
openRecord: Complementary Training + Backoff N-gram Mixer — 0.4377 BPB
by quietsmile
val_bpb
0.4377
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.9MB
Training Techniques
Architecture
XSA
Uses XSA on the last 4 layers.
parameters: {"layers":4}
MLP3x
MLP with 3x hidden width and a squared LeakyReLU activation (negative slope 0.5).
parameters: null
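A minimal numpy sketch of this MLP block, assuming "LeakyReLU(0.5)^2" means the LeakyReLU output is squared (function names and shapes are illustrative, not the record's code):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: y = leaky_relu(x)^2.
    Negatives are scaled by `slope` before squaring, so
    leaky_relu_sq(-2) = (-1)^2 = 1."""
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp3x(x, w1, w2):
    """Two-layer MLP whose hidden width is 3x d_model
    (vs. the conventional 4x), with the squared-LeakyReLU nonlinearity."""
    return leaky_relu_sq(x @ w1) @ w2
```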
KV head count
Uses 4 KV heads with 8 attention heads.
parameters: {"heads":8,"kv_heads":4}
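Using fewer KV heads than query heads is grouped-query attention: with 8 query heads and 4 KV heads, each KV head serves 2 query heads. A numpy sketch of the mechanism (single-sequence shapes for brevity; not the record's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Causal grouped-query attention.
    q: (T, n_heads*d); k, v: (T, n_kv_heads*d) -- the KV projections
    are half the size, which shrinks the KV cache and weights."""
    T = q.shape[0]
    d = q.shape[1] // n_heads
    q = q.reshape(T, n_heads, d).transpose(1, 0, 2)     # (8, T, d)
    k = k.reshape(T, n_kv_heads, d).transpose(1, 0, 2)  # (4, T, d)
    v = v.reshape(T, n_kv_heads, d).transpose(1, 0, 2)
    rep = n_heads // n_kv_heads          # 2 query heads per KV head
    k = np.repeat(k, rep, axis=0)        # broadcast KV to 8 heads
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)      # (8, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # causal mask
    scores = np.where(mask, -1e9, scores)
    out = softmax(scores) @ v                           # (8, T, d)
    return out.transpose(1, 0, 2).reshape(T, n_heads * d)
```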
Quantization
mixed int6
bits: 6
scope: model weights
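For reference, a symmetric per-tensor int6 round-trip (the record says "mixed" int6, so the actual scheme likely varies per tensor; this shows only the basic 6-bit quantize/dequantize step):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: codes live in [-32, 31].
    Per-tensor scale maps the largest magnitude to 31."""
    amax = np.abs(w).max()
    scale = amax / 31.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```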
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.0005}
Weight Averaging
EMA
parameters: {"decay":0.998}
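The EMA update itself is one line per parameter; with decay 0.998 the average has an effective horizon of roughly 500 steps. A sketch over a dict of weights:

```python
def ema_update(ema_params, params, decay=0.998):
    """One EMA step: ema <- decay * ema + (1 - decay) * w,
    applied to every named parameter. The EMA copy, not the raw
    weights, is what gets evaluated/shipped."""
    for name in params:
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * params[name]
    return ema_params
```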
Evaluation
stride-based eval
parameters: {"stride":128}
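Stride-based eval slides the context window forward 128 tokens at a time and scores only the tokens not covered by the previous window, so every token is scored exactly once with near-full left context. A sketch of the window schedule (context length 1024 is an assumed example value):

```python
def stride_eval_windows(n_tokens, context=1024, stride=128):
    """Return (start, end, n_scored) windows for strided evaluation.
    Each window scores only its last `n_scored` tokens (the first
    window scores everything), covering all n_tokens exactly once."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```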
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0005,"epochs":4,"freeze_blocks":2,"temperature":0.98}
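"Score-first" presumably means each chunk is scored before any update that has seen it, keeping the eval legal (no token's score reflects training on that token). A hypothetical sketch of that loop order, with plain gradient descent standing in for the record's AdamW and the temperature knob not modeled:

```python
def score_first_ttt(params, chunks, loss_and_grad, lr=5e-4, epochs=4,
                    freeze_blocks=2):
    """Score-first test-time training (assumed structure).
    For each chunk: record its loss under the current weights FIRST,
    then adapt the unfrozen parameters on it for `epochs` passes.
    The first `freeze_blocks` blocks are never updated."""
    frozen = {f"block{i}" for i in range(freeze_blocks)}
    scores = []
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        scores.append(loss)                  # scored before any update
        for _ in range(epochs):
            _, grads = loss_and_grad(params, chunk)
            for name, g in grads.items():
                if name not in frozen:       # freeze early blocks
                    params[name] -= lr * g
    return scores
```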
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Complementary training with bigram-weighted loss reweighting to focus learning on harder tokens.
parameters: {"complement_alpha":0.5}
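A sketch of one plausible form of the reweighting: tokens a cheap bigram model already predicts well are downweighted, so the transformer's gradient budget concentrates on the complement (the harder tokens). The exact weighting function is not shown in the record; the linear form below is an assumption:

```python
import numpy as np

def complementary_weights(bigram_probs, alpha=0.5):
    """Per-token loss weights from a bigram model's probability of the
    true token: weight 1.0 for tokens the bigram finds impossible,
    down to 1 - alpha for tokens it predicts with certainty."""
    return 1.0 - alpha * np.asarray(bigram_probs)

def reweighted_loss(token_losses, bigram_probs, alpha=0.5):
    """Weighted mean of per-token losses under complementary weights."""
    w = complementary_weights(bigram_probs, alpha)
    return float((w * np.asarray(token_losses)).sum() / w.sum())
```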
other
BackoffNgramMixer with orders 2-10 and entropy-adaptive alpha mixing.
parameters: {"ngram_order":10,"alpha_base":0.2,"alpha_range":0.55,"alpha_center":3}
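A sketch of the two ingredients as I read them: back off from the longest matching context (order 10) down to order 2, then mix the n-gram distribution into the model's with a weight that adapts to the n-gram distribution's entropy. The sigmoid schedule below, centered at `alpha_center`, is an assumption; only the parameter names and values come from the record:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def adaptive_alpha(h, alpha_base=0.2, alpha_range=0.55, alpha_center=3.0):
    """Assumed schedule: give the n-gram more weight when its
    distribution is confident (low entropy), decaying toward
    alpha_base as entropy rises past alpha_center bits."""
    return alpha_base + alpha_range / (1.0 + np.exp(h - alpha_center))

def backoff_probs(counts_by_order, context, vocab_size):
    """Back off from order 10 down to order 2: use the highest-order
    context with observed counts, else fall back to uniform."""
    for n in range(10, 1, -1):
        ctx = tuple(context[-(n - 1):])
        if ctx in counts_by_order.get(n, {}):
            c = counts_by_order[n][ctx]
            total = sum(c.values())
            return np.array([c.get(t, 0) / total for t in range(vocab_size)])
    return np.full(vocab_size, 1.0 / vocab_size)

def mix(p_model, p_ngram, **kw):
    """Convex mix of model and n-gram distributions with adaptive alpha."""
    a = adaptive_alpha(entropy(p_ngram), **kw)
    return (1.0 - a) * np.asarray(p_model) + a * np.asarray(p_ngram)
```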
Compression
lzma
level: null
Novel Contributions
- Complementary training with bigram-weighted loss reweighting
- BackoffNgramMixer with entropy-adaptive alpha mixing
- Legal score-first AdamW test-time training
- Stride=128 evaluation optimization