PR #1507 (open)

[RECORD] L-BFGS SLOT + Entropy-Adaptive N-gram Mixer (0.2282 BPB)

by ChideraIbe123
val_bpb: 0.2282
Architecture: Transformer
Optimizer: L-BFGS
Artifact Size: ~15.75 MB

Training Techniques

Optimizer: L-BFGS
  • weight_decay: null
  • momentum: null
  • other_params: {"history_size": 10, "line_search": "strong Wolfe", "steps": 6}
Architecture
  • LeakyReLU: LeakyReLU-squared MLP activation; parameters: {"negative_slope": 0.5}
  • GQA: grouped-query attention with fewer KV heads than attention heads; parameters: {"heads": 8, "kv_heads": 4}
  • SmearGate: SmearGate component in the architecture
  • BigramHash: bigram hash feature/module
  • XSA: XSA-all attention component
  • Weight Averaging: EMA + SWA
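The GQA setting (8 query heads sharing 4 KV heads) can be illustrated with a dependency-free single-position sketch: query head h reads KV head h // (heads // kv_heads). This is a generic GQA illustration under the PR's head counts, not the submission's actual module.

```python
import math

def gqa_attention(q, k, v, kv_heads):
    """Grouped-query attention sketch for one query position.
    q: [heads][d], k and v: [kv_heads][T][d]. Each group of
    heads // kv_heads query heads shares one KV head."""
    heads, d = len(q), len(q[0])
    group = heads // kv_heads
    out = []
    for h in range(heads):
        kh = h // group  # which shared KV head this query head uses
        scores = [sum(qi * ki for qi, ki in zip(q[h], kt)) / math.sqrt(d)
                  for kt in k[kh]]
        m = max(scores)  # numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * vt[j] for wi, vt in zip(w, v[kh])) / z
                    for j in range(d)])
    return out
```

With 4 KV heads instead of 8, the KV cache is halved while each query head still gets its own attention pattern.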
Quantization: GPTQ
  • bits: 6
  • scope: all

Compression: lzma
  • level: null
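The artifact-size side of the recipe is int6 quantization followed by lzma. Real GPTQ compensates rounding error column by column using second-order statistics; the sketch below substitutes plain round-to-nearest so it stays dependency-free, and only demonstrates the quantize, serialize, and compress pipeline. All names and the one-code-per-byte layout are illustrative.

```python
import lzma
import random
import struct

def quantize_int6(weights):
    """Per-tensor symmetric round-to-nearest into the signed 6-bit
    range [-31, 31]. (GPTQ proper adds Hessian-based error
    compensation; this is only an RTN placeholder.)"""
    scale = max(abs(w) for w in weights) / 31 or 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return scale, q

def dequantize(scale, q):
    return [scale * qi for qi in q]

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(4096)]
scale, q = quantize_int6(w)
# Serialize: float32 scale, then one 6-bit code per byte (unpacked).
raw = struct.pack("<f", scale) + bytes(qi & 0x3F for qi in q)
blob = lzma.compress(raw, preset=9)
```

Because the 6-bit codes of Gaussian-ish weights are far from uniform, lzma removes a further chunk of the artifact size on top of quantization.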
Test-Time Training: score-first SLOT
  • parameters: {"frozen_model": true, "no_grad_hidden_states": true}
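Score-first SLOT with a frozen model and no-grad hidden states amounts to: cache the final hidden states once, then optimize a small per-sample parameter against the prompt's own next-token loss through the output head only. A hedged sketch of that idea is below; the PR pairs this with L-BFGS, but plain gradient descent is used here to keep it dependency-free, and the additive `delta` parameterization is an assumption, not the PR's actual formulation.

```python
import math

def slot_adapt(hidden, W, targets, steps=6, lr=0.1):
    """SLOT-style test-time sketch. `hidden` are frozen, pre-cached
    final hidden states [T][d]; W is the frozen output head [V][d].
    Only the additive vector `delta` is optimized, by minimizing
    cross-entropy of the prompt's own next tokens."""
    d, V = len(hidden[0]), len(W)
    delta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for h, t in zip(hidden, targets):
            z = [sum(W[v][j] * (h[j] + delta[j]) for j in range(d))
                 for v in range(V)]
            m = max(z)
            e = [math.exp(zi - m) for zi in z]
            Z = sum(e)
            p = [ei / Z for ei in e]
            # dL/ddelta = sum_v (p_v - 1[v == t]) * W_v
            for v in range(V):
                coef = p[v] - (1.0 if v == t else 0.0)
                for j in range(d):
                    grad[j] += coef * W[v][j]
        delta = [dj - lr * g / len(hidden) for dj, g in zip(delta, grad)]
    return delta
```

Because the backbone never sees a gradient, the per-sample cost is a handful of tiny head-only optimization steps.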
Other: entropy-adaptive n-gram mixer with target-independent mixing features (entropy, match order, context count)
  • parameters: {"order": 12, "hash_buckets": 4000000}
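A dependency-free sketch of what an order-12 hashed n-gram backoff with target-independent mixing could look like is below. The bucket count and maximum order come from the PR; the gating formula combining entropy, match order, and context count is purely illustrative (the real mixer presumably fits the gate from those features), and all names are assumptions.

```python
import math

NUM_BUCKETS = 4_000_000  # hashed contexts, as in the PR
MAX_ORDER = 12

def bucket(ctx):
    # Hash a context tuple of token ids into a fixed bucket table.
    return hash(ctx) % NUM_BUCKETS

class NgramMixer:
    """Hashed n-gram model with entropy-adaptive mixing. The mixing
    weight depends only on target-independent features (distribution
    entropy, match order, context count), so the gate never peeks at
    the token being predicted."""
    def __init__(self):
        self.counts = {}  # (bucket, next_token) -> count
        self.totals = {}  # bucket -> count

    def train(self, tokens):
        for i in range(1, len(tokens)):
            for n in range(1, min(MAX_ORDER, i) + 1):
                b = bucket(tuple(tokens[i - n:i]))
                key = (b, tokens[i])
                self.counts[key] = self.counts.get(key, 0) + 1
                self.totals[b] = self.totals.get(b, 0) + 1

    def predict(self, context, vocab):
        # Back off from the longest order with any observed continuation.
        for n in range(min(MAX_ORDER, len(context)), 0, -1):
            b = bucket(tuple(context[-n:]))
            total = self.totals.get(b, 0)
            if total:
                probs = {t: self.counts.get((b, t), 0) / total for t in vocab}
                ent = -sum(p * math.log(p) for p in probs.values() if p > 0)
                # Illustrative gate: trust the n-gram more when its
                # distribution is peaked, its order is high, and its
                # context is frequent.
                w = (1 / (1 + ent)) * (n / MAX_ORDER) * (total / (total + 1))
                return probs, w
        return {t: 1 / len(vocab) for t in vocab}, 0.0

def mix(p_model, p_ngram, w):
    # Interpolate neural-model and n-gram distributions with gate w.
    return {t: (1 - w) * p_model[t] + w * p_ngram.get(t, 0.0) for t in p_model}
```

On repetitive text the matched distribution is low-entropy, so the gate shifts mass toward the n-gram model exactly where it is most reliable.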

Novel Contributions

  • L-BFGS optimization for SLOT instead of AdamW
  • Entropy-adaptive n-gram mixer with target-independent features
  • Order-12 vectorized n-gram backoff with 4M hash buckets
  • Strong Wolfe line search with limited L-BFGS history
  • Combination of EMA, SWA, late QAT, GPTQ int6, and lzma compression
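The weight-averaging entry combines two standard schemes: an exponential moving average that tracks recent weights, and a stochastic weight average that uniformly averages checkpoints from the tail of training. A minimal sketch of both running side by side (the decay value is illustrative, not the PR's):

```python
class WeightAverager:
    """Maintains an EMA and an SWA of a weight vector simultaneously,
    as a sketch of the PR's EMA + SWA combination."""
    def __init__(self, decay=0.999):
        self.decay = decay
        self.ema = None            # exponential moving average
        self.swa, self.n = None, 0  # uniform running average

    def update(self, weights):
        if self.ema is None:
            self.ema = list(weights)
            self.swa = list(weights)
            self.n = 1
            return
        d = self.decay
        self.ema = [d * e + (1 - d) * w for e, w in zip(self.ema, weights)]
        self.n += 1
        # Incremental uniform mean: s += (w - s) / n
        self.swa = [s + (w - s) / self.n for s, w in zip(self.swa, weights)]
```

EMA reacts quickly to recent steps while SWA flattens the loss landscape over a longer window, which is why recipes often evaluate both and keep the better checkpoint.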