val_bpb: 1.6200
Architecture: Transformer
Optimizer: SGD
Artifact Size: —
Training Techniques
- Architecture: BigramHash (n-gram logit boost using hashed n-gram tables with normalized softmax-based boosting; parameters: null)
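A minimal sketch of hashed n-gram logit boosting, assuming a fixed-size bucket table of next-token counts and a softmax-normalized boost blended into the base logits; the function name, hashing scheme, and blending rule are illustrative, and the submission's collision-fix normalization is not reproduced here.

```python
import numpy as np

def ngram_boost(logits, context, table, n=2, num_buckets=1 << 16, alpha=0.5):
    """Boost next-token logits with counts from a hashed n-gram table.

    `table` maps hash buckets to count vectors over the vocabulary.
    The counts are turned into a normalized distribution via softmax
    and blended into the base logits (illustrative rule, not the
    submission's exact formulation).
    """
    key = hash(tuple(context[-(n - 1):])) % num_buckets
    counts = table[key]                      # shape: (vocab,)
    z = counts - counts.max()                # stabilized softmax
    boost = np.exp(z) / np.exp(z).sum()
    return logits + alpha * np.log(boost + 1e-9)
```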
- Other: HedgeMixer (online multiplicative-weights mixing between neural and neural+n-gram experts; parameters: null)
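HedgeMixer's multiplicative-weights mixing can be sketched as a standard Hedge update over two experts, where each expert's weight is scaled by its per-token likelihood; the class shape and learning rate `eta` are assumptions, not the submission's code.

```python
import numpy as np

class HedgeMixer:
    """Online multiplicative-weights mixing of expert distributions.

    Weights shrink exponentially in each expert's log-loss on the
    observed token, so the better-predicting expert gains mixing weight.
    """
    def __init__(self, n_experts=2, eta=1.0):
        self.w = np.ones(n_experts) / n_experts
        self.eta = eta

    def mix(self, expert_probs):
        # expert_probs: (n_experts, vocab) predictive distributions
        return self.w @ expert_probs

    def update(self, expert_probs, token):
        loss = -np.log(expert_probs[:, token] + 1e-12)
        self.w *= np.exp(-self.eta * loss)   # multiplicative-weights step
        self.w /= self.w.sum()               # renormalize to a simplex
```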
- Optimizer: SGD (weight_decay: null; momentum: 0.95; other_params: {"per_layer_lr": true})
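The summary records only momentum 0.95 and per-layer learning rates, so the following is a sketch of one heavy-ball SGD step with a separate learning rate per layer; the list-of-arrays API and the example rates are illustrative.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lrs, momentum=0.95):
    """One SGD-with-momentum update with a separate lr per layer.

    params/grads/velocities are parallel lists of arrays; `lrs` gives
    each layer's learning rate. Classic (heavy-ball) momentum buffer:
    v <- momentum * v + g, then p <- p - lr * v, all in place.
    """
    for p, g, v, lr in zip(params, grads, velocities, lrs):
        v *= momentum
        v += g
        p -= lr * v
```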
- Weight Averaging: Polyak averaging (parameters: null)
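Polyak averaging during TTT can be sketched as an incremental equal-weight running mean of the parameters; whether the submission uses an equal-weight or exponential average is not recorded, so the equal-weight form here is an assumption.

```python
import numpy as np

class PolyakAverager:
    """Running (Polyak) average of parameters during test-time training.

    Maintains the mean of all parameter snapshots seen so far via the
    incremental-mean update a <- a + (p - a) / (t + 1).
    """
    def __init__(self, params):
        self.avg = [p.copy() for p in params]  # snapshot 0
        self.t = 0

    def update(self, params):
        self.t += 1
        for a, p in zip(self.avg, params):
            a += (p - a) / (self.t + 1)        # mean over t+1 snapshots
```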
- Test-Time Training: score-first TTT (parameters: null)
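"Score-first" ordering means each token is scored with the current model state before that token is used to update the model, so evaluation never leaks the label being predicted. A sketch with caller-supplied `score` and `update` callbacks (the callback API is an assumption):

```python
def score_first_ttt(tokens, score, update):
    """Score-first test-time training loop.

    For each position i: first score token i given the preceding
    context, then fold token i into the model's online update.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        total += score(tokens[:i], tokens[i])  # predict token i first...
        update(tokens[:i + 1])                 # ...then learn from it
    return total
```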
- Regularization: logit bias (parameters: {"per_document": true})
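One plausible reading of per-document online bias correction: track token frequencies within the current document and add a scaled log-frequency bias to the logits, resetting at document boundaries. The exact bias rule and the smoothing below are assumptions.

```python
import numpy as np

class DocumentBias:
    """Per-document online logit bias (illustrative sketch).

    Counts tokens seen so far in the current document and adds a
    scaled log of the (add-one smoothed) empirical frequency to the
    model's logits; reset() is called at each document boundary.
    """
    def __init__(self, vocab, alpha=0.1):
        self.counts = np.zeros(vocab)
        self.alpha = alpha

    def apply(self, logits):
        freq = (self.counts + 1) / (self.counts.sum() + len(self.counts))
        return logits + self.alpha * np.log(freq)

    def observe(self, token):
        self.counts[token] += 1

    def reset(self):
        self.counts[:] = 0
```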
- Evaluation: sliding window eval (parameters: null)
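Sliding-window evaluation scores each token given at most a fixed window of preceding context, so long sequences can be evaluated with a bounded context size. A sketch with a caller-supplied per-token loss; the window size and callback API are illustrative.

```python
def sliding_window_eval(tokens, score, window=64):
    """Mean per-token loss with a sliding context window.

    Each token i is scored by `score(ctx, tok)` where ctx is at most
    the previous `window` tokens.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        ctx = tokens[max(0, i - window):i]
        total += score(ctx, tokens[i])
    return total / (len(tokens) - 1)
```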
- Quantization: int6 (bits: 6; scope: all)
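A minimal int6 quantization sketch: symmetric per-tensor rounding to signed 6-bit integers in [-31, 31] with a single scale. Whether the submission quantizes per-tensor or per-channel is not recorded, so the per-tensor scale is an assumption.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor 6-bit quantization.

    Maps weights to integers in [-31, 31]; qmax = 2**(6-1) - 1 = 31.
    Stored in int8 since NumPy has no 6-bit dtype.
    """
    qmax = 2 ** (6 - 1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```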
Novel Contributions
- Normalized n-gram logit boost with softmax and collision-fix normalization
- HedgeMixer online multiplicative-weights expert mixing
- Test-time training with SGD (momentum 0.95) and per-layer learning rates
- Polyak averaging during TTT
- Per-document online bias correction
- Score-first update ordering for TTT and n-gram/HedgeMixer updates
- Numba JIT acceleration and fallback chain