PR #731
Record: 1.0400 BPB -- Hedge Mixer + VRL + AdamW TTT + Polyak EMA
by pentxayc
val_bpb
1.0400
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,999,919 bytes
Training Techniques
Architecture
VRL
Value Residual Learning with a residual connection from layer 0's value output to all subsequent layers
parameters: null
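The VRL connection can be sketched as below. The mixing coefficient `lam` and the plain matrix value projections are illustrative assumptions; the PR does not state how the layer-0 values are blended in:

```python
import numpy as np

def attention_values_with_vrl(x, v_projs, lam=0.5):
    """Toy Value Residual Learning: each later layer's value activations
    are mixed with the value activations computed at layer 0.
    x: (T, d) token activations; v_projs: list of (d, d) value matrices."""
    v0 = x @ v_projs[0]                      # layer-0 value output, reused downstream
    outs = [v0]
    for Wv in v_projs[1:]:
        v = x @ Wv
        outs.append(lam * v + (1 - lam) * v0)  # residual from layer 0's values
    return outs
```

With `lam = 0`, every layer would reuse layer 0's values verbatim; with `lam = 1`, the residual path is disabled.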
LeakyReLU
LeakyReLU activation squared
parameters: {"negative_slope":0.5,"power":2}
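The squared-LeakyReLU activation follows directly from the listed parameters (`negative_slope=0.5`, `power=2`):

```python
def leaky_relu_squared(x, negative_slope=0.5, power=2):
    """LeakyReLU followed by raising to `power`, per the record's parameters.
    Note that squaring maps negative pre-activations to positive outputs."""
    y = x if x >= 0 else negative_slope * x
    return y ** power
```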
XSA-4
Cross-Token Self-Attention applied on the last 4 layers
parameters: {"layers":4}
tied embeddings
Input and output embeddings are tied
parameters: null
GQA
Grouped-query attention with 8 query heads and 4 KV heads
parameters: {"query_heads":8,"kv_heads":4}
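With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A minimal numpy sketch of that head-sharing pattern (single sequence, no masking or projections, which is a simplification of any real implementation):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=4):
    """Toy GQA: q has n_q_heads heads, k/v have n_kv_heads heads;
    each KV head is shared by n_q_heads // n_kv_heads query heads.
    Shapes: q (n_q_heads, T, d), k/v (n_kv_heads, T, d)."""
    group = n_q_heads // n_kv_heads
    outs = []
    for h in range(n_q_heads):
        kv = h // group                                  # KV head serving this query head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w = w / w.sum(-1, keepdims=True)                 # softmax over keys
        outs.append(w @ v[kv])
    return np.stack(outs)                                # (n_q_heads, T, d)
```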
BigramHash
Hashed bigram feature table used in the model
parameters: {"buckets":2048}
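A hashed bigram table maps each (previous byte, current byte) pair into one of 2048 buckets. The mixing constants below are illustrative, not taken from the PR:

```python
def bigram_bucket(prev_byte, cur_byte, buckets=2048):
    """Hash a byte bigram into one of `buckets` feature-table slots.
    Knuth-style multiplicative mixing; constants are a guess, only the
    bucket count (2048) comes from the record."""
    h = (prev_byte * 257 + cur_byte) * 2654435761 % (2 ** 32)
    return h % buckets
```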
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"learning_rate":0.0005,"test_time_training":true}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
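Polyak averaging with `decay=0.998` keeps an exponential moving average of the weights alongside the raw TTT updates:

```python
def polyak_update(avg_params, new_params, decay=0.998):
    """One Polyak (exponential moving average) step over the weights:
    avg <- decay * avg + (1 - decay) * new, applied per parameter."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg_params, new_params)]
```

The averaged copy, not the raw weights, is typically what gets evaluated.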
Compression
lzma
level: null
Test-Time Training
score-first TTT
parameters: {"optimizer":"AdamW","learning_rate":0.0005,"polyak_decay":0.998,"freeze_first_blocks":9,"unfreeze_last_blocks":2,"epochs_per_chunk":3,"byte_weighted_loss":true,"adaptive_cosine_lr":true}
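The freeze/unfreeze split from the parameters (`freeze_first_blocks=9`, `unfreeze_last_blocks=2`) can be sketched as a simple partition of the block list; the block objects here are placeholders, not the PR's modules:

```python
def split_ttt_params(blocks, freeze_first=9, unfreeze_last=2):
    """Partition transformer blocks for test-time training: the first
    `freeze_first` blocks stay frozen and only the last `unfreeze_last`
    blocks receive AdamW gradient updates during TTT."""
    frozen = blocks[:freeze_first]
    trainable = blocks[-unfreeze_last:]
    return frozen, trainable
```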
LR Schedule
adaptive cosine decay
parameters: {"ramp_multiplier_start":1,"ramp_multiplier_end":3,"ramp_fraction":0.3}
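One plausible reading of these parameters: the LR multiplier ramps linearly from 1x to 3x over the first 30% of steps, then cosine-decays toward zero. This is an interpretation of the listed values, not the PR's code:

```python
import math

def adaptive_cosine_lr(step, total_steps, base_lr=5e-4,
                       ramp_start=1.0, ramp_end=3.0, ramp_fraction=0.3):
    """Adaptive cosine schedule sketch: linear ramp of the LR multiplier
    from ramp_start to ramp_end over the first ramp_fraction of steps,
    followed by cosine decay of the multiplier to zero."""
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    if step < ramp_steps:
        mult = ramp_start + (ramp_end - ramp_start) * step / ramp_steps
    else:
        progress = (step - ramp_steps) / max(1, total_steps - ramp_steps)
        mult = ramp_end * 0.5 * (1 + math.cos(math.pi * progress))
    return base_lr * mult
```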
Other
other
Hedge Mixer online ensemble combining neural predictions with unigram, bigram, trigram, and entropy experts using multiplicative weights
parameters: {"experts":5,"eta":0.1,"deferred_updates":true}
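The multiplicative-weights core of such a mixer is standard Hedge: mix the expert distributions under the current weights, score the mixture, then scale each expert's weight by `exp(-eta * log-loss)` on the observed byte. In the record's deferred score-first variant, the update is applied only after the mixed prediction has been scored. A pure-Python sketch (two experts shown; the PR uses five):

```python
import math

def hedge_mix(expert_probs, weights):
    """Weighted mixture of expert predictive distributions, renormalized."""
    mix = [sum(w * p[i] for w, p in zip(weights, expert_probs))
           for i in range(len(expert_probs[0]))]
    s = sum(mix)
    return [m / s for m in mix]

def hedge_update(weights, expert_probs, observed, eta=0.1):
    """Multiplicative-weights step: weight_i *= exp(-eta * logloss_i),
    i.e. weight_i *= p_i(observed) ** eta, then renormalize."""
    new = [w * math.exp(eta * math.log(max(p[observed], 1e-12)))
           for w, p in zip(weights, expert_probs)]
    s = sum(new)
    return [w / s for w in new]
```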
Novel Contributions
- Value Residual Learning (VRL) in the transformer
- 5-expert Hedge Mixer during evaluation
- Deferred score-first Hedge weight updates
- AdamW test-time training with Polyak EMA
- Byte-weighted loss for TTT
- Adaptive cosine learning rate during TTT
- Freeze-first-9-blocks / unfreeze-last-2-blocks TTT scheme
- Int6 mixed quantization with lzma compression
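A symmetric per-tensor int6 scheme (integers in [-31, 31] plus a float scale) is one common way to realize the int6 quantization in the last bullet; the PR's actual "mixed" scheme, e.g. which tensors stay at higher precision, is not specified here:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization sketch: round floats to integers in
    [-31, 31] with a per-tensor scale. The quantized tensor is what would
    be lzma-compressed into the artifact."""
    m = np.abs(w).max()
    scale = m / 31.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from the int6 codes."""
    return q.astype(np.float32) * scale
```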