PR #1183
Status: open
Non-record: Retrodiction Training (Petz Recovery Map) — val_bpb 1.508
by akaiHuang
val_bpb
1.5080
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.8MB
Training Techniques
Architecture
MLP3x
Transformer with 3x MLP expansion
parameters: null
XSA
Applied to the last 4 layers
parameters: {"layers":4}
BigramHash
Bigram hash features with smear gate
parameters: {"buckets":2048}
SmearGate
Used alongside BigramHash
parameters: null
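The PR does not spell out the hashing scheme, so the following is a minimal sketch of one plausible bigram-hash featurizer: each (previous, current) token pair is hashed into one of 2048 buckets, giving one extra feature id per position. The multiplier constant and the BOS placeholder are assumptions, not taken from the submission; the smear gate itself would be a learned mixing weight applied to the resulting embedding and is not shown here.

```python
def bigram_bucket(prev_tok, cur_tok, buckets=2048):
    # Hash the (previous, current) token pair into one of `buckets` ids.
    # 1000003 is an arbitrary odd multiplier chosen for this sketch.
    return ((prev_tok * 1000003) ^ cur_tok) % buckets

def bigram_features(tokens, buckets=2048, bos=0):
    # One bucket id per position; position 0 is paired with a BOS placeholder.
    return [bigram_bucket(p, c, buckets)
            for p, c in zip([bos] + tokens[:-1], tokens)]
```

Each bucket id would typically index a small learned embedding table that is added to (or gated into) the token representation.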
LeakyReLU
LeakyReLU squared activation
parameters: {"negative_slope":0.5}
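"LeakyReLU squared" is not further specified in the card; a minimal sketch under one plausible reading is to apply LeakyReLU (slope 0.5 on the negative side, as listed above) and then square the output while preserving its sign, so negative inputs remain negative:

```python
def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU, then a sign-preserving square (assumption: the PR may
    # instead square without preserving sign, as in plain ReLU^2).
    y = x if x >= 0.0 else negative_slope * x
    return y * y if y >= 0.0 else -(y * y)
```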
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"embeddings_scalars_optimizer":"AdamW"}
Weight Averaging
EMA
parameters: {"decay":0.997,"start":0.8}
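With decay 0.997 and start 0.8, a reasonable reading is that the EMA of the weights only begins accumulating after 80% of training; before that point the "average" simply tracks the raw parameters. A minimal sketch over flat parameter lists:

```python
def ema_update(avg, params, decay=0.997):
    # Elementwise: avg <- decay * avg + (1 - decay) * params
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

def maybe_ema(avg, params, step, total_steps, start=0.8, decay=0.997):
    # EMA begins only after `start` fraction of training (assumed reading
    # of the "start": 0.8 parameter above).
    if step < start * total_steps:
        return list(params)
    return ema_update(avg, params, decay)
```

At evaluation time the averaged weights, not the raw ones, would be used.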
LR Schedule
warmdown
parameters: {"warmdown_steps":150}
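A warmdown schedule with `warmdown_steps: 150` is conventionally a constant learning rate followed by a linear decay to zero over the final 150 steps. A minimal sketch of that assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=150):
    # Constant LR, then linear decay to zero over the last `warmdown_steps`.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```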
Quantization
int6
bits: 6
scope: all
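The card says only "int6, scope: all"; a common scheme consistent with that is symmetric per-tensor quantization, where 6 bits give integer levels in [-31, 31] and a single scale per tensor. A minimal sketch of that assumption (per-channel scales or asymmetric zero points would also fit the description):

```python
def quantize_int6(weights):
    # Symmetric per-tensor quantization to 6 bits: levels in [-31, 31].
    qmax = 2 ** (6 - 1) - 1  # 31
    m = max(abs(w) for w in weights)
    scale = m / qmax if m > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from integer levels.
    return [v * scale for v in q]
```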
Compression
lzma
level: null
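The compression level is listed as null; the artifact size above suggests the serialized (quantized) weights are packed and then LZMA-compressed. A minimal sketch using Python's standard `lzma` module, with the preset chosen here as an assumption:

```python
import lzma

def compress_artifact(raw: bytes, preset: int = 9) -> bytes:
    # LZMA-compress the serialized weight bytes; the PR does not state
    # which preset/level was used.
    return lzma.compress(raw, preset=preset)
```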
Other
other
Retrodiction auxiliary loss on reversed sequences, inspired by the Petz recovery map, combined with the forward autoregressive loss
parameters: {"alpha":0.3,"applied_every_steps":4}
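With alpha 0.3 and applied_every_steps 4, one natural reading is: every step computes the forward autoregressive loss, and every fourth step additionally computes the same loss on the reversed sequence, weighted by alpha. A minimal sketch of that reading, with `model_loss` a hypothetical callable standing in for the model's next-token loss:

```python
def training_loss(model_loss, tokens, step, alpha=0.3, applied_every_steps=4):
    # Forward autoregressive loss on every step; every `applied_every_steps`
    # steps, add the retrodiction loss on the reversed sequence scaled by
    # `alpha`. `model_loss` is an assumed interface: token list -> scalar.
    loss = model_loss(tokens)
    if step % applied_every_steps == 0:
        loss = loss + alpha * model_loss(list(reversed(tokens)))
    return loss
```

Because causal attention is unchanged and the reversed pass happens only during training, inference cost is identical to the baseline, consistent with the "zero inference cost" claim below.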
Sequence Length
sequence_length
train_length: 32768
eval_length: null
Novel Contributions
- Retrodiction auxiliary training loss inspired by the Petz recovery map
- Training on both forward and reversed sequences while maintaining causal attention
- Reported BPB improvement of 1% to 3.6% over pure autoregressive training at matched token counts
- Zero inference cost because the method is training-only