PR #1183
Status: open
Non-record: Retrodiction Training (Petz Recovery Map) — val_bpb 1.508
by akaiHuang
val_bpb
1.5080
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.8MB
Training Techniques
Architecture
MLP3x
Transformer with 3x MLP expansion
parameters: null
XSA
Applied to the last 4 layers
parameters: {"layers":4}
BigramHash
Bigram hash features with smear gate
parameters: {"buckets":2048}
SmearGate
Used alongside BigramHash
parameters: null
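The PR does not spell out the hashing scheme, so the following is a minimal sketch of one plausible bigram-hash featurizer: each (previous, current) token pair is hashed into one of 2048 buckets, giving one extra feature id per position. The multiplier constant and the BOS placeholder are assumptions, not taken from the submission; the smear gate itself would be a learned mixing weight applied to the resulting embedding and is not shown here.

```python
def bigram_bucket(prev_tok, cur_tok, buckets=2048):
    # Hash the (previous, current) token pair into one of `buckets` ids.
    # 1000003 is an arbitrary odd multiplier chosen for this sketch.
    return ((prev_tok * 1000003) ^ cur_tok) % buckets

def bigram_features(tokens, buckets=2048, bos=0):
    # One bucket id per position; position 0 is paired with a BOS placeholder.
    return [bigram_bucket(p, c, buckets)
            for p, c in zip([bos] + tokens[:-1], tokens)]
```

Each bucket id would typically index a small learned embedding table that is added to (or gated into) the token representation.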
LeakyReLU
LeakyReLU squared activation
parameters: {"negative_slope":0.5}
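"LeakyReLU squared" is not further specified in the card; a minimal sketch under one plausible reading is to apply LeakyReLU (slope 0.5 on the negative side, as listed above) and then square the output while preserving its sign, so negative inputs remain negative:

```python
def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU, then a sign-preserving square (assumption: the PR may
    # instead square without preserving sign, as in plain ReLU^2).
    y = x if x >= 0.0 else negative_slope * x
    return y * y if y >= 0.0 else -(y * y)
```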
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"embeddings_scalars_optimizer":"AdamW"}
Weight Averaging
EMA
parameters: {"decay":0.997,"start":0.8}
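With decay 0.997 and start 0.8, a reasonable reading is that the EMA of the weights only begins accumulating after 80% of training; before that point the "average" simply tracks the raw parameters. A minimal sketch over flat parameter lists:

```python
def ema_update(avg, params, decay=0.997):
    # Elementwise: avg <- decay * avg + (1 - decay) * params
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

def maybe_ema(avg, params, step, total_steps, start=0.8, decay=0.997):
    # EMA begins only after `start` fraction of training (assumed reading
    # of the "start": 0.8 parameter above).
    if step < start * total_steps:
        return list(params)
    return ema_update(avg, params, decay)
```

At evaluation time the averaged weights, not the raw ones, would be used.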
LR Schedule
warmdown
parameters: {"warmdown_steps":150}
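A warmdown schedule with `warmdown_steps: 150` is conventionally a constant learning rate followed by a linear decay to zero over the final 150 steps. A minimal sketch of that assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=150):
    # Constant LR, then linear decay to zero over the last `warmdown_steps`.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```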
Quantization
int6
bits: 6
scope: all
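The card says only "int6, scope: all"; a common scheme consistent with that is symmetric per-tensor quantization, where 6 bits give integer levels in [-31, 31] and a single scale per tensor. A minimal sketch of that assumption (per-channel scales or asymmetric zero points would also fit the description):

```python
def quantize_int6(weights):
    # Symmetric per-tensor quantization to 6 bits: levels in [-31, 31].
    qmax = 2 ** (6 - 1) - 1  # 31
    m = max(abs(w) for w in weights)
    scale = m / qmax if m > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from integer levels.
    return [v * scale for v in q]
```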
Compression
lzma
level: null
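The compression level is listed as null; the artifact size above suggests the serialized (quantized) weights are packed and then LZMA-compressed. A minimal sketch using Python's standard `lzma` module, with the preset chosen here as an assumption:

```python
import lzma

def compress_artifact(raw: bytes, preset: int = 9) -> bytes:
    # LZMA-compress the serialized weight bytes; the PR does not state
    # which preset/level was used.
    return lzma.compress(raw, preset=preset)
```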
Other
other
Retrodiction auxiliary loss on reversed sequences, inspired by the Petz recovery map, combined with the forward autoregressive loss
parameters: {"alpha":0.3,"applied_every_steps":4}
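With alpha 0.3 and applied_every_steps 4, one natural reading is: every step computes the forward autoregressive loss, and every fourth step additionally computes the same loss on the reversed sequence, weighted by alpha. A minimal sketch of that reading, with `model_loss` a hypothetical callable standing in for the model's next-token loss:

```python
def training_loss(model_loss, tokens, step, alpha=0.3, applied_every_steps=4):
    # Forward autoregressive loss on every step; every `applied_every_steps`
    # steps, add the retrodiction loss on the reversed sequence scaled by
    # `alpha`. `model_loss` is an assumed interface: token list -> scalar.
    loss = model_loss(tokens)
    if step % applied_every_steps == 0:
        loss = loss + alpha * model_loss(list(reversed(tokens)))
    return loss
```

Because causal attention is unchanged and the reversed pass happens only during training, inference cost is identical to the baseline, consistent with the "zero inference cost" claim below.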
Sequence Length
sequence_length
train_length: 32768
eval_length: null
Novel Contributions
- Retrodiction auxiliary training loss inspired by the Petz recovery map
- Training on both forward and reversed sequences while maintaining causal attention
- Reported BPB improvement of 1% to 3.6% over pure autoregressive training at matched token counts
- Zero inference cost because the method is training-only