PR #1255
Non-record Submission: Text Diffusion + Retrodiction + TTT + Depth Recurrence
Status: open · by akaiHuang
val_bpb: 1.5080
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.8 MB
Training Techniques
Other — Text diffusion / CDM with sequential unmasking evaluation
parameters: null
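One plausible reading of "sequential unmasking evaluation" for a masked text denoiser is sketched below: start from a fully masked sequence, reveal one position per step in order, and accumulate the log-loss of each prediction at the moment it is revealed. The `model` interface and the one-byte-per-token bpb conversion are assumptions, not the submission's actual code.

```python
import math

def sequential_unmask_bpb(model, tokens, mask_id):
    # Start fully masked, reveal ground truth one position at a time.
    seq = [mask_id] * len(tokens)
    total_nll = 0.0
    for i in range(len(tokens)):
        probs = model(seq)                       # per-position distributions
        total_nll += -math.log(probs[i][tokens[i]])
        seq[i] = tokens[i]                       # unmask, then continue
    # bits per byte, assuming one byte per token; real tokenizers differ
    return total_nll / math.log(2) / len(tokens)
```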
Other — AR retrodiction training on reversed sequences with causal attention
parameters: {"alpha": 0.3, "interval": 4}
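A minimal sketch of the retrodiction objective as listed: every `interval` steps, also score the reversed sequence with the same causal model and mix the two losses with weight `alpha`. The mixing rule and `forward_loss_fn` interface are assumptions; only alpha=0.3 and interval=4 come from the submission.

```python
ALPHA, INTERVAL = 0.3, 4  # listed parameters

def training_loss(step, batch, forward_loss_fn):
    loss = forward_loss_fn(batch)                      # standard AR loss
    if step % INTERVAL == 0:
        reversed_batch = [seq[::-1] for seq in batch]  # retrodiction view
        loss = (1 - ALPHA) * loss + ALPHA * forward_loss_fn(reversed_batch)
    return loss
```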
Test-Time Training — full TTT
parameters: {"optimizer": "AdamW"}
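A minimal sketch of full test-time training: before scoring held-out data, take a few optimizer steps on that data itself. A one-parameter quadratic model and a hand-rolled AdamW update stand in for the real network; only "full-model TTT with AdamW" comes from the submission.

```python
def adamw_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    m = b1 * m + (1 - b1) * g          # first-moment EMA
    v = b2 * v + (1 - b2) * g * g      # second-moment EMA
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * w)
    return w, m, v

def ttt_score(w, data, steps=50):
    m = v = 0.0
    for t in range(1, steps + 1):      # adapt on the test data itself
        g = sum(2 * (w - x) for x in data) / len(data)
        w, m, v = adamw_step(w, g, m, v, t)
    return sum((w - x) ** 2 for x in data) / len(data)  # loss after TTT
```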
Architecture — depth recurrence experiments
parameters: null
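The usual shape of depth recurrence, sketched with a toy block: one shared block is applied repeatedly, so effective depth grows without adding parameters. The residual form and recurrence count are assumptions; the submission's block would be a transformer layer.

```python
def depth_recurrent_forward(x, block, recurrences):
    # Reuse the same block's weights at every "layer" of depth.
    for _ in range(recurrences):
        x = x + block(x)  # residual application of the shared block
    return x
```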
BigramHash — bigram hash embeddings with 2048 buckets
parameters: {"buckets": 2048}
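A sketch of the bucket lookup behind bigram hash embeddings: each (previous token, token) pair is hashed into one of 2048 buckets, and that bucket's embedding would be added to the regular token embedding. The mixing constant and BOS handling are illustrative assumptions; only the bucket count is from the submission.

```python
BUCKETS = 2048  # listed parameter

def bigram_bucket(prev_tok, tok):
    # Cheap multiplicative mix of the pair, reduced to a bucket id.
    return ((prev_tok * 0x9E3779B1 + tok) & 0xFFFFFFFF) % BUCKETS

def bigram_bucket_ids(tokens, bos=0):
    prev = [bos] + tokens[:-1]  # pair each token with its predecessor
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```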
SmearGate — gating mechanism
parameters: null
LeakyReLU — LeakyReLU squared activation variant
parameters: {"slope": 0.5}
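A sketch of a squared LeakyReLU with negative slope 0.5, in the spirit of the ReLU² activations used in recent speedrun architectures. The sign-preserving squaring convention is an assumption; only the slope is from the submission.

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU first, then square while keeping the sign (assumed).
    y = x if x >= 0 else slope * x
    return y * y if y >= 0 else -(y * y)
```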
XSA — applied to the last 4 layers
parameters: {"layers": 4}
Weight Averaging — EMA
parameters: {"decay": 0.997, "start": "80% of training"}
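The EMA update implied by the listed parameters, sketched per weight: the averaged copy tracks the live weights directly until 80% of training, then blends with decay 0.997. The pre-start behavior is an assumption.

```python
DECAY = 0.997  # listed parameter

def update_ema(ema_w, w, step, total_steps):
    if step < int(0.8 * total_steps):
        return list(w)  # before the start point: just mirror the weights
    # Standard exponential moving average of the weights.
    return [DECAY * e + (1 - DECAY) * x for e, x in zip(ema_w, w)]
```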
LR Schedule — warmdown
parameters: {"warmdown_steps": 150}
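A sketch of a warmdown schedule with the listed 150 steps: constant learning rate for most of training, then linear decay to zero over the final steps. The base LR and total step count below are illustrative.

```python
WARMDOWN_STEPS = 150  # listed parameter

def lr_at(step, total_steps, base_lr=1.0):
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return base_lr                                    # flat phase
    return base_lr * (total_steps - step) / WARMDOWN_STEPS  # linear decay
```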
Quantization — int6
bits: 6; scope: all
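A sketch of symmetric 6-bit quantization over one tensor: values map to integers in [-31, 31] with a single per-tensor scale, reading "scope: all" as applying this to every weight tensor. The per-tensor scaling and rounding scheme are assumptions.

```python
def quantize_int6(values):
    qmax = 2 ** (6 - 1) - 1                          # 31 for signed int6
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]
```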
Compression — lzma
level: null
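The artifact-compression step can be sketched with Python's standard `lzma` module; since the submission lists the level as null, the library default preset is used here. That the artifact is packed exactly this way is an assumption.

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    return lzma.compress(raw)        # default preset (level: null)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```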
Sequence Length
train_length: 8192; eval_length: null
Novel Contributions
- Retrodiction training using reversed sequences as an auxiliary loss
- Application of the Petz recovery map idea to language model training
- Text diffusion / CDM with sequential unmasking evaluation
- Test-time training with full-model AdamW
- Depth recurrence experiments
- Custom v4096 tokenizer