PR #1255
Non-record Submission: Text Diffusion + Retrodiction + TTT + Depth Recurrence
Status: open · by akaiHuang
val_bpb: 1.5080
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.8 MB
Training Techniques
Other — Text diffusion / CDM with sequential unmasking evaluation
parameters: null
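One plausible reading of "sequential unmasking evaluation" for a masked text denoiser is sketched below: start from a fully masked sequence, reveal one position per step in order, and accumulate the log-loss of each prediction at the moment it is revealed. The `model` interface and the one-byte-per-token bpb conversion are assumptions, not the submission's actual code.

```python
import math

def sequential_unmask_bpb(model, tokens, mask_id):
    # Start fully masked, reveal ground truth one position at a time.
    seq = [mask_id] * len(tokens)
    total_nll = 0.0
    for i in range(len(tokens)):
        probs = model(seq)                       # per-position distributions
        total_nll += -math.log(probs[i][tokens[i]])
        seq[i] = tokens[i]                       # unmask, then continue
    # bits per byte, assuming one byte per token; real tokenizers differ
    return total_nll / math.log(2) / len(tokens)
```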
Other — AR retrodiction training on reversed sequences with causal attention
parameters: {"alpha": 0.3, "interval": 4}
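A minimal sketch of the retrodiction objective as listed: every `interval` steps, also score the reversed sequence with the same causal model and mix the two losses with weight `alpha`. The mixing rule and `forward_loss_fn` interface are assumptions; only alpha=0.3 and interval=4 come from the submission.

```python
ALPHA, INTERVAL = 0.3, 4  # listed parameters

def training_loss(step, batch, forward_loss_fn):
    loss = forward_loss_fn(batch)                      # standard AR loss
    if step % INTERVAL == 0:
        reversed_batch = [seq[::-1] for seq in batch]  # retrodiction view
        loss = (1 - ALPHA) * loss + ALPHA * forward_loss_fn(reversed_batch)
    return loss
```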
Test-Time Training — full TTT
parameters: {"optimizer": "AdamW"}
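A minimal sketch of full test-time training: before scoring held-out data, take a few optimizer steps on that data itself. A one-parameter quadratic model and a hand-rolled AdamW update stand in for the real network; only "full-model TTT with AdamW" comes from the submission.

```python
def adamw_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    m = b1 * m + (1 - b1) * g          # first-moment EMA
    v = b2 * v + (1 - b2) * g * g      # second-moment EMA
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * w)
    return w, m, v

def ttt_score(w, data, steps=50):
    m = v = 0.0
    for t in range(1, steps + 1):      # adapt on the test data itself
        g = sum(2 * (w - x) for x in data) / len(data)
        w, m, v = adamw_step(w, g, m, v, t)
    return sum((w - x) ** 2 for x in data) / len(data)  # loss after TTT
```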
Architecture — depth recurrence experiments
parameters: null
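The usual shape of depth recurrence, sketched with a toy block: one shared block is applied repeatedly, so effective depth grows without adding parameters. The residual form and recurrence count are assumptions; the submission's block would be a transformer layer.

```python
def depth_recurrent_forward(x, block, recurrences):
    # Reuse the same block's weights at every "layer" of depth.
    for _ in range(recurrences):
        x = x + block(x)  # residual application of the shared block
    return x
```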
BigramHash — bigram hash embeddings with 2048 buckets
parameters: {"buckets": 2048}
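A sketch of the bucket lookup behind bigram hash embeddings: each (previous token, token) pair is hashed into one of 2048 buckets, and that bucket's embedding would be added to the regular token embedding. The mixing constant and BOS handling are illustrative assumptions; only the bucket count is from the submission.

```python
BUCKETS = 2048  # listed parameter

def bigram_bucket(prev_tok, tok):
    # Cheap multiplicative mix of the pair, reduced to a bucket id.
    return ((prev_tok * 0x9E3779B1 + tok) & 0xFFFFFFFF) % BUCKETS

def bigram_bucket_ids(tokens, bos=0):
    prev = [bos] + tokens[:-1]  # pair each token with its predecessor
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```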
SmearGate — gating mechanism
parameters: null
LeakyReLU — LeakyReLU squared activation variant
parameters: {"slope": 0.5}
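A sketch of a squared LeakyReLU with negative slope 0.5, in the spirit of the ReLU² activations used in recent speedrun architectures. The sign-preserving squaring convention is an assumption; only the slope is from the submission.

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU first, then square while keeping the sign (assumed).
    y = x if x >= 0 else slope * x
    return y * y if y >= 0 else -(y * y)
```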
XSA — applied to the last 4 layers
parameters: {"layers": 4}
Weight Averaging — EMA
parameters: {"decay": 0.997, "start": "80% of training"}
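The EMA update implied by the listed parameters, sketched per weight: the averaged copy tracks the live weights directly until 80% of training, then blends with decay 0.997. The pre-start behavior is an assumption.

```python
DECAY = 0.997  # listed parameter

def update_ema(ema_w, w, step, total_steps):
    if step < int(0.8 * total_steps):
        return list(w)  # before the start point: just mirror the weights
    # Standard exponential moving average of the weights.
    return [DECAY * e + (1 - DECAY) * x for e, x in zip(ema_w, w)]
```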
LR Schedule — warmdown
parameters: {"warmdown_steps": 150}
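A sketch of a warmdown schedule with the listed 150 steps: constant learning rate for most of training, then linear decay to zero over the final steps. The base LR and total step count below are illustrative.

```python
WARMDOWN_STEPS = 150  # listed parameter

def lr_at(step, total_steps, base_lr=1.0):
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return base_lr                                    # flat phase
    return base_lr * (total_steps - step) / WARMDOWN_STEPS  # linear decay
```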
Quantization — int6
bits: 6; scope: all
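A sketch of symmetric 6-bit quantization over one tensor: values map to integers in [-31, 31] with a single per-tensor scale, reading "scope: all" as applying this to every weight tensor. The per-tensor scaling and rounding scheme are assumptions.

```python
def quantize_int6(values):
    qmax = 2 ** (6 - 1) - 1                          # 31 for signed int6
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]
```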
Compression — lzma
level: null
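The artifact-compression step can be sketched with Python's standard `lzma` module; since the submission lists the level as null, the library default preset is used here. That the artifact is packed exactly this way is an assumption.

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    return lzma.compress(raw)        # default preset (level: null)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```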
Sequence Length
train_length: 8192; eval_length: null
Novel Contributions
- Retrodiction training using reversed sequences as an auxiliary loss
- Application of the Petz recovery map idea to language model training
- Text diffusion / CDM with sequential unmasking evaluation
- Test-time training with full-model AdamW
- Depth recurrence experiments
- Custom v4096 tokenizer