PR #1820 (open)
Non-record: Kitchen Sink — ACT vs Masked Recurrence, 8192-Bigram (val_bpb 1.4011)
by aiejvn
val_bpb
1.4011
Architecture
Transformer
Optimizer
Muon
Artifact Size
~10.9 MB
Training Techniques
Architecture
depth recurrence
Universal Transformer with 22 recurrent iterations and recurrence control via ACT or masked recurrence.
parameters: {"layers":11,"num_iters":22}
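A minimal sketch of the two recurrence controls being ablated, assuming a single weight-tied block stands in for the 11 shared layers; the ACT variant shown freezes halted tokens rather than using the original weighted-state formulation, and all names and dims are illustrative, not the PR's code:

```python
import torch
import torch.nn as nn

class UniversalBlock(nn.Module):
    """Shared-weight block applied recurrently (Universal Transformer style)."""

    def __init__(self, dim=256, num_iters=22):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.halt = nn.Linear(dim, 1)  # ACT halting head
        self.num_iters = num_iters

    def forward(self, x, mode="act", eps=0.01):
        if mode == "masked":
            # Masked recurrence (one plausible reading): every token takes
            # a fixed number of iterations, no learned halting.
            for _ in range(self.num_iters):
                x = self.layer(x)
            return x
        # ACT: accumulate per-token halting probability; freeze a token's
        # state once its cumulative probability crosses 1 - eps.
        halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        cum_p = torch.zeros(x.shape[:2], device=x.device)
        for _ in range(self.num_iters):
            new_x = self.layer(x)
            p = torch.sigmoid(self.halt(new_x)).squeeze(-1)
            cum_p = cum_p + p * (~halted)
            keep = (~halted).unsqueeze(-1).float()
            x = keep * new_x + (1 - keep) * x  # halted tokens keep old state
            halted = halted | (cum_p > 1 - eps)
            if halted.all():
                break
        return x
```

With mode="masked" every token takes all 22 iterations; with mode="act" per-token halting can cut the recurrence short, which is the trade-off the ablation measures.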
XSA
Cross-shaped attention variant used in the model.
parameters: null
BigramHash
8192-bigram bucket embedding for token pair features.
parameters: {"buckets":8192,"embed_dim":128}
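A sketch of how an 8192-bucket bigram embedding can be wired in: hash each (previous, current) token pair into a bucket and look up its embedding. The hash constant and the placement of the resulting features are assumptions:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Bucketed bigram features: one embedding per hashed token pair."""

    def __init__(self, buckets=8192, embed_dim=128):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, embed_dim)

    def forward(self, tokens):                        # tokens: (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                                # no predecessor at t=0
        h = (prev * 1000003 + tokens) % self.buckets  # cheap multiplicative hash
        return self.emb(h)                            # (B, T, embed_dim)
```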
Partial RoPE
Rotary positional embedding applied to only a subset of each head's dimensions.
parameters: {"base":10000,"dims":16}
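A sketch of partial RoPE matching the listed parameters (base 10000, 16 rotated dims), using the rotate-half convention; rotating the first dims of each head, with the rest passed through, is an assumption:

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` of each head; leave the rest untouched.
    x: (B, H, T, D)."""
    rot, rest = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    T = x.shape[2]
    freqs = base ** (-torch.arange(half, device=x.device) / half)  # (half,)
    angles = torch.arange(T, device=x.device)[:, None] * freqs     # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)
```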
LeakyReLU²
Squared LeakyReLU activation variant.
parameters: null
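The record doesn't pin down the exact functional form; one plausible reading of a squared LeakyReLU squares the positive branch, as in ReLU², while keeping a small linear leak for negative inputs:

```python
import torch

def leaky_relu_squared(x, negative_slope=0.01):
    # Assumed variant: x^2 for x > 0, small linear leak for x <= 0,
    # so gradients don't vanish on the negative side.
    return torch.where(x > 0, x * x, negative_slope * x)
```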
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: null
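A standard EMA-of-weights sketch; the record lists no parameters, so the decay value here is a placeholder:

```python
import copy
import torch

class EMAWeights:
    """Maintain a shadow copy of the model whose weights are an
    exponential moving average of the live weights."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)  # s = decay*s + (1-decay)*p
```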
Quantization
late QAT
Quantization-aware training enabled late in the run.
bits: null
scope: model
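A sketch of the fake-quantization step that late QAT typically relies on, with a straight-through estimator; the record leaves the bit width unspecified, so bits=8 below is a placeholder:

```python
import torch

def fake_quantize(w, bits=8):
    """Forward pass sees quantized weights; gradient passes straight through.
    'Late' QAT would apply this only for the final fraction of training."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()  # straight-through estimator
```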
Compression
zlib
level: 9
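The artifact size can be measured by compressing the serialized weights with zlib at level 9, as listed; a measurement sketch, not the PR's packaging script:

```python
import io
import zlib
import torch

def compressed_artifact_bytes(model):
    """Serialize the state dict and return its zlib-9 compressed size."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return len(zlib.compress(buf.getvalue(), level=9))
```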
Test-Time Training
full TTT
parameters: {"enabled":true,"adaptation_epochs":10,"adapt_on":"val_tokens"}
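A sketch of the full-TTT setting per the listed parameters (10 adaptation epochs on val_tokens); the optimizer, learning rate, and the model(x) -> logits interface are assumptions. Because this adapts on the evaluation tokens themselves, the headline 1.4011 is reported pre-TTT:

```python
import torch
import torch.nn.functional as F

def test_time_train(model, val_tokens, epochs=10, lr=1e-4, seq_len=1024):
    """Briefly fine-tune on the validation tokens before scoring them."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for i in range(0, val_tokens.numel() - seq_len - 1, seq_len):
            x = val_tokens[i : i + seq_len].unsqueeze(0)
            y = val_tokens[i + 1 : i + seq_len + 1].unsqueeze(0)
            logits = model(x)  # (1, T, vocab) assumed
            loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```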
Regularization
logit softcap
parameters: {"value":30}
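Logit softcapping with value 30 smoothly bounds the logits to (-30, 30):

```python
import torch

def softcap_logits(logits, cap=30.0):
    # cap * tanh(logits / cap): roughly identity for small logits,
    # saturating at +/- cap for large ones.
    return cap * torch.tanh(logits / cap)
```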
weight decay
parameters: {"value":0.04}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":6000}
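The listed schedule reads as a trapezoid: 20 linear warmup steps, a flat phase, then a 6000-step linear warmdown; total_steps below is an assumption since the record doesn't state the run length:

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_steps=6000):
    """Multiplier on the base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear warmup
    if step > total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)  # warmdown
    return 1.0                                    # flat phase
```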
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- Ablation comparing ACT versus masked recurrence on a kitchen-sink Universal Transformer baseline
- Demonstrates that ACT and masked recurrence converge to nearly identical validation BPB at this scale and budget
- Combines 8192-bigram bucket embeddings with a Universal Transformer, XSA, LeakyReLU², EMA, late QAT, and zlib-9 artifact compression
- Reports honest pre-TTT validation BPB while noting TTT-on-val behavior separately