PR #1166
open
Non-record: 10L E2E TTT-Linear + FlowRefiner — val_bpb 1.1347 (README request)
by Christopher-Lee-McClendon
val_bpb
1.1347
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.20 MB
Training Techniques
Architecture
weight tying
Token embedding and output (lm_head) weights are shared.
parameters: null
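A minimal sketch of what the tied-embeddings entry describes (sizes here are illustrative, not the model's): one table serves as both the input lookup and the output projection, so no separate lm_head matrix is stored in the artifact.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL = 1000, 64          # illustrative sizes, not the model's
E = rng.standard_normal((VOCAB, D_MODEL)) * 0.02   # single shared table

def embed(token_ids):
    return E[token_ids]            # input side: row lookup in E

def lm_head(hidden):
    return hidden @ E.T            # output side: project with the same E

logits = lm_head(embed(np.array([1, 2, 3])))
assert logits.shape == (3, VOCAB)  # one logit per vocab entry, no extra matrix
```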
BigramHash
Bigram hash embedding component.
parameters: {"dimensions":1536}
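A hedged sketch of a bigram hash embedding: each (previous token, current token) pair is hashed into a bucket whose learned vector is added to the input. Only the 1536-dimensional vector size comes from the parameters; the bucket count and hash function below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
TABLE_SIZE = 4096    # hypothetical bucket count; not stated in the PR
DIM = 1536           # from parameters
table = rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32) * 0.02

def bigram_hash_embed(tokens):
    """Hash each (prev, cur) token pair into a bucket and look up its vector."""
    prev = np.concatenate(([0], tokens[:-1]))          # pad position 0
    buckets = (prev * 1000003 + tokens) % TABLE_SIZE   # simple mixing hash
    return table[buckets]                              # (seq, DIM)

out = bigram_hash_embed(np.array([5, 9, 9, 2]))
assert out.shape == (4, DIM)
```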
XSA
Exclusive self-attention used in the last layers.
parameters: {"layers":4}
U-Net skip connections
Skip connections inspired by U-Net.
parameters: null
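The U-Net-style wiring can be sketched as a push/pop over the layer stack: the first half of the blocks record their outputs, and the mirrored second-half blocks add them back in (toy scalar "layers" stand in for transformer blocks here).

```python
def unet_forward(x, layers):
    """U-Net-style skip wiring (sketch): the first half of the stack pushes
    outputs onto a stack; the second half pops and adds them back in."""
    skips = []
    half = len(layers) // 2
    for layer in layers[:half]:
        x = layer(x)
        skips.append(x)
    for layer in layers[half:]:
        x = x + skips.pop()        # skip from the mirrored early layer
        x = layer(x)
    return x

layers = [lambda v: v + 1] * 4     # toy scalar "layers"
assert unet_forward(0, layers) == 7
```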
VE128
Value embeddings with 128-dimensional vectors.
parameters: {"dimensions":128}
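A speculative sketch of value embeddings: a per-token learned vector mixed into the attention values. Only the 128-dimensional size is stated; the vocab size, the gate, and the assumption that the value head dimension equals 128 are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, VE_DIM = 1000, 128     # VE_DIM from parameters; vocab size illustrative
ve_table = rng.standard_normal((VOCAB, VE_DIM)).astype(np.float32) * 0.02

def add_value_embeddings(v, tokens, gate=0.5):
    """Mix a per-token learned vector into the attention values.
    The gate and the direct dim match are assumptions, not from the PR."""
    return v + gate * ve_table[tokens]

v = rng.standard_normal((6, VE_DIM)).astype(np.float32)
out = add_value_embeddings(v, np.array([3, 1, 4, 1, 5, 9]))
assert out.shape == v.shape
```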
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
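Partial RoPE rotates only a leading slice of each head's dimensions and passes the rest through unchanged. The 16 rotated dimensions come from the parameters; the rotary base of 10000 and the head dimension below are assumptions.

```python
import numpy as np

ROT_DIMS = 16  # from parameters: only the first 16 dims per head are rotated

def partial_rope(x, positions):
    """Apply rotary embedding to the first ROT_DIMS dims of one head;
    pass the remaining dims through. x: (seq, head_dim). Base is assumed."""
    half = ROT_DIMS // 2
    inv_freq = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = positions[:, None] * inv_freq[None, :]          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT_DIMS]                # rotated pair
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT_DIMS:]], axis=-1)

x = np.random.default_rng(0).standard_normal((8, 64))
y = partial_rope(x, np.arange(8))
assert y.shape == x.shape
assert np.allclose(y[:, 16:], x[:, 16:])   # unrotated dims untouched
assert np.allclose(y[0, :16], x[0, :16])   # position 0: zero angle, unchanged
```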
SmearGate
SmearGate gating mechanism.
parameters: null
LeakyReLU
MLP activation uses a squared-LeakyReLU variant.
parameters: {"slope":0.5}
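A sketch of the activation as described; the slope of 0.5 is from the parameters, but the exact squaring convention (plain square, as below, vs. sign-preserving) is an assumption.

```python
import numpy as np

SLOPE = 0.5   # from parameters

def leaky_relu_squared(x):
    """LeakyReLU with slope 0.5, then squared (squaring convention assumed)."""
    y = np.where(x >= 0, x, SLOPE * x)
    return y * y

assert leaky_relu_squared(np.array([2.0]))[0] == 4.0    # 2 -> 2^2
assert leaky_relu_squared(np.array([-2.0]))[0] == 1.0   # -2 -> (-1)^2
```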
FlowRefiner
Single-step latent-space flow-matching refiner applied after final LayerNorm before lm_head.
parameters: {"latent_dim":64,"hidden_dim":256}
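One plausible reading of the FlowRefiner description, as a sketch: project the post-LayerNorm hidden state into a 64-d latent, predict a velocity with a small MLP, take a single Euler step, and add the result back as a residual before lm_head. Only latent_dim=64 and hidden_dim=256 come from the parameters; the model width and the exact wiring are guesses.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, LATENT, HIDDEN = 256, 64, 256  # LATENT/HIDDEN from parameters; D_MODEL assumed

# Hypothetical refiner weights (wiring inferred from the one-line description).
W_down = rng.standard_normal((D_MODEL, LATENT)) * 0.02
W1 = rng.standard_normal((LATENT, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, LATENT)) * 0.02
W_up = rng.standard_normal((LATENT, D_MODEL)) * 0.02

def flow_refine(h):
    """Single-step latent flow refiner applied after the final LayerNorm."""
    z = h @ W_down                        # to 64-d latent
    v = np.maximum(z @ W1, 0.0) @ W2      # small MLP predicts a velocity
    return h + v @ W_up                   # one Euler step, residual output

h = rng.standard_normal((4, D_MODEL))
assert flow_refine(h).shape == (4, D_MODEL)
```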
Quantization
mixed int6/int8
bits: null
scope: per-row weights
late QAT (quantization-aware training enabled late in the run)
bits: null
scope: all
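Per-row weight quantization can be sketched as symmetric rounding with one floating-point scale per weight row. The PR mixes int6 and int8 across tensors; the bit-width choice per tensor is not specified, so it is a parameter here.

```python
import numpy as np

def quantize_per_row(W, bits=8):
    """Symmetric per-row quantization: one fp scale per weight row.
    bits=8 or bits=6 per tensor; the assignment rule is not stated in the PR."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 31 for int6
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
q, s = quantize_per_row(W, bits=8)
err = np.abs(dequantize(q, s) - W).max()
assert err <= s.max()    # reconstruction error bounded by one quant step
```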
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_start_step":6350}
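A scalar sketch of the two averages, using the stated decay and start step. How the EMA and SWA are blended for the final checkpoint is not specified in the PR; this just maintains both side by side.

```python
EMA_DECAY = 0.997      # from parameters
SWA_START = 6350       # from parameters

class AveragedWeights:
    """Maintain an EMA and an equal-weight tail average (SWA) of a scalar."""
    def __init__(self, w0):
        self.ema = w0
        self.swa, self.n = 0.0, 0

    def update(self, w, step):
        self.ema = EMA_DECAY * self.ema + (1 - EMA_DECAY) * w
        if step >= SWA_START:                   # SWA only after the start step
            self.n += 1
            self.swa += (w - self.swa) / self.n

avg = AveragedWeights(0.0)
for step in range(6349, 6353):
    avg.update(1.0, step)
assert avg.n == 3                # steps 6350, 6351, 6352 enter the SWA
assert abs(avg.swa - 1.0) < 1e-12
```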
Compression
lzma
level: 6
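The artifact compression is standard-library lzma at the stated preset; a round-trip sketch (the payload below is a stand-in for the quantized weights):

```python
import lzma

payload = bytes(1000)                      # stand-in for the serialized weights
packed = lzma.compress(payload, preset=6)  # preset 6 per the Compression entry
assert lzma.decompress(packed) == payload  # lossless round-trip
```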
Evaluation
sliding window eval
parameters: {"stride":64}
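Sliding-window evaluation with stride 64 can be sketched as span generation: each step advances by the stride and scores only the newly uncovered tokens, with up to a full window of left context. The window size of 2048 below is assumed to match train_length; only the stride is stated.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Spans for sliding-window eval: (context_start, end, n_scored) per step.
    Each token is scored exactly once, with up to `window` tokens of context."""
    spans = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, end - begin))
    return spans

spans = sliding_window_spans(200, window=128, stride=64)
assert sum(s[2] for s in spans) == 200   # every token scored exactly once
```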
Test-Time Training
TTT-Linear
parameters: {"heads":8,"mini_batch":16,"learning_rate":1}
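A heavily simplified, single-head sketch of the TTT-Linear idea: the layer's fast weight is a linear model trained at test time with one gradient step per mini-batch of 16 tokens at the stated learning rate of 1. The real layer runs 8 such heads and uses learned train/label views; plain self-reconstruction is an assumption here.

```python
import numpy as np

HEADS, MINI_BATCH, TTT_LR = 8, 16, 1.0   # from parameters (one of 8 heads shown)

def ttt_linear_fast_weight(X, d):
    """One SGD step per mini-batch on the self-supervised loss ||x W - x||^2.
    Returns the fast weight W adapted to this sequence at test time."""
    W = np.zeros((d, d))                 # fast weight, reset per sequence
    for i in range(0, len(X), MINI_BATCH):
        xb = X[i:i + MINI_BATCH]
        grad = 2.0 * xb.T @ (xb @ W - xb) / len(xb)
        W -= TTT_LR * grad               # inner-loop step at learning_rate = 1
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8)) * 0.1   # small scale keeps lr=1 stable
W = ttt_linear_fast_weight(X, 8)
assert np.linalg.norm(X @ W - X) < np.linalg.norm(X)   # test-time loss dropped
```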
LR Schedule
warmdown
parameters: {"warmdown_steps":4311}
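The warmdown schedule can be sketched as a held base learning rate followed by a decay to zero over the final 4311 steps; the linear decay shape is an assumption (only the step count is stated).

```python
WARMDOWN_STEPS = 4311   # from parameters

def lr_at(step, total_steps, base_lr):
    """Hold base_lr, then decay linearly to 0 over the last WARMDOWN_STEPS."""
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / WARMDOWN_STEPS

TOTAL = 10000            # illustrative run length, not from the PR
assert lr_at(0, TOTAL, 0.02) == 0.02       # full LR before warmdown
assert lr_at(TOTAL, TOTAL, 0.02) == 0.0    # zero at the final step
```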
Regularization
LN scale
parameters: null
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- 10-layer rule-compliant submission that fits under the 16 MB artifact cap
- End-to-end TTT-Linear refinement combined with a 1-step FlowRefiner
- FlowRefiner adapted from flow-matching ideas into a tiny hidden-state refiner
- Three-variant size-quality comparison showing tradeoffs between depth, quantization, and budget
- Prior 11-layer ablation study suggesting synergy between TTT-Linear and FlowRefiner