PR #254
openRecord: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303
by timowhite88
val_bpb: 1.1303
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB
Training Techniques
Quantization
mixed int6/int8
bits: 6
scope: MLP+attention; embeddings int8; tied embeddings fp16
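A minimal sketch of symmetric per-tensor int6 quantization, the scheme the record applies to MLP and attention weights. The clipping range [-31, 31] and per-tensor scaling are assumptions; the record does not specify the grouping granularity.

```python
def quantize_int6(weights):
    # symmetric int6: integer levels in [-31, 31] (assumed range,
    # reserving -32 to keep the grid symmetric around zero)
    qmax = 31
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)
```

Each dequantized weight lands within half a quantization step of the original, which is what keeps the bpb hit small at 6 bits.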
Architecture
MLP3x
3x expansion MLP with ReLU² activation in an 11-layer transformer
parameters: {"layers":11,"hidden_dim":1536,"heads":8,"kv_heads":4}
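A toy sketch of one MLP3x block: a 3x-expansion feed-forward layer with ReLU² activation. Dimensions are shrunk from the record's 1536→4608, and the absence of biases is an assumption.

```python
import random

def relu2(x):
    # ReLU squared: max(0, x)^2
    return max(0.0, x) ** 2

def mlp3x(x, w_in, w_out):
    # 3x expansion: hidden width = 3 * d_model (1536 -> 4608 in the record)
    h = [relu2(sum(wi * xi for wi, xi in zip(row, x))) for row in w_in]
    return [sum(wo * hi for wo, hi in zip(row, h)) for row in w_out]

d, hidden = 4, 12  # toy dims standing in for 1536 and 4608
random.seed(0)
w_in = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
w_out = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]
y = mlp3x([1.0, -0.5, 0.25, 0.0], w_in, w_out)
```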
SmearGate
Learned sigmoid token blending gate
parameters: {"params":512}
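A minimal sketch of the smear-gate idea: each token's embedding is blended with the previous token's through a learned sigmoid gate. A single scalar gate is used here for clarity; the 512 learned parameters suggest a per-channel gate in the real model, which is an assumption.

```python
import math

def smear_gate(tokens, gate_logit):
    # blend each token's embedding with the previous token's embedding,
    # weighted by a learned sigmoid gate (per-channel in the real model)
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [tokens[0][:]]  # first token has no predecessor to smear from
    for prev, cur in zip(tokens, tokens[1:]):
        out.append([(1 - g) * c + g * p for p, c in zip(prev, cur)])
    return out

toks = [[1.0, 0.0], [0.0, 1.0]]
blended = smear_gate(toks, gate_logit=0.0)  # sigmoid(0) = 0.5 blend
```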
BigramHash
2048-bucket hash embedding for token-pair features
parameters: {"buckets":2048,"dim":128}
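The bigram feature reduces to hashing each (previous, current) token-id pair into one of 2048 buckets, each indexing a 128-dim embedding row added to the token's features. The mixing constant below is illustrative, not the record's actual hash.

```python
def bigram_bucket(prev_id, cur_id, buckets=2048):
    # hash the (previous, current) token pair into a bucket;
    # 1000003 is an illustrative mixing prime, not the record's choice
    return ((prev_id * 1000003) ^ cur_id) % buckets
```

Order sensitivity matters: the pair (a, b) should land in a different bucket than (b, a), which the asymmetric multiply provides.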
RoPE
NTK-RoPE for long-context extrapolation
parameters: {"base":50000}
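NTK-style RoPE is just standard RoPE with a raised base (10000 → 50000 here), so the rotation frequencies decay more slowly across the head dimension and positions extrapolate further. A sketch of the frequency table and a single 2-D rotation:

```python
import math

def rope_freqs(head_dim, base=50000.0):
    # raising the base from the usual 10000 slows the per-pair rotation,
    # extending the usable context length
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def rotate_pair(x, y, pos, freq):
    # rotate one (x, y) feature pair by angle pos * freq
    theta = pos * freq
    c, s = math.cos(theta), math.sin(theta)
    return x * c - y * s, x * s + y * c
```

The rotation is norm-preserving, which is why RoPE leaves attention logit magnitudes unchanged.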
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_steps":1500,"warmdown_steps":3000}
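Muon's core step orthogonalizes each weight matrix's momentum buffer before applying it. The sketch below uses the classic cubic Newton-Schulz iteration for the orthogonalization; the actual Muon implementation uses a tuned quintic polynomial, so this simpler variant is an assumption for illustration only.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthogonalize(G, steps=10):
    # cubic Newton-Schulz: X <- 1.5*X - 0.5*(X X^T) X, which drives
    # singular values toward 1 (converges for values in (0, sqrt(3)))
    X = [row[:] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(r1, r2)]
             for r1, r2 in zip(X, XXtX)]
    return X

M = [[0.9, 0.0], [0.0, 0.5]]  # toy momentum buffer
U = orthogonalize(M)          # the orthogonalized update direction
```

The update applied to the weights is then `-lr * U`, with momentum 0.99 and weight decay 0.04 per the record.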
Weight Averaging
SWA
parameters: {"checkpoints_averaged":7,"phase":"warmdown"}
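The warmdown-phase SWA amounts to a uniform average of the last saved checkpoints (7 here). A minimal sketch over flat weight vectors; the real averaging runs per-tensor over the model state dict.

```python
def average_checkpoints(checkpoints):
    # uniform average of k checkpoint weight vectors (k = 7 in the record)
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]

avg = average_checkpoints([[1.0, 2.0], [3.0, 4.0]])
```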
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
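With stride 64 and context 2048, each chunk of 64 tokens is scored conditioned on up to 1984 preceding tokens. A sketch of the window bookkeeping (tail tokens beyond the last full stride are skipped here for simplicity):

```python
def sliding_eval_spans(n_tokens, context=2048, stride=64):
    # score tokens in chunks of `stride`; each chunk sees up to
    # `context - stride` preceding tokens as conditioning context
    spans = []
    for end in range(stride, n_tokens + 1, stride):
        start = max(0, end - context)
        spans.append((start, end - stride, end))  # (window start, score from, score to)
    return spans
```

Every token is scored exactly once, so the per-token losses sum to a proper bpb over the eval set.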
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"freezing_first_blocks":2}
Initialization
OrthoInit
Orthogonal initialization combined with muP
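A Gram-Schmidt sketch of orthogonal initialization: orthonormalize the rows of a random Gaussian matrix. The muP part (rescaling by fan-in so learning rates transfer across widths) is omitted here, and the square-matrix restriction is a simplification.

```python
import math
import random

def orthogonal_init(n, seed=0):
    # Gram-Schmidt on random Gaussian rows -> orthonormal matrix;
    # muP would then rescale by fan-in (omitted in this sketch)
    random.seed(seed)
    rows = []
    for _ in range(n):
        v = [random.gauss(0, 1) for _ in range(n)]
        for u in rows:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

W = orthogonal_init(4)
```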
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
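The schedule is trapezoidal: linear warmup over 1500 steps, a flat plateau, then linear warmdown over the final 3000 steps (the phase where SWA checkpoints are collected). A sketch of the multiplier:

```python
def lr_multiplier(step, total_steps, warmup=1500, warmdown=3000):
    # linear warmup to 1.0, flat plateau, linear warmdown to 0
    if step < warmup:
        return step / warmup
    if step > total_steps - warmdown:
        return max(0.0, (total_steps - step) / warmdown)
    return 1.0
```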
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
FlashAttention 3 used for attention computation
parameters: {"hardware":"Hopper"}
Novel Contributions
- Test-time training (TTT) with full-weight SGD adaptation on validation data before scoring
- 11-layer MLP3x transformer architecture with ReLU² activation
- Mixed int6/int8 quantization with fp16 tied embeddings
- SmearGate learned token blending gate
- BigramHash token-pair feature embeddings
- SWA checkpoint averaging during warmdown
- NTK-RoPE for long-context extrapolation