PR #254 (open)

Record: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303

by timowhite88
val_bpb: 1.1303
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: MLP+attention; embeddings int8; tied embeddings fp16
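The record does not specify the quantizer itself; a minimal sketch, assuming symmetric per-tensor round-to-nearest quantization (scale granularity and rounding mode are assumptions):

```python
import numpy as np

def quantize_symmetric(w, bits=6):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    bits=6 for MLP+attention weights, bits=8 for embeddings per the record."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax      # one scale per tensor (assumption)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

With round-to-nearest, the per-weight reconstruction error is at most half the scale.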
Architecture
MLP3x
3x expansion MLP with ReLU² activation in an 11-layer transformer
parameters: {"layers":11,"hidden_dim":1536,"heads":8,"kv_heads":4}
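The MLP3x block itself is straightforward; a sketch (biases omitted, which is an assumption), with the expansion 1536 → 4608 → 1536 at the record's hidden_dim:

```python
import numpy as np

def relu2(x):
    # ReLU^2: squared rectifier used here in place of GELU
    return np.square(np.maximum(x, 0.0))

def mlp3x(x, w_in, w_out):
    """3x-expansion MLP: hidden_dim -> 3*hidden_dim -> hidden_dim."""
    return relu2(x @ w_in) @ w_out
```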
SmearGate
Learned sigmoid token blending gate
parameters: {"params":512}
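The record gives only the name and parameter count; one plausible reading of "learned sigmoid token blending gate" is a per-channel gate that mixes each token's embedding with its predecessor's. The formulation below is hypothetical:

```python
import numpy as np

def smear_gate(x, g):
    """Blend each token with its predecessor: x_t + sigmoid(g) * x_{t-1}.

    x: (seq, dim) token embeddings; g: (dim,) learned gate logits.
    This is an assumed formulation; the record does not spell it out.
    """
    gate = 1.0 / (1.0 + np.exp(-g))
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]                 # token 0 has no predecessor
    return x + gate * prev
```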
BigramHash
2048-bucket hash embedding for token-pair features
parameters: {"buckets":2048,"dim":128}
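A sketch of the bucket lookup, assuming a simple multiplicative hash of the (previous, current) token-id pair; the mixing constant and the padding of position 0 are illustrative assumptions:

```python
import numpy as np

def bigram_bucket(prev_id, cur_id, buckets=2048):
    """Hash a (prev, cur) token pair into one of `buckets` buckets.
    The hash function itself is not given in the record."""
    return (prev_id * 1_000_003 + cur_id) % buckets

def bigram_features(token_ids, table):
    """Look up a (buckets, dim) learned embedding table for each pair."""
    prev = [0] + list(token_ids[:-1])   # pad position 0 with id 0 (assumption)
    ids = [bigram_bucket(p, c, table.shape[0]) for p, c in zip(prev, token_ids)]
    return table[np.array(ids)]
```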
RoPE
NTK-RoPE for long-context extrapolation
parameters: {"base":50000}
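Standard rotary embeddings, with the base raised to 50000 (versus the common 10000) to stretch the rotation wavelengths for long-context extrapolation, NTK-style. A per-head sketch:

```python
import numpy as np

def rope_rotate(x, base=50000.0):
    """Apply rotary position embeddings to x of shape (seq, head_dim).
    base=50000 is the record's NTK-style long-context setting."""
    seq, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Rotation leaves vector norms intact and position 0 unrotated, which makes both easy to sanity-check.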
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_steps":1500,"warmdown_steps":3000}
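Muon's distinguishing step is orthogonalizing the momentum matrix with a Newton-Schulz iteration before applying the update. A sketch of that step, with the quintic coefficients used in the public Muon implementation (the momentum/weight-decay bookkeeping around it is omitted):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a momentum matrix, as in Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius-normalize so SVs <= 1
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if tall else x
```

After a few iterations the singular values are driven toward 1, so the update direction depends on the gradient's row/column spaces rather than its magnitudes.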
Weight Averaging
SWA
parameters: {"checkpoints_averaged":7,"phase":"warmdown"}
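SWA here is an equal-weight average of checkpoints taken during the warmdown phase (7 of them per the record). The averaging itself is one line:

```python
def average_checkpoints(checkpoints):
    """Equal-weight average of parameter dicts (SWA over warmdown checkpoints)."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```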
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
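With stride 64 and a 2048-token context, each window after the first advances by 64 tokens and scores only its final 64, so every scored token sees close to the full context. A sketch of the span bookkeeping (assuming the first window scores all of its tokens):

```python
def sliding_eval_spans(n_tokens, context_length=2048, stride=64):
    """Yield (ctx_start, ctx_end, score_start) spans for sliding-window eval.
    Tokens in [score_start, ctx_end) are scored given context [ctx_start, ctx_end)."""
    spans = []
    pos = 0                              # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + (stride if pos else context_length), n_tokens)
        start = max(0, end - context_length)
        spans.append((start, end, pos))
        pos = end
    return spans
```

Every token is scored exactly once, which keeps the bpb accounting consistent with a single pass.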
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"freezing_first_blocks":2}
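The TTT inner update is plain SGD with momentum over all weights except the first two blocks. A schematic of one step, using the record's hyperparameters; the surrounding loss computation and 3-epoch loop over the eval stream are omitted:

```python
def ttt_sgd_step(params, grads, velocity, lr=0.002, momentum=0.9, frozen=()):
    """One SGD+momentum update over named parameters, skipping frozen ones.
    Mirrors the record's TTT settings (lr=0.002, momentum=0.9,
    first 2 transformer blocks frozen); model and loss are out of scope.
    """
    for name, p in params.items():
        if any(name.startswith(f) for f in frozen):
            continue
        velocity[name] = momentum * velocity[name] + grads[name]
        params[name] = p - lr * velocity[name]
    return params, velocity
```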
Initialization
OrthoInit
Orthogonal initialization combined with muP
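A sketch of the orthogonal part via QR decomposition; muP would additionally rescale each layer by a fan-in-dependent gain, but the record does not give the exact scaling, so `gain` is left as a free parameter:

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=0):
    """Orthogonal initialization via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # sign fix for a uniform (Haar) draw
    if rows < cols:
        q = q.T
    return gain * q
```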
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
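The schedule is trapezoidal: linear warmup over 1500 steps, a constant plateau, then linear warmdown over the final 3000 steps. A sketch (`base_lr` and `total_steps` are illustrative; only the warmup/warmdown step counts come from the record):

```python
def lr_at(step, total_steps, base_lr, warmup_steps=1500, warmdown_steps=3000):
    """Trapezoidal LR: linear warmup, constant plateau, linear warmdown."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    if step >= total_steps - warmdown_steps:
        return base_lr * max(0, total_steps - step) / warmdown_steps
    return base_lr
```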
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
FlashAttention 3 used for attention computation
parameters: {"hardware":"Hopper"}

Novel Contributions

  • Test-time training (TTT) with full-weight SGD adaptation on validation data before scoring
  • 11-layer MLP3x transformer architecture with ReLU² activation
  • Mixed int6/int8 quantization with fp16 tied embeddings
  • SmearGate learned token blending gate
  • BigramHash token-pair feature embeddings
  • SWA checkpoint averaging during warmdown
  • NTK-RoPE for long-context extrapolation