PR #254 (open)

Record: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303

by timowhite88
val_bpb: 1.1303
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: MLP+attention; embeddings int8; tied embeddings fp16
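The record does not specify the quantizer itself; a minimal sketch, assuming symmetric per-tensor round-to-nearest quantization (scale granularity and rounding mode are assumptions):

```python
import numpy as np

def quantize_symmetric(w, bits=6):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    bits=6 for MLP+attention weights, bits=8 for embeddings per the record."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax      # one scale per tensor (assumption)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

With round-to-nearest, the per-weight reconstruction error is at most half the scale.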
Architecture
MLP3x
3x expansion MLP with ReLU² activation in an 11-layer transformer
parameters: {"layers":11,"hidden_dim":1536,"heads":8,"kv_heads":4}
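The MLP3x block itself is straightforward; a sketch (biases omitted, which is an assumption), with the expansion 1536 → 4608 → 1536 at the record's hidden_dim:

```python
import numpy as np

def relu2(x):
    # ReLU^2: squared rectifier used here in place of GELU
    return np.square(np.maximum(x, 0.0))

def mlp3x(x, w_in, w_out):
    """3x-expansion MLP: hidden_dim -> 3*hidden_dim -> hidden_dim."""
    return relu2(x @ w_in) @ w_out
```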
SmearGate
Learned sigmoid token blending gate
parameters: {"params":512}
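The record gives only the name and parameter count; one plausible reading of "learned sigmoid token blending gate" is a per-channel gate that mixes each token's embedding with its predecessor's. The formulation below is hypothetical:

```python
import numpy as np

def smear_gate(x, g):
    """Blend each token with its predecessor: x_t + sigmoid(g) * x_{t-1}.

    x: (seq, dim) token embeddings; g: (dim,) learned gate logits.
    This is an assumed formulation; the record does not spell it out.
    """
    gate = 1.0 / (1.0 + np.exp(-g))
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]                 # token 0 has no predecessor
    return x + gate * prev
```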
BigramHash
2048-bucket hash embedding for token-pair features
parameters: {"buckets":2048,"dim":128}
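A sketch of the bucket lookup, assuming a simple multiplicative hash of the (previous, current) token-id pair; the mixing constant and the padding of position 0 are illustrative assumptions:

```python
import numpy as np

def bigram_bucket(prev_id, cur_id, buckets=2048):
    """Hash a (prev, cur) token pair into one of `buckets` buckets.
    The hash function itself is not given in the record."""
    return (prev_id * 1_000_003 + cur_id) % buckets

def bigram_features(token_ids, table):
    """Look up a (buckets, dim) learned embedding table for each pair."""
    prev = [0] + list(token_ids[:-1])   # pad position 0 with id 0 (assumption)
    ids = [bigram_bucket(p, c, table.shape[0]) for p, c in zip(prev, token_ids)]
    return table[np.array(ids)]
```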
RoPE
NTK-RoPE for long-context extrapolation
parameters: {"base":50000}
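Standard rotary embeddings, with the base raised to 50000 (versus the common 10000) to stretch the rotation wavelengths for long-context extrapolation, NTK-style. A per-head sketch:

```python
import numpy as np

def rope_rotate(x, base=50000.0):
    """Apply rotary position embeddings to x of shape (seq, head_dim).
    base=50000 is the record's NTK-style long-context setting."""
    seq, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Rotation leaves vector norms intact and position 0 unrotated, which makes both easy to sanity-check.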
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_steps":1500,"warmdown_steps":3000}
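Muon's distinguishing step is orthogonalizing the momentum matrix with a Newton-Schulz iteration before applying the update. A sketch of that step, with the quintic coefficients used in the public Muon implementation (the momentum/weight-decay bookkeeping around it is omitted):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a momentum matrix, as in Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius-normalize so SVs <= 1
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if tall else x
```

After a few iterations the singular values are driven toward 1, so the update direction depends on the gradient's row/column spaces rather than its magnitudes.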
Weight Averaging
SWA
parameters: {"checkpoints_averaged":7,"phase":"warmdown"}
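SWA here is an equal-weight average of checkpoints taken during the warmdown phase (7 of them per the record). The averaging itself is one line:

```python
def average_checkpoints(checkpoints):
    """Equal-weight average of parameter dicts (SWA over warmdown checkpoints)."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```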
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
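With stride 64 and a 2048-token context, each window after the first advances by 64 tokens and scores only its final 64, so every scored token sees close to the full context. A sketch of the span bookkeeping (assuming the first window scores all of its tokens):

```python
def sliding_eval_spans(n_tokens, context_length=2048, stride=64):
    """Yield (ctx_start, ctx_end, score_start) spans for sliding-window eval.
    Tokens in [score_start, ctx_end) are scored given context [ctx_start, ctx_end)."""
    spans = []
    pos = 0                              # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + (stride if pos else context_length), n_tokens)
        start = max(0, end - context_length)
        spans.append((start, end, pos))
        pos = end
    return spans
```

Every token is scored exactly once, which keeps the bpb accounting consistent with a single pass.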
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"freezing_first_blocks":2}
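The TTT inner update is plain SGD with momentum over all weights except the first two blocks. A schematic of one step, using the record's hyperparameters; the surrounding loss computation and 3-epoch loop over the eval stream are omitted:

```python
def ttt_sgd_step(params, grads, velocity, lr=0.002, momentum=0.9, frozen=()):
    """One SGD+momentum update over named parameters, skipping frozen ones.
    Mirrors the record's TTT settings (lr=0.002, momentum=0.9,
    first 2 transformer blocks frozen); model and loss are out of scope.
    """
    for name, p in params.items():
        if any(name.startswith(f) for f in frozen):
            continue
        velocity[name] = momentum * velocity[name] + grads[name]
        params[name] = p - lr * velocity[name]
    return params, velocity
```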
Initialization
OrthoInit
Orthogonal initialization combined with muP
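A sketch of the orthogonal part via QR decomposition; muP would additionally rescale each layer by a fan-in-dependent gain, but the record does not give the exact scaling, so `gain` is left as a free parameter:

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=0):
    """Orthogonal initialization via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # sign fix for a uniform (Haar) draw
    if rows < cols:
        q = q.T
    return gain * q
```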
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
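The schedule is trapezoidal: linear warmup over 1500 steps, a constant plateau, then linear warmdown over the final 3000 steps. A sketch (`base_lr` and `total_steps` are illustrative; only the warmup/warmdown step counts come from the record):

```python
def lr_at(step, total_steps, base_lr, warmup_steps=1500, warmdown_steps=3000):
    """Trapezoidal LR: linear warmup, constant plateau, linear warmdown."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    if step >= total_steps - warmdown_steps:
        return base_lr * max(0, total_steps - step) / warmdown_steps
    return base_lr
```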
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
FlashAttention 3 used for attention computation
parameters: {"hardware":"Hopper"}

Novel Contributions

  • Test-time training (TTT) with full-weight SGD adaptation on validation data before scoring
  • 11-layer MLP3x transformer architecture with ReLU² activation
  • Mixed int6/int8 quantization with fp16 tied embeddings
  • SmearGate learned token blending gate
  • BigramHash token-pair feature embeddings
  • SWA checkpoint averaging during warmdown
  • NTK-RoPE for long-context extrapolation