PR #1166
open
Non-record: 10L E2E TTT-Linear + FlowRefiner — val_bpb 1.1347 (README request)
by Christopher-Lee-McClendon
val_bpb
1.1347
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.20 MB
Training Techniques
Architecture
weight tying
Token embedding and output (lm_head) weights are shared.
parameters: null
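A minimal sketch of what the tied-embeddings entry describes (sizes here are illustrative, not the model's): one table serves as both the input lookup and the output projection, so no separate lm_head matrix is stored in the artifact.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL = 1000, 64          # illustrative sizes, not the model's
E = rng.standard_normal((VOCAB, D_MODEL)) * 0.02   # single shared table

def embed(token_ids):
    return E[token_ids]            # input side: row lookup in E

def lm_head(hidden):
    return hidden @ E.T            # output side: project with the same E

logits = lm_head(embed(np.array([1, 2, 3])))
assert logits.shape == (3, VOCAB)  # one logit per vocab entry, no extra matrix
```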
BigramHash
Bigram hash embedding component.
parameters: {"dimensions":1536}
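A hedged sketch of a bigram hash embedding: each (previous token, current token) pair is hashed into a bucket whose learned vector is added to the input. Only the 1536-dimensional vector size comes from the parameters; the bucket count and hash function below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
TABLE_SIZE = 4096    # hypothetical bucket count; not stated in the PR
DIM = 1536           # from parameters
table = rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32) * 0.02

def bigram_hash_embed(tokens):
    """Hash each (prev, cur) token pair into a bucket and look up its vector."""
    prev = np.concatenate(([0], tokens[:-1]))          # pad position 0
    buckets = (prev * 1000003 + tokens) % TABLE_SIZE   # simple mixing hash
    return table[buckets]                              # (seq, DIM)

out = bigram_hash_embed(np.array([5, 9, 9, 2]))
assert out.shape == (4, DIM)
```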
XSA
Exclusive self-attention used in the last layers.
parameters: {"layers":4}
U-Net skip connections
Skip connections inspired by U-Net.
parameters: null
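The U-Net-style wiring can be sketched as a push/pop over the layer stack: the first half of the blocks record their outputs, and the mirrored second-half blocks add them back in (toy scalar "layers" stand in for transformer blocks here).

```python
def unet_forward(x, layers):
    """U-Net-style skip wiring (sketch): the first half of the stack pushes
    outputs onto a stack; the second half pops and adds them back in."""
    skips = []
    half = len(layers) // 2
    for layer in layers[:half]:
        x = layer(x)
        skips.append(x)
    for layer in layers[half:]:
        x = x + skips.pop()        # skip from the mirrored early layer
        x = layer(x)
    return x

layers = [lambda v: v + 1] * 4     # toy scalar "layers"
assert unet_forward(0, layers) == 7
```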
VE128
Value embeddings with 128-dimensional vectors.
parameters: {"dimensions":128}
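A speculative sketch of value embeddings: a per-token learned vector mixed into the attention values. Only the 128-dimensional size is stated; the vocab size, the gate, and the assumption that the value head dimension equals 128 are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, VE_DIM = 1000, 128     # VE_DIM from parameters; vocab size illustrative
ve_table = rng.standard_normal((VOCAB, VE_DIM)).astype(np.float32) * 0.02

def add_value_embeddings(v, tokens, gate=0.5):
    """Mix a per-token learned vector into the attention values.
    The gate and the direct dim match are assumptions, not from the PR."""
    return v + gate * ve_table[tokens]

v = rng.standard_normal((6, VE_DIM)).astype(np.float32)
out = add_value_embeddings(v, np.array([3, 1, 4, 1, 5, 9]))
assert out.shape == v.shape
```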
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
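Partial RoPE rotates only a leading slice of each head's dimensions and passes the rest through unchanged. The 16 rotated dimensions come from the parameters; the rotary base of 10000 and the head dimension below are assumptions.

```python
import numpy as np

ROT_DIMS = 16  # from parameters: only the first 16 dims per head are rotated

def partial_rope(x, positions):
    """Apply rotary embedding to the first ROT_DIMS dims of one head;
    pass the remaining dims through. x: (seq, head_dim). Base is assumed."""
    half = ROT_DIMS // 2
    inv_freq = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = positions[:, None] * inv_freq[None, :]          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT_DIMS]                # rotated pair
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT_DIMS:]], axis=-1)

x = np.random.default_rng(0).standard_normal((8, 64))
y = partial_rope(x, np.arange(8))
assert y.shape == x.shape
assert np.allclose(y[:, 16:], x[:, 16:])   # unrotated dims untouched
assert np.allclose(y[0, :16], x[0, :16])   # position 0: zero angle, unchanged
```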
SmearGate
SmearGate gating mechanism.
parameters: null
LeakyReLU
MLP activation uses a squared-LeakyReLU variant.
parameters: {"slope":0.5}
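A sketch of the activation as described; the slope of 0.5 is from the parameters, but the exact squaring convention (plain square, as below, vs. sign-preserving) is an assumption.

```python
import numpy as np

SLOPE = 0.5   # from parameters

def leaky_relu_squared(x):
    """LeakyReLU with slope 0.5, then squared (squaring convention assumed)."""
    y = np.where(x >= 0, x, SLOPE * x)
    return y * y

assert leaky_relu_squared(np.array([2.0]))[0] == 4.0    # 2 -> 2^2
assert leaky_relu_squared(np.array([-2.0]))[0] == 1.0   # -2 -> (-1)^2
```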
FlowRefiner
Single-step latent-space flow-matching refiner applied after final LayerNorm before lm_head.
parameters: {"latent_dim":64,"hidden_dim":256}
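One plausible reading of the FlowRefiner description, as a sketch: project the post-LayerNorm hidden state into a 64-d latent, predict a velocity with a small MLP, take a single Euler step, and add the result back as a residual before lm_head. Only latent_dim=64 and hidden_dim=256 come from the parameters; the model width and the exact wiring are guesses.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, LATENT, HIDDEN = 256, 64, 256  # LATENT/HIDDEN from parameters; D_MODEL assumed

# Hypothetical refiner weights (wiring inferred from the one-line description).
W_down = rng.standard_normal((D_MODEL, LATENT)) * 0.02
W1 = rng.standard_normal((LATENT, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, LATENT)) * 0.02
W_up = rng.standard_normal((LATENT, D_MODEL)) * 0.02

def flow_refine(h):
    """Single-step latent flow refiner applied after the final LayerNorm."""
    z = h @ W_down                        # to 64-d latent
    v = np.maximum(z @ W1, 0.0) @ W2      # small MLP predicts a velocity
    return h + v @ W_up                   # one Euler step, residual output

h = rng.standard_normal((4, D_MODEL))
assert flow_refine(h).shape == (4, D_MODEL)
```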
Quantization
mixed int6/int8
bits: null
scope: per-row weights
late QAT (quantization-aware training enabled late in the run)
bits: null
scope: all
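Per-row weight quantization can be sketched as symmetric rounding with one floating-point scale per weight row. The PR mixes int6 and int8 across tensors; the bit-width choice per tensor is not specified, so it is a parameter here.

```python
import numpy as np

def quantize_per_row(W, bits=8):
    """Symmetric per-row quantization: one fp scale per weight row.
    bits=8 or bits=6 per tensor; the assignment rule is not stated in the PR."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 31 for int6
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
q, s = quantize_per_row(W, bits=8)
err = np.abs(dequantize(q, s) - W).max()
assert err <= s.max()    # reconstruction error bounded by one quant step
```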
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_start_step":6350}
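A scalar sketch of the two averages, using the stated decay and start step. How the EMA and SWA are blended for the final checkpoint is not specified in the PR; this just maintains both side by side.

```python
EMA_DECAY = 0.997      # from parameters
SWA_START = 6350       # from parameters

class AveragedWeights:
    """Maintain an EMA and an equal-weight tail average (SWA) of a scalar."""
    def __init__(self, w0):
        self.ema = w0
        self.swa, self.n = 0.0, 0

    def update(self, w, step):
        self.ema = EMA_DECAY * self.ema + (1 - EMA_DECAY) * w
        if step >= SWA_START:                   # SWA only after the start step
            self.n += 1
            self.swa += (w - self.swa) / self.n

avg = AveragedWeights(0.0)
for step in range(6349, 6353):
    avg.update(1.0, step)
assert avg.n == 3                # steps 6350, 6351, 6352 enter the SWA
assert abs(avg.swa - 1.0) < 1e-12
```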
Compression
lzma
level: 6
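The artifact compression is standard-library lzma at the stated preset; a round-trip sketch (the payload below is a stand-in for the quantized weights):

```python
import lzma

payload = bytes(1000)                      # stand-in for the serialized weights
packed = lzma.compress(payload, preset=6)  # preset 6 per the Compression entry
assert lzma.decompress(packed) == payload  # lossless round-trip
```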
Evaluation
sliding window eval
parameters: {"stride":64}
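Sliding-window evaluation with stride 64 can be sketched as span generation: each step advances by the stride and scores only the newly uncovered tokens, with up to a full window of left context. The window size of 2048 below is assumed to match train_length; only the stride is stated.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Spans for sliding-window eval: (context_start, end, n_scored) per step.
    Each token is scored exactly once, with up to `window` tokens of context."""
    spans = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, end - begin))
    return spans

spans = sliding_window_spans(200, window=128, stride=64)
assert sum(s[2] for s in spans) == 200   # every token scored exactly once
```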
Test-Time Training
TTT-Linear
parameters: {"heads":8,"mini_batch":16,"learning_rate":1}
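A heavily simplified, single-head sketch of the TTT-Linear idea: the layer's fast weight is a linear model trained at test time with one gradient step per mini-batch of 16 tokens at the stated learning rate of 1. The real layer runs 8 such heads and uses learned train/label views; plain self-reconstruction is an assumption here.

```python
import numpy as np

HEADS, MINI_BATCH, TTT_LR = 8, 16, 1.0   # from parameters (one of 8 heads shown)

def ttt_linear_fast_weight(X, d):
    """One SGD step per mini-batch on the self-supervised loss ||x W - x||^2.
    Returns the fast weight W adapted to this sequence at test time."""
    W = np.zeros((d, d))                 # fast weight, reset per sequence
    for i in range(0, len(X), MINI_BATCH):
        xb = X[i:i + MINI_BATCH]
        grad = 2.0 * xb.T @ (xb @ W - xb) / len(xb)
        W -= TTT_LR * grad               # inner-loop step at learning_rate = 1
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8)) * 0.1   # small scale keeps lr=1 stable
W = ttt_linear_fast_weight(X, 8)
assert np.linalg.norm(X @ W - X) < np.linalg.norm(X)   # test-time loss dropped
```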
LR Schedule
warmdown
parameters: {"warmdown_steps":4311}
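The warmdown schedule can be sketched as a held base learning rate followed by a decay to zero over the final 4311 steps; the linear decay shape is an assumption (only the step count is stated).

```python
WARMDOWN_STEPS = 4311   # from parameters

def lr_at(step, total_steps, base_lr):
    """Hold base_lr, then decay linearly to 0 over the last WARMDOWN_STEPS."""
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / WARMDOWN_STEPS

TOTAL = 10000            # illustrative run length, not from the PR
assert lr_at(0, TOTAL, 0.02) == 0.02       # full LR before warmdown
assert lr_at(TOTAL, TOTAL, 0.02) == 0.0    # zero at the final step
```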
Regularization
LN scale
parameters: null
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- 10-layer rule-compliant submission that fits under the 16 MB artifact cap
- End-to-end TTT-Linear refinement combined with a 1-step FlowRefiner
- FlowRefiner adapted from flow-matching ideas into a tiny hidden-state refiner
- Three-variant size-quality comparison showing tradeoffs between depth, quantization, and budget
- Prior 11-layer ablation study suggesting synergy between TTT-Linear and FlowRefiner