Summary
- val_bpb: 1.4120
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: 7.06 MB
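The headline metric, val_bpb (validation bits per byte), is a tokenizer-independent rescaling of per-token cross-entropy from nats to bits per UTF-8 byte of the underlying text. A minimal sketch of the conversion (the function name and the 1-token-per-byte example ratio are illustrative assumptions, not taken from this run):

```python
import math

def bits_per_byte(ce_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte.

    Total nats = ce_loss_nats * n_tokens; divide by ln(2) to get bits,
    then by the number of UTF-8 bytes the evaluated tokens decode to.
    """
    return ce_loss_nats * n_tokens / (math.log(2) * n_bytes)

# Illustration: a mean CE of ~0.9788 nats/token at 1 token per byte
# corresponds to roughly 1.412 bpb.
print(round(bits_per_byte(0.9788, 1000, 1000), 3))  # → 1.412
```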
Training Techniques
- Quantization: int8 (bits: 8, scope: all)
- Architecture: untied embeddings (embedding and output weights are not shared; TIE_EMBEDDINGS=0); number of key-value heads set to 2 (NUM_KV_HEADS=2)
- Optimizer: Muon (weight_decay: 0.04; momentum and other parameters unspecified)
- Evaluation: stride-based eval (stride: 64, eval_batch_seqs: 256)
- Test-Time Training: TTT (no parameters recorded)
- LR Schedule: warmdown (warmdown_iters: 300)
- Regularization: weight decay (weight_decay: 0.04)
Novel Contributions
- Widening the model from width 384 to 448 at 8 layers outperforms a deeper 9-layer, width-384 model
- Test-time training (TTT) provides modest improvements, but width scaling is the dominant factor
- Compact model scaling under a 16 MB artifact-size limit via int8 quantization and zlib compression
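The compact-model contribution combines int8 weight quantization with zlib compression of the serialized weights to stay under the artifact-size limit. A minimal sketch of symmetric per-tensor int8 quantization plus zlib, using NumPy (the symmetric scaling scheme and width-448 example tensor are assumptions, not necessarily this run's exact recipe):

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: w ≈ q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reconstructs exactly
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def compressed_size(w: np.ndarray) -> int:
    """Bytes occupied by the int8 tensor after zlib, as counted
    against an artifact-size budget."""
    q, _ = quantize_int8(w)
    return len(zlib.compress(q.tobytes(), 9))

# Example on a hypothetical width-448 weight matrix:
rng = np.random.default_rng(0)
w = rng.standard_normal((448, 448)).astype(np.float32)
q, scale = quantize_int8(w)
dequant = q.astype(np.float32) * scale  # reconstruction for inference
print(q.dtype, dequant.shape, compressed_size(w))
```

Per-tensor rounding error is bounded by half the quantization step (scale / 2), which is what keeps the bpb degradation small relative to the 4x size reduction over float32.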