PR #79

open

Depth Recurrence: 3x3x1024 (non-record, pending H100)

by Marvbuster
val_bpb: 1.8698
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7 MB

Training Techniques

Architecture
depth recurrence
3 unique transformer blocks are repeated 3 times for an effective depth of 9, reusing blocks across repeats without U-Net skip connections.
parameters: {"unique_blocks":3,"repeats":3,"effective_depth":9,"dim":1024}
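A minimal sketch of the recurrence, using a stand-in residual transform in place of a full transformer block (block internals and sizes other than dim=1024 are illustrative):

```python
import numpy as np

UNIQUE_BLOCKS, REPEATS, DIM = 3, 3, 1024

rng = np.random.default_rng(0)
# One weight matrix per unique block stands in for a transformer block.
blocks = [rng.standard_normal((DIM, DIM)) * 0.02 for _ in range(UNIQUE_BLOCKS)]

def block_forward(w, x):
    # Stand-in for a transformer block: residual + nonlinear transform.
    return x + np.tanh(x @ w)

def forward(x):
    applications = 0
    for _ in range(REPEATS):      # outer recurrence loop
        for w in blocks:          # the same 3 weight sets on every pass
            x = block_forward(w, x)
            applications += 1
    return x, applications

x = rng.standard_normal((4, DIM))
y, depth = forward(x)             # effective depth = 3 * 3 = 9
```

Only the 3 unique blocks carry parameters; the other 6 applications are free in terms of the artifact budget.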
tied embeddings
Input and output embeddings are tied to reduce parameters.
parameters: null
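Weight tying can be sketched as one shared matrix serving both the input lookup and the output head (the vocabulary size here is illustrative; only dim=1024 comes from the PR):

```python
import numpy as np

VOCAB, DIM = 1000, 1024     # VOCAB is an illustrative assumption
rng = np.random.default_rng(0)
embed = rng.standard_normal((VOCAB, DIM)) * 0.02   # single shared matrix

def embed_tokens(token_ids):
    return embed[token_ids]      # input side: row lookup

def lm_logits(hidden):
    return hidden @ embed.T      # output side: same matrix, transposed

h = embed_tokens(np.array([1, 2, 3]))
logits = lm_logits(h)
# One (VOCAB, DIM) matrix serves both roles instead of two.
```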
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":24,"kv_heads":12}
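A sketch of grouped-query attention with the PR's head counts; head dimension, sequence length, and the absence of batching are simplifying assumptions:

```python
import numpy as np

HEADS, KV_HEADS = 24, 12
GROUP = HEADS // KV_HEADS            # 2 query heads share each KV head
HEAD_DIM, SEQ = 64, 8                # illustrative sizes (assumptions)

rng = np.random.default_rng(0)
q = rng.standard_normal((HEADS, SEQ, HEAD_DIM))
k = rng.standard_normal((KV_HEADS, SEQ, HEAD_DIM))
v = rng.standard_normal((KV_HEADS, SEQ, HEAD_DIM))

# Expand K/V so each group of query heads reads the same KV head.
k_full = np.repeat(k, GROUP, axis=0)          # (24, SEQ, HEAD_DIM)
v_full = np.repeat(v, GROUP, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)     # softmax over keys
out = weights @ v_full                        # (24, SEQ, HEAD_DIM)
```

The K/V projections and cache store 12 heads instead of 24, halving those parameter and memory costs.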
Quantization
QAT
bits: 6
scope: all
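The forward pass of 6-bit QAT can be sketched as symmetric fake quantization (the per-tensor symmetric scheme is an assumption; the PR only records 6 bits applied to all weights, and the straight-through estimator used in the backward pass is not shown):

```python
import numpy as np

BITS = 6
QMAX = 2 ** (BITS - 1) - 1           # symmetric int6 grid: [-32, 31]

def fake_quantize(w):
    # Round weights to the int6 grid, then rescale back to float.
    # During QAT, gradients bypass the rounding (straight-through).
    scale = np.abs(w).max() / QMAX
    return np.clip(np.round(w / scale), -QMAX - 1, QMAX) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
wq = fake_quantize(w)                # at most 2**6 = 64 distinct levels
```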
Optimizer
Muon
weight_decay: null
momentum: 0.85
other_params: {"matrix_lr":0.02,"muon_backend_steps":7,"qk_gain_init":2,"qk_gain":2}
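A hedged sketch of a Muon-style step with these hyperparameters: heavy-ball momentum (0.85) followed by an approximately orthogonalized update scaled by matrix_lr=0.02. The quintic Newton-Schulz coefficients follow the public Muon implementation, and steps=7 mirrors muon_backend_steps; the NorMuon variant's differences are not modeled here.

```python
import numpy as np

def newton_schulz(g, steps=7):
    # Quintic Newton-Schulz iteration that approximately orthogonalizes
    # a matrix (drives its singular values toward 1).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x

def muon_step(w, grad, buf, lr=0.02, momentum=0.85):
    # Momentum buffer update, then orthogonalized weight update.
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(0)
w, g = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
w_new, buf = muon_step(w, g, np.zeros_like(g))
```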
Compression
zlib
level: null
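Artifact packing with Python's standard zlib module; the compression level was not recorded in the PR, so level=9 below is purely an illustrative choice:

```python
import zlib

payload = b"\x00\x01" * 4096          # stand-in for checkpoint bytes
packed = zlib.compress(payload, level=9)   # level=9 is an assumption
restored = zlib.decompress(packed)    # lossless round trip
```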
Evaluation
sliding window eval
parameters: {"stride":64}
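One way to plan stride-64 sliding-window evaluation spans; the window length of 256 is an assumption for illustration, since the PR only records the stride:

```python
def sliding_windows(n_tokens, window=256, stride=64):
    # Each span covers [start, end) and only positions in
    # [scored_from, end) contribute to the loss, so every token is
    # scored exactly once with up to window - stride tokens of
    # left context.
    spans, pos = [], 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(n_tokens, pos + stride)
        spans.append((start, end, pos))   # (start, end, scored_from)
        pos = end
    return spans

spans = sliding_windows(1000)
```

This trades extra forward passes (each window recomputes its context) for better per-token conditioning than chunked evaluation.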
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
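A warmdown schedule can be sketched as a constant LR followed by a decay to zero over the final 3000 steps; the linear decay shape is an assumption, as the PR only records warmdown_steps:

```python
def lr_multiplier(step, total_steps, warmdown_steps=3000):
    # 1.0 for most of training, then linear decay to 0 at the end.
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return max(0.0, steps_left / warmdown_steps)

# e.g. with 10000 total steps, decay begins at step 7000
```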
Other
other
NorMuon training variant used alongside Int6 QAT.
parameters: null

Novel Contributions

  • Depth recurrence with 3 unique transformer blocks repeated 3 times
  • Trading architectural diversity (fewer unique blocks) for width, allowing a larger model dimension (1024) within the parameter budget
  • Int6 QAT to increase parameter capacity within the 16MB artifact budget
  • Use of NorMuon, which reportedly improved BPB
  • Sliding window evaluation with stride 64
  • Systematic search over multiple architectural strategies and hyperparameters