PR #1435

open

Record: 11L Depth Recurrence + BigramHash + EMA 0.9965 — val_bpb 1.0980 (3-seed mean)

by AbhayAnandUCSD
val_bpb
1.0980
Architecture
Transformer
Optimizer
Muon
Artifact Size
~14.6 MB

Training Techniques

Architecture
depth recurrence
11 physical layers with layers 4 and 5 repeated once to create 13 virtual layers; recurrence activated at step 3000.
parameters: {"physical_layers":11,"virtual_layers":13,"repeat_layers":[4,5],"activate_step":3000}
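A minimal sketch of the recurrence schedule (the function name and exact activation logic are assumptions, not this PR's code):

```python
def virtual_schedule(physical_layers, repeat_layers, step, activate_step):
    """Sequence of physical-layer indices executed in one forward pass.

    Before `activate_step` the plain stack runs once; afterwards each layer
    in `repeat_layers` is run a second time with shared weights.
    """
    order = list(range(physical_layers))
    if step < activate_step:
        return order
    schedule = []
    for i in order:
        schedule.append(i)
        if i in repeat_layers:
            schedule.append(i)  # weight-shared second pass through this layer
    return schedule
```

With `physical_layers=11`, `repeat_layers=[4, 5]`, and `activate_step=3000`, the post-activation schedule has 13 entries, with layers 4 and 5 each appearing twice.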
BigramHash
Bigram hash embedding added on top of the recurrence base with SmearGate.
parameters: {"buckets":1536,"dim":112}
SmearGate
Gating mechanism used with BigramHash.
parameters: null
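A sketch of how a hashed bigram embedding with a scalar gate could look; the hash constants, initialization scale, and the exact SmearGate form are assumptions (the record does not spell them out):

```python
import random

class BigramHashEmbedding:
    """Hash the (previous, current) token pair into one of `buckets` rows."""

    def __init__(self, buckets=1536, dim=112, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        self.dim = dim
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(buckets)]

    def lookup(self, prev_tok, cur_tok):
        # multiplicative hash of the token pair; constants are illustrative
        h = (prev_tok * 1000003 + cur_tok * 8191) % self.buckets
        return self.table[h]


def smear_gate(hidden, bigram_vec, gate):
    # SmearGate sketched here as a learned scalar gate blending the
    # bigram embedding into the hidden state
    return [h + gate * b for h, b in zip(hidden, bigram_vec)]
```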
U-Net skip connections
Learnable residual gating on skip connections.
parameters: null
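The learnable gating on the skip path can be sketched as below (a scalar gate and elementwise add are assumptions about the exact form):

```python
def gated_skip(decoder_x, encoder_x, gate):
    # U-Net-style skip: an earlier layer's activation is added back into
    # a matching later layer, scaled by a learnable gate (scalar here)
    return [d + gate * e for d, e in zip(decoder_x, encoder_x)]
```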
parallel residuals
Attention and MLP are run in parallel lanes in later layers.
parameters: {"layers":"7+"}
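For contrast with the standard sequential block, a sketch of the parallel form used in layers 7 and above (toy list arithmetic standing in for real attention/MLP modules):

```python
def sequential_block(x, attn, mlp):
    # standard form: the MLP reads the attention-updated residual stream
    y = [xi + ai for xi, ai in zip(x, attn(x))]
    return [yi + mi for yi, mi in zip(y, mlp(y))]


def parallel_block(x, attn, mlp):
    # parallel lanes: attention and MLP both read the same input,
    # and both outputs are summed into the residual stream
    return [xi + ai + mi for xi, ai, mi in zip(x, attn(x), mlp(x))]
```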
GQA
Grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
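With 8 query heads over 4 KV heads, each pair of query heads shares one key/value head; a one-line sketch of the mapping:

```python
def kv_head_for_query(q_head, heads=8, kv_heads=4):
    # each group of heads // kv_heads consecutive query heads
    # attends with the same shared key/value head
    return q_head // (heads // kv_heads)
```

Relative to full multi-head attention, this halves the KV-cache and KV-projection size.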
Value Embedding
Value embedding added in later layers.
parameters: {"dim":128,"layers":[9,10]}
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dims":"16/64"}
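With 16 of 64 head dimensions rotary, a sketch of applying the rotation only to the leading dims (the base frequency of 10000 is an assumption):

```python
import math

def partial_rope(vec, pos, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` entries of a head vector;
    the remaining dims pass through position-independent."""
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        theta = pos / (base ** (i / rotary_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```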
XSA
XSA applied to all layers.
parameters: {"layers":11}
weight tying
Tied embeddings.
parameters: null
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.9965}
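The EMA keeps a shadow copy of the weights that is nudged toward the live weights each step; a sketch treating parameters as flat lists:

```python
def ema_update(avg, params, decay=0.9965):
    # shadow weights move a small step (1 - decay) toward the live weights;
    # evaluation uses the averaged copy rather than the live one
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```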
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.09
momentum: 0.99
other_params: {"lr":0.02,"backend_steps":5}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.008,"role":"head"}
AdamW
weight_decay: 0.09
momentum: null
other_params: {"lr":0.6,"role":"embeddings"}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"lr":0.02,"role":"scalars"}
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.667}
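A warmdown schedule holds the peak learning rate, then decays it to zero over the final fraction of training; the linear decay shape below is an assumption:

```python
def lr_at(step, total_steps, peak_lr, warmdown_fraction=0.667):
    # constant at peak_lr, then linear decay to 0 over the last
    # warmdown_fraction of training steps
    warmdown_steps = int(total_steps * warmdown_fraction)
    start = total_steps - warmdown_steps
    if step < start:
        return peak_lr
    return peak_lr * (total_steps - step) / warmdown_steps
```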
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Evaluation
sliding window eval
parameters: {"stride":64}
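Sliding-window eval scores each token with near-full left context by advancing the window in small strides and taking losses only on the newly exposed tokens; the bookkeeping below is illustrative:

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) triples: run the model on
    tokens[start:end] and take losses only for tokens[score_from:end]."""
    spans = []
    scored = 0
    while scored < n_tokens:
        end = min(scored + (window if scored == 0 else stride), n_tokens)
        start = max(0, end - window)
        spans.append((start, end, scored))
        scored = end
    return spans
```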
Quantization
GPTQ
bits: 6
scope: all
Compression
Brotli
level: null
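The arithmetic behind an artifact-size estimate for 6-bit weights plus Brotli can be sketched as below; the parameter count and compression ratio in the test are pure assumptions for illustration, not this run's numbers:

```python
def artifact_mb(n_params, bits=6, brotli_ratio=0.85):
    # pack weights at `bits` bits each, then assume Brotli shrinks the
    # packed stream by `brotli_ratio`; real ratios depend on the data
    raw_bytes = n_params * bits / 8
    return raw_bytes * brotli_ratio / 2**20
```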

Novel Contributions

  • Depth recurrence with layers 4 and 5 repeated once to create 13 virtual layers
  • BigramHash(1536, dim 112) with SmearGate added on top of the recurrence base
  • EMA decay 0.9965 tuning
  • Combination of skip gates, parallel residuals, and MuonEq-R in the stack
  • GPTQ int6 quantization with Brotli compression to fit the artifact budget