PR #1661

open

Non-record: 11L DepthRec PolarNS SWA

by anderamondarainh-stack
val_bpb
1.1444
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,999,891 bytes

Training Techniques

Architecture
depth recurrence
Reuses MLP blocks across passes with learned scalar gating per reused pass.
parameters: {"reused_blocks":[4,5],"source_block":3}
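A minimal sketch of the idea, assuming toy scalar-multiply MLPs (the function names, values, and gating form here are illustrative, not the PR's code):

```python
# Depth recurrence sketch: the source block's MLP is applied again at later
# "reused" positions (blocks 4 and 5 in the PR's parameters), and each reuse
# is scaled by its own learned scalar gate (plain floats here).

def mlp3(x):
    # Stand-in for the source block's MLP (block 3).
    return [0.5 * v for v in x]

def forward(x, gates):
    # Normal residual pass through the source block...
    x = [xi + yi for xi, yi in zip(x, mlp3(x))]
    # ...then extra passes reusing the same MLP weights, one gate per reuse.
    for g in gates:
        y = mlp3(x)
        x = [xi + g * yi for xi, yi in zip(x, y)]
    return x

out = forward([1.0, 2.0], gates=[0.1, 0.2])
```

Because the reused passes share weights with the source block, the two extra "layers" add only two scalar parameters, which is what makes the technique attractive under a tight size cap.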
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16,"total_dimensions":64}
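A sketch of partial RoPE with the card's shape (16 rotated dims out of a 64-dim head); the frequency base and pairing scheme are assumptions:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate only the first `rot_dims` entries of the head vector, in
    # consecutive pairs; the remaining dims pass through unchanged.
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out

head = [1.0] * 64
rotated = partial_rope(head, pos=7)
```

The rotation is norm-preserving on the rotated pairs, and position 0 is the identity, so the unrotated 48 dims carry purely content-based information.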
BigramHash
Adds a bigram hash embedding feature.
parameters: {"buckets":3072,"dim":112}
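A hedged sketch of a bigram hash feature with the card's bucket count; the mixing constant and the pad-with-id-0 convention are assumptions, not the PR's scheme:

```python
def bigram_bucket(prev_id, cur_id, buckets=3072):
    # Deterministic mix of the (previous, current) token-id pair into a
    # fixed number of hash buckets (3072 per the PR's parameters).
    # The odd multiplier is arbitrary, chosen only for illustration.
    return ((prev_id * 1000003) ^ cur_id) % buckets

def add_bigram_feature(tok_emb, bigram_table, ids):
    # Add the hashed bigram embedding to each position's token embedding.
    out = []
    for t, cur in enumerate(ids):
        prev = ids[t - 1] if t > 0 else 0   # assumption: id 0 pads position 0
        b = bigram_bucket(prev, cur)
        out.append([e + h for e, h in zip(tok_emb[cur], bigram_table[b])])
    return out
```

Hashing collapses the full bigram vocabulary into a small trainable table (3072 x 112 here), giving local-context signal at a tiny parameter cost.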
XSA
Uses XSA in the deepest layers.
parameters: {"layers":4}
weight tying
Ties input and output embeddings.
parameters: null
KV head count
Uses fewer KV heads than query heads (grouped-query attention).
parameters: {"num_heads":8,"num_kv_heads":4}
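With 8 query heads and 4 KV heads, consecutive query heads share a KV head; a one-line sketch of the standard grouping:

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    # Grouped-query attention: each group of consecutive query heads
    # reads the same KV head, halving KV projection parameters here.
    group = num_heads // num_kv_heads   # 2 query heads per KV head
    return query_head // group

mapping = [kv_head_for(h) for h in range(8)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```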
parallel residuals
Uses parallel residual connections in later layers.
parameters: {"start_layer":7}
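The difference from a standard block, sketched with toy scalar sublayers (LayerNorms omitted; the lambdas stand in for real attention and MLP modules):

```python
def sequential_block(x, attn, mlp):
    # Standard block: the MLP sees the attention output.
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    # Parallel residual: attention and MLP read the same input and both
    # outputs are summed into one residual update (PR: from layer 7 on).
    return x + attn(x) + mlp(x)

# Toy sublayers standing in for the real ones.
attn = lambda x: 0.1 * x
mlp = lambda x: 0.2 * x
```

The parallel form lets the two sublayers run concurrently and shares a single pre-norm, at the cost of the MLP not seeing the attention output within the block.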
Quantization
late QAT
bits: 6
scope: MLP and attention 2D weights
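A sketch of the fake-quantization step used during a late QAT phase, at 6 bits as in the card; per-tensor symmetric scaling is an assumption (the PR's quantization granularity is not stated):

```python
def fake_quant_int6(w):
    # Symmetric 6-bit fake quantization: snap each weight to one of the
    # levels k * scale for integer k in [-31, 31]. During QAT the forward
    # pass uses these snapped values (gradients pass straight through).
    scale = max(abs(v) for v in w) / 31 or 1.0   # avoid zero scale
    return [round(v / scale) * scale for v in w], scale

q, s = fake_quant_int6([0.31, -0.155, 0.02, -0.31])
```

Running the last stretch of training with quantized forward passes lets the weights adapt to the int6 grid before serialization.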
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"polar_express_coefficients":true,"aol_preconditioning":true,"newton_schulz_iters":5}
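A pure-Python 2x2 sketch of the Newton-Schulz orthogonalization at the core of Muon. This uses the plain cubic iteration X &lt;- 1.5 X - 0.5 X X^T X for clarity; the PR swaps in tuned "Polar Express" quintic coefficients with AOL preconditioning, which are not reproduced here:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_schulz(G, iters=8):
    # Normalize by the Frobenius norm so all singular values are <= 1,
    # then iterate; singular values are driven toward 1 while the
    # singular vectors of the gradient G are preserved.
    fro = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / fro for v in row] for row in G]
    for _ in range(iters):
        A = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * a for x, a in zip(xr, ar)]
             for xr, ar in zip(X, A)]
    return X
```

Tuned coefficients (as in Polar Express) trade exact convergence for a much flatter singular-value map in few iterations, which is why the PR gets away with 5.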
Adam
weight_decay: null
momentum: null
other_params: {"scope":"scalars and embeddings"}
Weight Averaging
EMA + SWA
parameters: {"swa_start_scale":0.2,"swa_interval_steps":50}
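A scalar-weight sketch of running both averages with the card's schedule (SWA starts at 0.2 of training, snapshots every 50 steps); the EMA decay and the equal-weight blend are assumptions, since the PR does not state how the two are combined:

```python
def run_averaging(weights_per_step, swa_start_scale=0.2,
                  swa_interval=50, ema_decay=0.999):
    # EMA over every step; SWA averages snapshots taken every
    # `swa_interval` steps starting at swa_start_scale * total steps.
    swa_start = int(swa_start_scale * len(weights_per_step))
    ema = weights_per_step[0]
    swa_sum, swa_n = 0.0, 0
    for step, w in enumerate(weights_per_step):
        ema = ema_decay * ema + (1 - ema_decay) * w
        if step >= swa_start and (step - swa_start) % swa_interval == 0:
            swa_sum += w
            swa_n += 1
    swa = swa_sum / swa_n
    # Assumed equal-weight blend of the two averaged checkpoints.
    return 0.5 * ema + 0.5 * swa, ema, swa
```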
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
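A sketch of the usual sliding-window scheme: slide a fixed window over the held-out stream and score only the tokens not yet scored, so later tokens get long context. The window length matches the card's train length; the stride is an assumption (the card lists no parameters):

```python
def eval_spans(n_tokens, window=2048, stride=512):
    # Yield (context_start, score_from, score_to) triples covering every
    # token exactly once; all but the first window score `stride` tokens,
    # each with at least window - stride tokens of preceding context.
    spans = []
    prev_end, begin = 0, 0
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        begin += stride
    return spans
```

Summing the per-token losses over the scored ranges and dividing by `n_tokens` gives the sliding-window bpb.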
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
layerwise LN scale
parameters: null

Novel Contributions

  • Depth recurrence with learned scalar gating for reused MLP passes
  • Polar Express NS with AOL preconditioning inside Muon
  • SWA blended with EMA
  • Partial RoPE
  • XSA on deep layers
  • Parallel residuals in late blocks
  • BigramHash feature
  • Late int6 QAT
  • Int6 + zstd-22 serialization fitting under the 16MB cap