PR #1661

open

Non-record: 11L DepthRec PolarNS SWA

by anderamondarainh-stack
val_bpb
1.1444
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,999,891 bytes

Training Techniques

Architecture
depth recurrence
Reuses MLP blocks across passes with learned scalar gating per reused pass.
parameters: {"reused_blocks":[4,5],"source_block":3}
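A minimal sketch of the idea, assuming toy scalar-multiply MLPs (the function names, values, and gating form here are illustrative, not the PR's code):

```python
# Depth recurrence sketch: the source block's MLP is applied again at later
# "reused" positions (blocks 4 and 5 in the PR's parameters), and each reuse
# is scaled by its own learned scalar gate (plain floats here).

def mlp3(x):
    # Stand-in for the source block's MLP (block 3).
    return [0.5 * v for v in x]

def forward(x, gates):
    # Normal residual pass through the source block...
    x = [xi + yi for xi, yi in zip(x, mlp3(x))]
    # ...then extra passes reusing the same MLP weights, one gate per reuse.
    for g in gates:
        y = mlp3(x)
        x = [xi + g * yi for xi, yi in zip(x, y)]
    return x

out = forward([1.0, 2.0], gates=[0.1, 0.2])
```

Because the reused passes share weights with the source block, the two extra "layers" add only two scalar parameters, which is what makes the technique attractive under a tight size cap.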
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16,"total_dimensions":64}
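A sketch of partial RoPE with the card's shape (16 rotated dims out of a 64-dim head); the frequency base and pairing scheme are assumptions:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate only the first `rot_dims` entries of the head vector, in
    # consecutive pairs; the remaining dims pass through unchanged.
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out

head = [1.0] * 64
rotated = partial_rope(head, pos=7)
```

The rotation is norm-preserving on the rotated pairs, and position 0 is the identity, so the unrotated 48 dims carry purely content-based information.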
BigramHash
Adds a bigram hash embedding feature.
parameters: {"buckets":3072,"dim":112}
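A hedged sketch of a bigram hash feature with the card's bucket count; the mixing constant and the pad-with-id-0 convention are assumptions, not the PR's scheme:

```python
def bigram_bucket(prev_id, cur_id, buckets=3072):
    # Deterministic mix of the (previous, current) token-id pair into a
    # fixed number of hash buckets (3072 per the PR's parameters).
    # The odd multiplier is arbitrary, chosen only for illustration.
    return ((prev_id * 1000003) ^ cur_id) % buckets

def add_bigram_feature(tok_emb, bigram_table, ids):
    # Add the hashed bigram embedding to each position's token embedding.
    out = []
    for t, cur in enumerate(ids):
        prev = ids[t - 1] if t > 0 else 0   # assumption: id 0 pads position 0
        b = bigram_bucket(prev, cur)
        out.append([e + h for e, h in zip(tok_emb[cur], bigram_table[b])])
    return out
```

Hashing collapses the full bigram vocabulary into a small trainable table (3072 x 112 here), giving local-context signal at a tiny parameter cost.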
XSA
Uses XSA in the deepest layers.
parameters: {"layers":4}
weight tying
Ties input and output embeddings.
parameters: null
KV head count
Uses fewer KV heads than query heads (grouped-query attention).
parameters: {"num_heads":8,"num_kv_heads":4}
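With 8 query heads and 4 KV heads, consecutive query heads share a KV head; a one-line sketch of the standard grouping:

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    # Grouped-query attention: each group of consecutive query heads
    # reads the same KV head, halving KV projection parameters here.
    group = num_heads // num_kv_heads   # 2 query heads per KV head
    return query_head // group

mapping = [kv_head_for(h) for h in range(8)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```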
parallel residuals
Uses parallel residual connections in later layers.
parameters: {"start_layer":7}
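The difference from a standard block, sketched with toy scalar sublayers (LayerNorms omitted; the lambdas stand in for real attention and MLP modules):

```python
def sequential_block(x, attn, mlp):
    # Standard block: the MLP sees the attention output.
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    # Parallel residual: attention and MLP read the same input and both
    # outputs are summed into one residual update (PR: from layer 7 on).
    return x + attn(x) + mlp(x)

# Toy sublayers standing in for the real ones.
attn = lambda x: 0.1 * x
mlp = lambda x: 0.2 * x
```

The parallel form lets the two sublayers run concurrently and shares a single pre-norm, at the cost of the MLP not seeing the attention output within the block.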
Quantization
late QAT
bits: 6
scope: MLP and attention 2D weights
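A sketch of the fake-quantization step used during a late QAT phase, at 6 bits as in the card; per-tensor symmetric scaling is an assumption (the PR's quantization granularity is not stated):

```python
def fake_quant_int6(w):
    # Symmetric 6-bit fake quantization: snap each weight to one of the
    # levels k * scale for integer k in [-31, 31]. During QAT the forward
    # pass uses these snapped values (gradients pass straight through).
    scale = max(abs(v) for v in w) / 31 or 1.0   # avoid zero scale
    return [round(v / scale) * scale for v in w], scale

q, s = fake_quant_int6([0.31, -0.155, 0.02, -0.31])
```

Running the last stretch of training with quantized forward passes lets the weights adapt to the int6 grid before serialization.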
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"polar_express_coefficients":true,"aol_preconditioning":true,"newton_schulz_iters":5}
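A pure-Python 2x2 sketch of the Newton-Schulz orthogonalization at the core of Muon. This uses the plain cubic iteration X &lt;- 1.5 X - 0.5 X X^T X for clarity; the PR swaps in tuned "Polar Express" quintic coefficients with AOL preconditioning, which are not reproduced here:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_schulz(G, iters=8):
    # Normalize by the Frobenius norm so all singular values are <= 1,
    # then iterate; singular values are driven toward 1 while the
    # singular vectors of the gradient G are preserved.
    fro = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / fro for v in row] for row in G]
    for _ in range(iters):
        A = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * a for x, a in zip(xr, ar)]
             for xr, ar in zip(X, A)]
    return X
```

Tuned coefficients (as in Polar Express) trade exact convergence for a much flatter singular-value map in few iterations, which is why the PR gets away with 5.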
Adam
weight_decay: null
momentum: null
other_params: {"scope":"scalars and embeddings"}
Weight Averaging
EMA + SWA
parameters: {"swa_start_scale":0.2,"swa_interval_steps":50}
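A scalar-weight sketch of running both averages with the card's schedule (SWA starts at 0.2 of training, snapshots every 50 steps); the EMA decay and the equal-weight blend are assumptions, since the PR does not state how the two are combined:

```python
def run_averaging(weights_per_step, swa_start_scale=0.2,
                  swa_interval=50, ema_decay=0.999):
    # EMA over every step; SWA averages snapshots taken every
    # `swa_interval` steps starting at swa_start_scale * total steps.
    swa_start = int(swa_start_scale * len(weights_per_step))
    ema = weights_per_step[0]
    swa_sum, swa_n = 0.0, 0
    for step, w in enumerate(weights_per_step):
        ema = ema_decay * ema + (1 - ema_decay) * w
        if step >= swa_start and (step - swa_start) % swa_interval == 0:
            swa_sum += w
            swa_n += 1
    swa = swa_sum / swa_n
    # Assumed equal-weight blend of the two averaged checkpoints.
    return 0.5 * ema + 0.5 * swa, ema, swa
```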
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
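A sketch of the usual sliding-window scheme: slide a fixed window over the held-out stream and score only the tokens not yet scored, so later tokens get long context. The window length matches the card's train length; the stride is an assumption (the card lists no parameters):

```python
def eval_spans(n_tokens, window=2048, stride=512):
    # Yield (context_start, score_from, score_to) triples covering every
    # token exactly once; all but the first window score `stride` tokens,
    # each with at least window - stride tokens of preceding context.
    spans = []
    prev_end, begin = 0, 0
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        begin += stride
    return spans
```

Summing the per-token losses over the scored ranges and dividing by `n_tokens` gives the sliding-window bpb.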
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
layerwise LN scale
parameters: null

Novel Contributions

  • Depth recurrence with learned scalar gating for reused MLP passes
  • Polar Express NS with AOL preconditioning inside Muon
  • SWA blended with EMA
  • Partial RoPE
  • XSA on deep layers
  • Parallel residuals in late blocks
  • BigramHash feature
  • Late int6 QAT
  • Int6 + zstd-22 serialization fitting under the 16MB cap