val_bpb: 1.0970
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.07 MB
Training Techniques
Architecture
depth recurrence
3-layer recurrence over layers 3-5, looped twice per forward pass; activated 35% of the way through training.
parameters: {"layers":[3,4,5],"loops":2,"activated_at_frac":0.35}
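A minimal sketch of how this depth recurrence could work, assuming plain sequential layers. The layer callables, `forward` signature, and activation flag are illustrative, not the submission's code:

```python
# Depth recurrence sketch: layers 3-5 are applied twice per forward
# pass once the technique is activated (after 35% of training steps).
def forward(x, layers, recur_idx=(3, 4, 5), loops=2, active=True):
    """Apply `layers` in order; re-run the recurrent block `loops` times."""
    i = 0
    while i < len(layers):
        if active and i == recur_idx[0]:
            block = [layers[j] for j in recur_idx]
            for _ in range(loops):          # two passes over layers 3-5
                for layer in block:
                    x = layer(x)
            i = recur_idx[-1] + 1           # continue past the block
        else:
            x = layers[i](x)
            i += 1
    return x
```

Before activation, the same function runs with `active=False` and the block executes once like any other layers.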
parallel residuals
GPT-J style parallel residual path where attention and MLP read from the same input in later layers.
parameters: {"start_layer":7}
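A sketch of the GPT-J-style parallel residual used from layer 7 onward, contrasted with a conventional sequential pre-LN block. The single shared norm and the callables are assumptions for illustration:

```python
def parallel_block(x, norm, attn, mlp):
    # Attention and MLP both read the same normalized input, and their
    # outputs are summed into one residual update.
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, norm1, norm2, attn, mlp):
    # Conventional pre-LN block for comparison: the MLP sees the
    # attention output rather than the shared input.
    x = x + attn(norm1(x))
    return x + mlp(norm2(x))
```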
U-Net skip connections
Sigmoid-gated skip connections used in a U-Net-like pattern.
parameters: null
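The card gives no parameters for this technique, so the following is only a plausible shape: early-layer activations are added into mirrored later layers, scaled by a learned scalar gate passed through a sigmoid. The mirroring scheme and per-pair scalar gates are assumptions:

```python
import math

def sigmoid(g):
    return 1.0 / (1.0 + math.exp(-g))

def forward(x, layers, gates):
    """`gates` holds one learnable scalar per skip pair (assumed)."""
    n = len(layers)
    saved = []
    for i in range(n // 2):               # first half: save activations
        x = layers[i](x)
        saved.append(x)
    for i in range(n // 2, n):            # second half: gated skip-in
        skip = saved[n - 1 - i]           # U-Net-style mirrored pairing
        x = layers[i](x + sigmoid(gates[i - n // 2]) * skip)
    return x
```

At gate value 0 the sigmoid contributes each skip at half strength, so the gates can learn to open or close individual skips.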
RoPE
Rotary positional embeddings with 32 dimensions.
parameters: {"dimensions":32}
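A minimal RoPE sketch matching the 32-dimension parameter: only the first 32 dimensions of each head are rotated and the rest pass through. Treating the 32 as per-head rotary dimensions, and the base frequency of 10000, are assumptions (the card states neither):

```python
import numpy as np

def rope(x, rot_dims=32, base=10000.0):
    """x: (seq_len, head_dim). Rotates pairs within the first rot_dims."""
    seq, dim = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    # Each (x1_i, x2_i) pair is rotated by a position-dependent angle.
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

Position 0 is left unchanged (all angles are zero), and the rotation preserves the norm of the rotated subspace.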
weight tying
Tied input and output embeddings.
parameters: null
KV head count
Grouped attention configuration with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
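With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A common way to realize this is to broadcast the KV heads before the attention product; shapes here are illustrative:

```python
import numpy as np

def expand_kv(kv, n_heads=8, n_kv_heads=4):
    """kv: (n_kv_heads, seq, head_dim) -> (n_heads, seq, head_dim)."""
    group = n_heads // n_kv_heads          # query heads per KV head (2)
    return np.repeat(kv, group, axis=0)    # [k0,k0,k1,k1,k2,k2,k3,k3]
```

The KV cache stores only the 4 KV heads, halving its size relative to full multi-head attention.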
Regularization
logit softcap
Final logits bounded to a fixed range via tanh softcapping.
parameters: {"value":20}
layerwise LN scale
parameters: null
weight decay
Separate decay rates for Muon-updated matrices, Adam-updated parameters, and embeddings.
parameters: {"muon":0.095,"adam":0.02,"embed":0.085}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
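At Muon's core, each gradient matrix is orthogonalized with a few Newton-Schulz iterations (5 here) before the update. The sketch below uses the classical cubic iteration; the "MuonEq-R" variant named above may use tuned polynomial coefficients instead:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Push G toward the nearest orthogonal matrix."""
    X = G / (np.linalg.norm(G) + eps)      # Frobenius-normalize so all
                                           # singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X    # drives singular values to 1
    return X
```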
AdamW
weight_decay: 0.02
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
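The EMA update with decay 0.9965 averages the weights over roughly 1/(1-0.9965) ≈ 286 recent optimizer steps. A minimal per-step update, with parameters as plain lists for illustration:

```python
def ema_update(avg, params, decay=0.9965):
    """One EMA step: avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

The averaged copy, not the raw weights, is what gets evaluated and exported.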
Quantization
GPTQ
bits: 6
scope: attention and MLP matrices
GPTQ
bits: 8
scope: embeddings
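For intuition about the bit widths, here is a plain symmetric round-to-nearest quantizer per output row. This is not GPTQ: GPTQ additionally uses Hessian information to compensate rounding error column by column, and that machinery is omitted here:

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """W: (out, in). Symmetric per-row quantization to 2**bits levels."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 31 for 6 bits
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale                # dequantize: q * scale
```

At 6 bits each weight takes one of 64 levels, so the round-trip error per weight is at most half a quantization step.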
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
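With stride 64 and context 2048, each window scores only its last 64 tokens, so nearly every token is predicted with close to the full 2048-token context. A sketch of the evaluation loop, where `nll_fn` is a stand-in for the model's per-token negative log-likelihood; note the card reports bits per byte, which would further normalize by byte count rather than token count:

```python
import math

def sliding_window_bits_per_token(tokens, nll_fn, context=2048, stride=64):
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        begin = max(0, start + stride - context)
        window = tokens[begin:start + stride]
        n_new = min(stride, len(tokens) - start)
        # Score only the trailing `n_new` tokens of this window; the
        # earlier tokens serve purely as context.
        total_nll += sum(nll_fn(window)[-n_new:])
        n_scored += n_new
    return total_nll / n_scored / math.log(2)   # nats -> bits
```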
Compression
Brotli
level: null
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Novel Contributions
- SP8192 tokenizer with 8192-vocab SentencePiece BPE
- 3-layer depth recurrence in layers 3-5
- Parallel residuals in later layers
- Sigmoid-gated U-Net skip connections
- Learnable per-head QK gain scaling
- Full-Hessian GPTQ with SDClip quantization
- Brotli-compressed artifact
- Sliding window evaluation without TTT