PR #1573

open

Record: 12L RecycledCore Int5 — val_bpb 1.1464 (seed 1337)

by shivangbaveja
val_bpb: 1.1464
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.93 MB

Training Techniques

Architecture
depth recurrence
Recycled-core depth: selected physical layers are replayed in the forward pass, yielding more virtual layers than physical layers.
parameters: {"layers":12,"virtual_layers":14,"replayed_layers":[3,4]}
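A minimal sketch of how the replay schedule could expand 12 physical layers into 14 virtual forward passes, given the listed parameters (layers 3 and 4 replayed). The function name and the choice to replay each layer immediately after its first pass are assumptions; the PR does not specify the replay order.

```python
def layer_schedule(layers=12, replayed=(3, 4)):
    """Map virtual layer positions to physical layer indices.

    Each physical layer runs once; layers in `replayed` run a second
    time (here, immediately after their first pass - an assumption).
    """
    schedule = []
    for i in range(layers):
        schedule.append(i)
        if i in replayed:  # replay this physical layer's weights again
            schedule.append(i)
    return schedule
```

With the defaults this produces 14 entries over 12 distinct layer indices, matching the "14 virtual layers from 12 physical layers" claim.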
LeakyReLU
LeakyReLU(0.5)^2 MLP activation.
parameters: {"negative_slope":0.5}
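The activation as described, elementwise: LeakyReLU with negative slope 0.5, then squared. Note the square makes the output nonnegative even for negative inputs, while the leaky slope keeps gradient flowing there (a sketch of the stated formula, not the PR's code):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(negative_slope) followed by squaring, elementwise."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```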
Gated Attention
Per-head sigmoid gate on attention output.
parameters: null
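A sketch of the per-head sigmoid gate: each head's output is scaled by sigmoid of a gate logit before the output projection. In practice the gate logits would be learned (and possibly input-dependent); the scalar-per-head form here is an assumption.

```python
import math

def gated_heads(head_outputs, gate_logits):
    """Scale head h's output vector by sigmoid(gate_logits[h])."""
    gated = []
    for head, g in zip(head_outputs, gate_logits):
        s = 1.0 / (1.0 + math.exp(-g))  # sigmoid gate in (0, 1)
        gated.append([s * v for v in head])
    return gated
```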
Value Residual
First-layer values are blended into subsequent layers.
parameters: null
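The value-residual blend, sketched: the first layer's value vectors are mixed into each later layer's values. The mixing weight is typically learned per layer; the fixed 0.5 default below is illustrative only.

```python
def value_residual(v_layer, v_first, lam=0.5):
    """Blend first-layer values into the current layer's values:
    v = lam * v_first + (1 - lam) * v_layer (lam usually learned)."""
    return [lam * a + (1 - lam) * b for a, b in zip(v_first, v_layer)]
```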
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Partial rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
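A sketch of partial RoPE on one head vector: only the first 16 of 64 dimensions are rotated; the rest pass through unchanged. The adjacent-pair convention and the frequency base are assumptions, not taken from the PR.

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dims of `q` by position-dependent
    angles; dims beyond `rot_dims` are left untouched."""
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * c - y * s      # 2D rotation of each dim pair
        out[i + 1] = x * s + y * c
    return out
```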
LN scale
LayerNorm scale uses 1/sqrt(layer+1).
parameters: null
U-Net skip connections
Learned skip weights provide U-Net style skip connections.
parameters: null
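The U-Net-style skip as a sketch: an early layer's activations are added into a mirrored late layer, scaled by a learned scalar. The additive form and the mirrored pairing are assumptions about the design.

```python
def unet_skip_mix(x_late, x_early, skip_weight):
    """Add an early layer's activations into a late layer's stream,
    scaled by a learned per-connection weight (scalar here)."""
    return [h + skip_weight * e for h, e in zip(x_late, x_early)]
```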
BigramHash
Bigram hash embedding with 2048 buckets.
parameters: {"buckets":2048}
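A sketch of the bigram hash: each (previous token, current token) pair is hashed into one of 2048 embedding rows. The multiplier and mixing scheme below are illustrative assumptions; only the bucket count comes from the PR.

```python
def bigram_bucket(prev_token, token, buckets=2048, mult=1000003):
    """Hash a token bigram to a bucket index in [0, buckets)."""
    return (prev_token * mult + token) % buckets
```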
SmearGate
Per-dimension sigmoid gating.
parameters: null
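One plausible reading of SmearGate, sketched: the previous token's embedding is "smeared" into the current one with a per-dimension sigmoid gate. The exact blend form is an assumption; only "per-dimension sigmoid gating" is stated.

```python
import math

def smear(x_t, x_prev, gate_logits):
    """Blend the previous token's embedding into the current one,
    dimension-wise, gated by sigmoid of learned logits (assumed form)."""
    out = []
    for cur, prev, g in zip(x_t, x_prev, gate_logits):
        s = 1.0 / (1.0 + math.exp(-g))
        out.append(cur + s * prev)
    return out
```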
VE128
Value embeddings used in later layers.
parameters: {"dimensions":128,"layers":[10,11]}
Quantization
GPTQ-lite
bits: 5
scope: MLP and attention weights
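The int5 part, sketched as symmetric per-tensor quantization to the signed 5-bit range [-15, 15]. This is a minimal illustration of the bit width; GPTQ-lite's error-compensating rounding and any per-group scales are not reproduced here.

```python
def quantize_int5(weights):
    """Symmetric quantization to signed 5-bit integers in [-15, 15]."""
    qmax = 2 ** (5 - 1) - 1                      # 15
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int5 codes."""
    return [v * scale for v in q]
```

Round-trip error is bounded by half a quantization step (scale / 2) per weight.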
Compression
lzma
level: 9
Weight Averaging
SWA
parameters: {"every":50,"snapshots":13}
EMA
parameters: {"decay":0.997}
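The two averaging schemes, sketched with the parameters listed above (EMA decay 0.997; SWA snapshots every 50 steps). How the PR combines them - e.g. whether SWA averages EMA shadows or raw weights - is not stated, so the two are shown independently.

```python
class EMA:
    """Exponential moving average of a flat parameter list."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]

def swa_average(snapshots):
    """Uniform average of parameter snapshots (taken every 50 steps)."""
    n = len(snapshots)
    return [sum(vals) / n for vals in zip(*snapshots)]
```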
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.025}
Adam
weight_decay: 0.04
momentum: null
other_params: {"scope":"embeddings/scalars"}
Evaluation
sliding window eval
parameters: {"stride":64}
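A sketch of the stride-64 sliding-window evaluation: window start offsets advance by the stride, so each position is scored with long left context. Scoring only the final `stride` tokens of each window is a common convention, assumed here rather than stated in the PR.

```python
def window_starts(seq_len, window, stride=64):
    """Start offsets of evaluation windows over a sequence of
    `seq_len` tokens, each window `window` tokens wide."""
    return list(range(0, max(seq_len - window, 0) + 1, stride))
```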
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
logit softcap
parameters: {"value":30}
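The softcap with the listed value 30, sketched: logits are squashed smoothly into (-cap, cap) via a scaled tanh, which is approximately the identity for small logits.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)
```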
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}

Novel Contributions

  • Recycled-core depth with replayed layers to create 14 virtual layers from 12 physical layers
  • Int5 quantization of both MLP and attention weights to fit under the 16 MB limit
  • Gated attention with value residual blending
  • LeakyReLU(0.5)^2 MLP activation
  • Combination of SWA and EMA with Muon optimization
  • Partial RoPE, XSA, BigramHash, SmearGate, and value embeddings in later layers