PR #499

closed

The Frugendorff: Recursive Weight Sharing + MLP 4x (1.1478 BPB, 15.19MB)

by newjordan
val_bpb
1.1478
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.19MB

Training Techniques

Architecture
depth recurrence / recursive weight sharing
6 unique transformer blocks are each applied twice in sequence, yielding 12 effective layers from shared parameters.
parameters: {"unique_blocks":6,"loops_per_block":2,"effective_depth":12}
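A minimal sketch of the recurrence, assuming each unique block is looped twice consecutively before moving to the next (the exact iteration order is not shown in the PR; names are illustrative):

```python
def recursive_forward(x, blocks, loops_per_block=2):
    # 6 unique blocks, each applied loops_per_block times:
    # effective depth = 6 * 2 = 12 layers from 6 blocks' parameters.
    for block in blocks:
        for _ in range(loops_per_block):
            x = block(x)
    return x
```

The parameter savings from sharing 6 blocks across 12 effective layers are what fund the 4x MLP expansion below.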
MLP 4x
Expanded feed-forward network to 4x width (hidden size 2560) using freed parameter budget.
parameters: {"mlp_multiplier":4,"hidden_size":2560}
Partial RoPE
Uses partial rotary position embeddings on a subset of dimensions with NTK-aware scaling.
parameters: {"rope_dims":16,"total_dims":64}
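A sketch of the partial rotation: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged. The interleaved even/odd pairing and the NTK base-rescaling formula are common conventions, assumed here rather than taken from the PR:

```python
import numpy as np

def partial_rope(x, pos, rope_dims=16, base=10000.0, ntk_factor=1.0):
    # Rotate only the first rope_dims entries of the head vector x;
    # dimensions [rope_dims:] are left untouched.
    d = rope_dims
    base = base * ntk_factor ** (d / (d - 2))   # NTK-aware base rescale (assumed form)
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)
    ang = pos * inv_freq                        # one angle per rotated pair
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:d:2], x[1:d:2]
    out = x.copy()
    out[:d:2] = x1 * cos - x2 * sin             # 2D rotation of each pair
    out[1:d:2] = x1 * sin + x2 * cos
    return out
```

Because rotation is norm-preserving, the untouched dimensions can carry position-independent content features.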
SmearGate
Custom gating mechanism used in the model.
parameters: null
BigramHash
Hash-based bigram feature module with shared embeddings.
parameters: {"buckets":2048,"dimension":128}
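A minimal sketch of hashing a (previous, current) token pair into one of the 2048 bucket rows of the shared 128-dim embedding table; the mixing constants are illustrative, not from the PR:

```python
def bigram_bucket(prev_token, token, buckets=2048):
    # Mix the token pair into a 32-bit hash, then reduce to a bucket index.
    # Collisions are expected and tolerated: the module only needs to give
    # frequent bigrams a roughly distinct 128-dim embedding row.
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```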
tied embeddings
Input and output embeddings are shared.
parameters: null
XSA
XSA applied on the last 2 unique blocks.
parameters: {"last_blocks":2}
Initialization
QR-initialized orthogonal loop position embeddings
Orthogonal loop position embeddings initialized with QR decomposition.
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"scope":"matrices","lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings and scalars","embedding_lr":0.035,"scalar_lr":0.025}
Weight Averaging
SWA
parameters: {"frequency":"every 50 steps when scale < 0.2"}
EMA
parameters: {"decay":0.997}
Quantization
int6 QAT
bits: 6
scope: MLP and attention weights
int8
bits: 8
scope: embeddings
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
Other
other
Replays the last 100 training batches for 2 additional epochs at the end of training.
parameters: {"epochs":2,"batches":100}
other
Self-distillation using an EMA teacher for 50 steps.
parameters: {"teacher":"EMA","steps":50,"temperature":2,"alpha":0.7}

Novel Contributions

  • Recursive weight sharing / fractal looping of 6 unique transformer blocks to create 12 effective layers
  • Reinvesting parameter savings into a 4x MLP expansion
  • Orthogonal loop position embeddings to distinguish repeated passes through shared blocks
  • U-Net skip connections within each loop iteration
  • Combination of SmearGate, BigramHash, shared value embeddings, and XSA in a compact transformer
  • Late QAT, training replay, self-distillation, SWA, and EMA used together under the artifact budget