PR #1337
[Non-Record] LegendreGPT: Legendre polynomial depth parameterization
Status: open
by sergimichi
val_bpb: 1.2079
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.70 MB
Training Techniques
Architecture
Legendre depth parameterization
Factorized embedding and tied logit head; layer weights are generated from Legendre polynomial coefficients across depth.
parameters: {"layers":24,"groups":2,"virtual_layers":24,"degree_attn":5,"degree_ffn":2}
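The core idea can be sketched as follows: instead of storing one weight matrix per layer, store a small set of shared coefficient tensors and synthesize each layer's matrix as a Legendre-weighted combination at that layer's normalized depth. This is an illustrative reconstruction from the listed parameters (`layers=24`, `degree_attn=5`), not the PR's actual code; all names are hypothetical.

```python
import numpy as np

def legendre_layer_weights(coeffs, n_layers):
    """Synthesize per-layer weights from shared Legendre coefficients.

    coeffs: (degree+1, d_out, d_in) shared coefficient tensors.
    Returns (n_layers, d_out, d_in): one generated matrix per layer.
    """
    degree = coeffs.shape[0] - 1
    xs = np.linspace(-1.0, 1.0, n_layers)               # normalized depths
    # basis[l, k] = P_k(x_l): Legendre polynomial of order k at depth x_l
    basis = np.polynomial.legendre.legvander(xs, degree)
    # weight for layer l is sum_k basis[l, k] * coeffs[k]
    return np.einsum('lk,kij->lij', basis, coeffs)

# With degree_attn=5, 24 layers share only 6 coefficient tensors per
# attention matrix instead of 24 independent full matrices.
coeffs = np.random.randn(6, 8, 8)
W = legendre_layer_weights(coeffs, n_layers=24)
print(W.shape)  # (24, 8, 8)
```

Because `P_0` is constant, a degree-0 parameterization reduces to full weight sharing across depth; higher orders add smooth depth-dependent variation.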
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
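A minimal numpy sketch of grouped query attention with the listed head counts (8 query heads, 4 KV heads): each pair of query heads shares one KV head, which is materialized here by repeating K and V. The causal mask is omitted for brevity; this is not the PR's implementation.

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """q: (T, heads, d); k, v: (T, kv_heads, d). Illustrative sketch."""
    group = heads // kv_heads
    # each group of `group` query heads attends to the same KV head
    k = np.repeat(k, group, axis=1)                    # (T, heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True)) # stable softmax
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)

T, d = 5, 16
out = gqa_attention(np.random.randn(T, 8, d),
                    np.random.randn(T, 4, d),
                    np.random.randn(T, 4, d))
print(out.shape)  # (5, 8, 16)
```

Halving the KV heads halves the KV cache with little quality loss, which is the usual motivation for GQA.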
ReLU²
Uses ReLU squared MLP activation.
parameters: null
RoPE
Rotary positional embeddings.
parameters: null
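A standard rotary-embedding sketch for reference, assuming the common base of 10000 (the PR lists no parameters): consecutive feature pairs are rotated by position-dependent angles, so dot products between query and key depend only on relative position.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq, d), d even."""
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angles = np.outer(np.arange(seq), inv_freq)     # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin              # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.randn(8, 16)
y = rope(x)
```

Rotations preserve vector norms, and position 0 is left unchanged (all angles are zero there).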
weight tying
Tied embedding and output head with ALBERT-style factorized embeddings.
parameters: {"embedding_projection":[1024,128,512]}
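One plausible reading of `embedding_projection: [1024, 128, 512]` is vocab 1024 → low-rank dim 128 → model dim 512, with the output head reusing the same two factors transposed. That interpretation is assumed here, not confirmed by the PR.

```python
import numpy as np

vocab, low, d_model = 1024, 128, 512
E = np.random.randn(vocab, low) * 0.02      # small factorized embedding table
P = np.random.randn(low, d_model) * 0.02    # projection up to model dim

def embed(token_ids):
    return E[token_ids] @ P                 # (T, d_model)

def logits(h):
    # tied head: reuse the embedding factors instead of a separate matrix
    return h @ P.T @ E.T                    # (T, vocab)

# Factorized: 1024*128 + 128*512 = 196,608 params
# vs. an untied full table: 1024*512 = 524,288 params.
print(E.size + P.size)
```

The factorization plus tying is what keeps the artifact small; the embedding never materializes a full vocab-by-model-dim matrix.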
MLP3x
Feed-forward network uses 3x model dimension.
parameters: {"multiplier":3}
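The ReLU² and MLP3x entries together describe the feed-forward block; a minimal sketch combining them (hidden width 3x the model dim, ReLU-squared activation):

```python
import numpy as np

def relu2(x):
    """ReLU squared: zero for negative inputs, x^2 for positive."""
    return np.maximum(x, 0.0) ** 2

def ffn(x, W_in, W_out):
    """Feed-forward block sketch: expand 3x, apply ReLU^2, project back."""
    return relu2(x @ W_in) @ W_out

d_model = 512
W_in = np.random.randn(d_model, 3 * d_model) * 0.02
W_out = np.random.randn(3 * d_model, d_model) * 0.02
y = ffn(np.random.randn(4, d_model), W_in, W_out)
print(y.shape)  # (4, 512)
```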
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"all 2D weight matrices"}
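Muon's core step is momentum SGD whose 2D update is approximately orthogonalized via a Newton-Schulz iteration. The quintic coefficients below are the commonly used ones; hyperparameters and the step structure are a hedged sketch, since the PR lists them as null.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (drives singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315      # standard quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize so spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update on a 2D weight matrix. Per the PR, Muon covers all
    2D matrices; embeddings and scalars are handled by Adam instead."""
    buf = momentum * buf + grad
    return W - lr * newton_schulz_orthogonalize(buf), buf

np.random.seed(0)
G = np.random.randn(4, 4)
O = newton_schulz_orthogonalize(G)
```

The orthogonalized update equalizes the scale of the step across singular directions, which is the usual argument for Muon's fast convergence on matrix parameters.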
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Quantization
mixed int8/int7
bits: null
scope: Legendre orders and sandwich blocks
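A generic symmetric quantizer shows how the same routine can serve both widths; the PR's actual scheme (INT8 for low Legendre orders, INT7 for higher orders) is only described, so this is an assumed reconstruction.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to the given bit width (sketch)."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(16, 16).astype(np.float32)
q8, s8 = quantize(w, 8)   # low-order coefficients: 8-bit
q7, s7 = quantize(w, 7)   # higher orders: coarser 7-bit storage
```

Round-to-nearest bounds the per-element error by half the scale, so lower-magnitude high-order coefficients can tolerate the 7-bit grid.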
Compression
lzma
level: null
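The artifact pipeline presumably serializes the quantized tensors and LZMA-compresses the bytes; a stdlib sketch with hypothetical tensor names (the preset is an assumption, since the level is listed as null):

```python
import lzma
import pickle
import numpy as np

# Hypothetical state dict standing in for the quantized coefficient tensors.
state = {"coeffs_attn": np.zeros((6, 64, 64), dtype=np.int8),
         "coeffs_ffn": np.zeros((3, 64, 64), dtype=np.int8)}
raw = pickle.dumps(state)
packed = lzma.compress(raw, preset=9)          # preset is an assumption
restored = pickle.loads(lzma.decompress(packed))
print(len(packed) < len(raw))
```

LZMA on top of narrow integer storage is what produces the final 15.70 MB artifact size reported above.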
LR Schedule
linear decay
parameters: {"start":0.2,"end":0,"steps":60000}
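The listed parameters fully determine the schedule, which can be written directly:

```python
def lr_at(step, start=0.2, end=0.0, steps=60000):
    """Linear decay from start to end over `steps`, per the PR parameters."""
    t = min(step, steps) / steps
    return start + (end - start) * t
```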
Regularization
logit softcap
parameters: {"cap":30}
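Logit soft-capping with the listed cap of 30 is the standard tanh form: near-identity for small logits, smoothly saturating toward ±30 for large ones.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```

This keeps the softmax well-conditioned without the hard clipping that would zero gradients for large logits.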
Novel Contributions
- Legendre polynomial depth parameterization for transformer weights
- Two-group architecture to reduce gradient cancellation across layers
- Mixed-precision quantization scheme using INT8 for low-order Legendre coefficients and INT7 for higher orders
- Demonstration that orthogonal polynomial parameterization can work for language models
- ALBERT-style factorized embeddings combined with sandwich-style independent first and last blocks