PR #1337
[Non-Record] LegendreGPT: Legendre polynomial depth parameterization
Status: open
by sergimichi
val_bpb: 1.2079
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.70 MB
Training Techniques
Architecture
Legendre depth parameterization
Factorized embedding and tied logit head; layer weights are generated from Legendre polynomial coefficients across depth.
parameters: {"layers":24,"groups":2,"virtual_layers":24,"degree_attn":5,"degree_ffn":2}
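The core idea can be sketched as follows: instead of storing one weight matrix per layer, store a small set of shared coefficient tensors and synthesize each layer's matrix as a Legendre-weighted combination at that layer's normalized depth. This is an illustrative reconstruction from the listed parameters (`layers=24`, `degree_attn=5`), not the PR's actual code; all names are hypothetical.

```python
import numpy as np

def legendre_layer_weights(coeffs, n_layers):
    """Synthesize per-layer weights from shared Legendre coefficients.

    coeffs: (degree+1, d_out, d_in) shared coefficient tensors.
    Returns (n_layers, d_out, d_in): one generated matrix per layer.
    """
    degree = coeffs.shape[0] - 1
    xs = np.linspace(-1.0, 1.0, n_layers)               # normalized depths
    # basis[l, k] = P_k(x_l): Legendre polynomial of order k at depth x_l
    basis = np.polynomial.legendre.legvander(xs, degree)
    # weight for layer l is sum_k basis[l, k] * coeffs[k]
    return np.einsum('lk,kij->lij', basis, coeffs)

# With degree_attn=5, 24 layers share only 6 coefficient tensors per
# attention matrix instead of 24 independent full matrices.
coeffs = np.random.randn(6, 8, 8)
W = legendre_layer_weights(coeffs, n_layers=24)
print(W.shape)  # (24, 8, 8)
```

Because `P_0` is constant, a degree-0 parameterization reduces to full weight sharing across depth; higher orders add smooth depth-dependent variation.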
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
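A minimal numpy sketch of grouped query attention with the listed head counts (8 query heads, 4 KV heads): each pair of query heads shares one KV head, which is materialized here by repeating K and V. The causal mask is omitted for brevity; this is not the PR's implementation.

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """q: (T, heads, d); k, v: (T, kv_heads, d). Illustrative sketch."""
    group = heads // kv_heads
    # each group of `group` query heads attends to the same KV head
    k = np.repeat(k, group, axis=1)                    # (T, heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True)) # stable softmax
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)

T, d = 5, 16
out = gqa_attention(np.random.randn(T, 8, d),
                    np.random.randn(T, 4, d),
                    np.random.randn(T, 4, d))
print(out.shape)  # (5, 8, 16)
```

Halving the KV heads halves the KV cache with little quality loss, which is the usual motivation for GQA.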
ReLU²
Uses ReLU squared MLP activation.
parameters: null
RoPE
Rotary positional embeddings.
parameters: null
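A standard rotary-embedding sketch for reference, assuming the common base of 10000 (the PR lists no parameters): consecutive feature pairs are rotated by position-dependent angles, so dot products between query and key depend only on relative position.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq, d), d even."""
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angles = np.outer(np.arange(seq), inv_freq)     # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin              # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.randn(8, 16)
y = rope(x)
```

Rotations preserve vector norms, and position 0 is left unchanged (all angles are zero there).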
weight tying
Tied embedding and output head with ALBERT-style factorized embeddings.
parameters: {"embedding_projection":[1024,128,512]}
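One plausible reading of `embedding_projection: [1024, 128, 512]` is vocab 1024 → low-rank dim 128 → model dim 512, with the output head reusing the same two factors transposed. That interpretation is assumed here, not confirmed by the PR.

```python
import numpy as np

vocab, low, d_model = 1024, 128, 512
E = np.random.randn(vocab, low) * 0.02      # small factorized embedding table
P = np.random.randn(low, d_model) * 0.02    # projection up to model dim

def embed(token_ids):
    return E[token_ids] @ P                 # (T, d_model)

def logits(h):
    # tied head: reuse the embedding factors instead of a separate matrix
    return h @ P.T @ E.T                    # (T, vocab)

# Factorized: 1024*128 + 128*512 = 196,608 params
# vs. an untied full table: 1024*512 = 524,288 params.
print(E.size + P.size)
```

The factorization plus tying is what keeps the artifact small; the embedding never materializes a full vocab-by-model-dim matrix.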
MLP3x
Feed-forward network uses 3x model dimension.
parameters: {"multiplier":3}
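The ReLU² and MLP3x entries together describe the feed-forward block; a minimal sketch combining them (hidden width 3x the model dim, ReLU-squared activation):

```python
import numpy as np

def relu2(x):
    """ReLU squared: zero for negative inputs, x^2 for positive."""
    return np.maximum(x, 0.0) ** 2

def ffn(x, W_in, W_out):
    """Feed-forward block sketch: expand 3x, apply ReLU^2, project back."""
    return relu2(x @ W_in) @ W_out

d_model = 512
W_in = np.random.randn(d_model, 3 * d_model) * 0.02
W_out = np.random.randn(3 * d_model, d_model) * 0.02
y = ffn(np.random.randn(4, d_model), W_in, W_out)
print(y.shape)  # (4, 512)
```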
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"all 2D weight matrices"}
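Muon's core step is momentum SGD whose 2D update is approximately orthogonalized via a Newton-Schulz iteration. The quintic coefficients below are the commonly used ones; hyperparameters and the step structure are a hedged sketch, since the PR lists them as null.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (drives singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315      # standard quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize so spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update on a 2D weight matrix. Per the PR, Muon covers all
    2D matrices; embeddings and scalars are handled by Adam instead."""
    buf = momentum * buf + grad
    return W - lr * newton_schulz_orthogonalize(buf), buf

np.random.seed(0)
G = np.random.randn(4, 4)
O = newton_schulz_orthogonalize(G)
```

The orthogonalized update equalizes the scale of the step across singular directions, which is the usual argument for Muon's fast convergence on matrix parameters.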
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Quantization
mixed int8/int7
bits: null
scope: Legendre orders and sandwich blocks
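A generic symmetric quantizer shows how the same routine can serve both widths; the PR's actual scheme (INT8 for low Legendre orders, INT7 for higher orders) is only described, so this is an assumed reconstruction.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to the given bit width (sketch)."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(16, 16).astype(np.float32)
q8, s8 = quantize(w, 8)   # low-order coefficients: 8-bit
q7, s7 = quantize(w, 7)   # higher orders: coarser 7-bit storage
```

Round-to-nearest bounds the per-element error by half the scale, so lower-magnitude high-order coefficients can tolerate the 7-bit grid.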
Compression
lzma
level: null
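The artifact pipeline presumably serializes the quantized tensors and LZMA-compresses the bytes; a stdlib sketch with hypothetical tensor names (the preset is an assumption, since the level is listed as null):

```python
import lzma
import pickle
import numpy as np

# Hypothetical state dict standing in for the quantized coefficient tensors.
state = {"coeffs_attn": np.zeros((6, 64, 64), dtype=np.int8),
         "coeffs_ffn": np.zeros((3, 64, 64), dtype=np.int8)}
raw = pickle.dumps(state)
packed = lzma.compress(raw, preset=9)          # preset is an assumption
restored = pickle.loads(lzma.decompress(packed))
print(len(packed) < len(raw))
```

LZMA on top of narrow integer storage is what produces the final 15.70 MB artifact size reported above.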
LR Schedule
linear decay
parameters: {"start":0.2,"end":0,"steps":60000}
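The listed parameters fully determine the schedule, which can be written directly:

```python
def lr_at(step, start=0.2, end=0.0, steps=60000):
    """Linear decay from start to end over `steps`, per the PR parameters."""
    t = min(step, steps) / steps
    return start + (end - start) * t
```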
Regularization
logit softcap
parameters: {"cap":30}
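Logit soft-capping with the listed cap of 30 is the standard tanh form: near-identity for small logits, smoothly saturating toward ±30 for large ones.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```

This keeps the softmax well-conditioned without the hard clipping that would zero gradients for large logits.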
Novel Contributions
- Legendre polynomial depth parameterization for transformer weights
- Two-group architecture to reduce gradient cancellation across layers
- Mixed-precision quantization scheme using INT8 for low-order Legendre coefficients and INT7 for higher orders
- Demonstration that orthogonal polynomial parameterization can work for language models
- ALBERT-style factorized embeddings combined with sandwich-style independent first and last blocks