PR #1499

open

SP8192 + Depth Recurrence + Parallel Residuals (14.09MB)

by dippatel1994
val_bpb
1.6323
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.09MB

Training Techniques

Architecture
BigramHash
Hash-based bigram embedding table
parameters: {"size":10240,"dim":128}
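A hashed bigram table maps each (previous token, current token) pair into a fixed number of rows, trading collisions for a bounded parameter count. A minimal sketch, assuming the listed 10240×128 table; the hash constants and `bigram_hash`/`bigram_embed` names are illustrative, not from the PR:

```python
import numpy as np

TABLE_SIZE, DIM = 10240, 128  # from the entry's parameters
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32)

def bigram_hash(prev_id: int, cur_id: int) -> int:
    # Mix the two token ids into a table row (hash constants are illustrative).
    return ((prev_id * 1000003) ^ cur_id) % TABLE_SIZE

def bigram_embed(token_ids):
    # One 128-dim vector per position, looked up from the hashed (prev, cur) pair;
    # position 0 pairs with a padding id of 0.
    rows = [bigram_hash(p, c) for p, c in zip([0] + token_ids[:-1], token_ids)]
    return bigram_table[rows]

emb = bigram_embed([5, 17, 42])  # emb.shape == (3, 128)
```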
SmearGate
Learned gate blending adjacent token embeddings
parameters: null
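One way to read "learned gate blending adjacent token embeddings" is mixing each embedding with its left neighbor through a sigmoid gate. A minimal sketch under that assumption; the scalar gate and zero-padding of the first position are simplifications:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear(emb, gate_logit):
    # emb: (T, D) token embeddings; blend each position with its left neighbor.
    prev = np.roll(emb, 1, axis=0)
    prev[0] = 0.0                 # the first token has no left neighbor
    g = sigmoid(gate_logit)       # learned gate in (0, 1)
    return emb + g * prev

emb = np.ones((4, 8), dtype=np.float32)
out = smear(emb, gate_logit=0.0)  # g = 0.5, so rows 1+ become 1.5
```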
LeakyReLU
Squared LeakyReLU activation in MLP
parameters: null
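Reading the entry as "apply LeakyReLU, then square the result" (the exact composition and slope are assumptions), the activation is:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.01):
    # LeakyReLU followed by squaring; the slope value here is illustrative.
    y = np.where(x > 0, x, negative_slope * x)
    return y * y

y = squared_leaky_relu(np.array([3.0, -2.0]))
# 3.0 -> 9.0; -2.0 -> (-0.02)^2 = 0.0004
```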
GQA
Grouped-query attention with fewer KV heads than query heads
parameters: {"query_heads":8,"kv_heads":4}
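With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads, halving KV-cache size. A numpy sketch of the attention math (causal masking omitted for brevity):

```python
import numpy as np

Q_HEADS, KV_HEADS, HEAD_DIM, T = 8, 4, 32, 16
GROUP = Q_HEADS // KV_HEADS      # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((Q_HEADS, T, HEAD_DIM))
k = rng.standard_normal((KV_HEADS, T, HEAD_DIM))
v = rng.standard_normal((KV_HEADS, T, HEAD_DIM))

# Broadcast each KV head across its group of query heads.
k_full = np.repeat(k, GROUP, axis=0)   # (8, T, HEAD_DIM)
v_full = np.repeat(v, GROUP, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full                 # (8, T, HEAD_DIM)
```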
Partial RoPE
Rotary position embeddings applied to only part of the head dimension
parameters: {"dimensions":16}
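Partial RoPE rotates only the first 16 channels of each head and passes the remainder through unchanged. A sketch assuming the usual split-halves rotation and a base of 10000 (the base is an assumption):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    # x: (T, head_dim); rotate only the first rope_dims channels.
    T, D = x.shape
    rot, rest = x[:, :rope_dims], x[:, rope_dims:]
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)        # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, rest], axis=1)

y = partial_rope(np.ones((4, 64)))
# position 0 is unrotated, and channels 16: are untouched at every position
```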
Value Residual
ResFormer-style blending of current and initial value projections
parameters: {"alpha":0.95}
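The ResFormer-style blend keeps 95% of the current layer's value projection and mixes in 5% of layer 0's, per the listed alpha. The blend itself is one line:

```python
import numpy as np

ALPHA = 0.95  # from the entry's parameters

def blend_values(v_current, v_initial, alpha=ALPHA):
    # Mix this layer's value projection with the first layer's.
    return alpha * v_current + (1.0 - alpha) * v_initial

out = blend_values(np.ones((4, 8)), np.zeros((4, 8)))
# every entry is 0.95
```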
XSA
Extended self-attention on the last 4 layers
parameters: {"layers":4}
Depth Recurrence
Layers 3-5 are executed a second time, yielding 13 effective layers from 10 physical layers
parameters: {"layers":[3,4,5],"repeats":2}
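The recurrence can be expressed as an execution schedule over the 10 physical layer indices: after the 3-4-5 block runs once, it is replayed before continuing, giving 13 layer applications. A sketch (the schedule-building helper is illustrative):

```python
# Expand a 10-layer physical stack into a 13-step execution schedule by
# running layers 3-5 twice (indices follow the entry's parameters).
def build_schedule(num_layers=10, recur=(3, 4, 5), repeats=2):
    schedule = []
    for i in range(num_layers):
        schedule.append(i)
        if i == recur[-1]:
            # after finishing the block once, replay it (repeats - 1) more times
            schedule.extend(list(recur) * (repeats - 1))
    return schedule

sched = build_schedule()
# [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9] -> 13 effective layers
```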
U-Net skip connections
5 encoder and 5 decoder layers with learned skip weights
parameters: {"encoder_layers":5,"decoder_layers":5}
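With 5 encoder and 5 decoder layers, each decoder layer adds a weighted copy of a saved encoder activation. A sketch assuming the usual mirrored pairing (first decoder layer reads the last encoder output); the pairing order, scalar skip weights, and stand-in `layer` function are all assumptions:

```python
import numpy as np

ENC, DEC = 5, 5
skip_w = np.full(DEC, 0.5)   # learned scalar skip weights (init is illustrative)

def layer(x, i):
    # stand-in for a transformer block
    return x + 0.01 * i

def unet_forward(x):
    enc_outs = []
    for i in range(ENC):                            # encoder half: save activations
        x = layer(x, i)
        enc_outs.append(x)
    for j in range(DEC):                            # decoder half: add matched skip
        x = x + skip_w[j] * enc_outs[ENC - 1 - j]   # mirrored pairing
        x = layer(x, ENC + j)
    return x

out = unet_forward(np.zeros((2, 4)))
```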
Parallel Residuals
Attention and MLP run in parallel from the same normalized input on layers 7+
parameters: {"start_layer":7}
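In a parallel-residual block, attention and MLP both read the same normalized input and their outputs are summed into the residual stream, instead of the MLP consuming the attention output. A sketch with stand-in sublayers (the real attention/MLP and the RMSNorm choice are assumptions):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def attn(x):   # stand-in for the attention sublayer
    return 0.5 * x

def mlp(x):    # stand-in for the MLP sublayer
    return 0.25 * x

def parallel_block(x):
    # Both sublayers read the same normalized input; outputs are summed.
    h = rmsnorm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x):
    # Conventional form, for contrast: MLP sees the post-attention stream.
    x = x + attn(rmsnorm(x))
    return x + mlp(rmsnorm(x))

y = parallel_block(np.ones((2, 4)))   # 1 + 0.5 + 0.25 = 1.75 per entry
```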
Weight Averaging
EMA
parameters: {"decay":0.997}
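EMA keeps a shadow copy of the parameters updated as `shadow = decay * shadow + (1 - decay) * weights` each step, and evaluates the shadow copy. With the listed decay of 0.997:

```python
DECAY = 0.997  # from the entry's parameters

def ema_update(shadow, weights, decay=DECAY):
    # Exponential moving average of parameters, evaluated instead of raw weights.
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0]
for _ in range(3):
    shadow = ema_update(shadow, [1.0])
# after n steps toward a constant weight of 1, shadow = 1 - 0.997**n
```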
Initialization
OrthoInit
Orthogonal initialization with 1/sqrt(2*num_layers) scaling on output projections
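The description pins down both pieces: an orthogonal matrix, scaled by 1/sqrt(2 * num_layers) on output projections. A numpy sketch using QR decomposition (the sign fix on the diagonal of R is a standard detail to make the distribution uniform over orthogonal matrices):

```python
import numpy as np

def ortho_init(out_dim, in_dim, num_layers, seed=0):
    # Orthogonal matrix via QR, scaled by 1/sqrt(2 * num_layers).
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((out_dim, in_dim))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))            # fix QR's column-sign ambiguity
    return q * (1.0 / np.sqrt(2 * num_layers))

w = ortho_init(64, 64, num_layers=10)
# columns are orthogonal; each has squared norm 1/20
```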
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Quantization
GPTQ
bits: 5
scope: MLP
GPTQ
bits: 6
scope: attention
late QAT
bits: null
scope: all
Regularization
weight decay
parameters: {"value":0.04}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
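A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final 3500 steps. A sketch (the base LR and total step count here are placeholders, not from the PR):

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    # Hold base_lr, then decay linearly to 0 over the final warmdown_steps.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

# warmdown_lr(0, 10000) == 1.0 and warmdown_lr(10000, 10000) == 0.0
```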
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
zlib
level: null

Novel Contributions

  • SP8192 tokenizer to reduce tokens per byte
  • Depth recurrence over layers 3-5 for increased effective depth without increasing parameter count
  • Parallel residuals on deeper layers
  • U-Net style skip connections with learned skip weights
  • Full-Hessian GPTQ with percentile-search scale selection
  • 14.09MB compressed artifact under the competition limit
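The "percentile-search scale selection" bullet suggests choosing each quantization scale by trying several percentiles of the weight magnitudes and keeping the one with the lowest reconstruction error. This is a sketch of that search step only, not of full-Hessian GPTQ; the percentile grid and symmetric integer grid are assumptions:

```python
import numpy as np

def pick_scale(w, bits, percentiles=(99.0, 99.5, 99.9, 100.0)):
    # Try a few percentiles of |w| as the clipping point; keep the scale
    # that minimizes round-trip MSE on this weight group.
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, np.inf
    for p in percentiles:
        clip = np.percentile(np.abs(w), p)
        scale = clip / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.mean((q * scale - w) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

w = np.random.default_rng(0).standard_normal(1024)
s = pick_scale(w, bits=5)   # 5-bit MLP setting from the quantization table
```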