| val_bpb | Architecture | Optimizer | Artifact Size |
|---|---|---|---|
| 1.3527 | Transformer | Muon | 17.7 MB |
Training Techniques
Architecture
- weight tying: Tied embeddings are enabled. (parameters: null)
- GQA: Uses 8 query heads and 2 KV heads. (parameters: {"num_heads":8,"num_kv_heads":2})
- shared attention: Interior transformer layers share attention modules in adjacent pairs. (parameters: {"shared_pairs":4})
- SwiGLU: Replaces the baseline MLP with a PRP-based SwiGLU block. (parameters: {"mlp_mult":4})
- PRP: Parametrized random projection MLP with a fixed random projection buffer and low-dimensional trainable controls. (parameters: {"rank":32})
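The attention settings above can be sketched in a few lines of NumPy. This is an illustration of the technique, not the model's actual code: with 8 query heads and 2 KV heads, each group of 4 query heads attends with the same K/V head, and "shared attention" amounts to the second layer of an adjacent pair reusing the first layer's projection weights.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, num_heads=8, num_kv_heads=2):
    """Grouped-query attention: num_heads query heads share num_kv_heads KV heads."""
    T, d = x.shape
    hd = d // num_heads                          # per-head dimension
    q = (x @ wq).reshape(T, num_heads, hd)
    k = (x @ wk).reshape(T, num_kv_heads, hd)
    v = (x @ wv).reshape(T, num_kv_heads, hd)
    group = num_heads // num_kv_heads            # 4 query heads per KV head
    out = np.empty_like(q)
    for h in range(num_heads):
        kv = h // group                          # KV head this query head reads
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((10, d))
wq = rng.standard_normal((d, d)) * 0.02
wk = rng.standard_normal((d, d // 4)) * 0.02    # KV projections are 4x narrower
wv = rng.standard_normal((d, d // 4)) * 0.02
y = gqa_attention(x, wq, wk, wv)
# "Shared attention" across an adjacent pair is just weight reuse:
# the paired layer calls gqa_attention with the same (wq, wk, wv) arrays.
print(y.shape)  # (10, 64)
```

Because the KV projections are a quarter of the query width, GQA shrinks both the KV cache and the K/V weight matrices relative to full multi-head attention.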
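The PRP-based SwiGLU block can be sketched under one plausible reading of the description: each projection keeps a frozen random base matrix (a buffer, never trained) and trains only a rank-32 correction. The `PRPLinear` parametrization below is an assumption for illustration, not the documented implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

class PRPLinear:
    """Hypothetical PRP layer: frozen random projection plus a rank-r trainable
    correction (U @ V). Only U and V would be trained; `base` is a fixed buffer.
    This exact parametrization is an assumption, not the model's documented code."""
    def __init__(self, d_in, d_out, rank=32):
        self.base = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)  # frozen buffer
        self.U = np.zeros((d_in, rank))                                 # trainable controls
        self.V = rng.standard_normal((rank, d_out)) / np.sqrt(rank)     # trainable controls
    def __call__(self, x):
        return x @ self.base + (x @ self.U) @ self.V

def swiglu_mlp(x, d, mlp_mult=4, rank=32):
    """SwiGLU built from PRP layers: down(silu(gate(x)) * up(x))."""
    h = d * mlp_mult
    gate, up, down = PRPLinear(d, h, rank), PRPLinear(d, h, rank), PRPLinear(h, d, rank)
    g = gate(x)
    silu = g / (1.0 + np.exp(-g))    # SiLU(g) = g * sigmoid(g)
    return down(silu * up(x))

x = rng.standard_normal((4, 64))
y = swiglu_mlp(x, d=64)
print(y.shape)  # (4, 64)
```

Under this reading, the trainable parameter count per projection drops from `d_in * d_out` to `rank * (d_in + d_out)`, which is consistent with the small artifact size reported above.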
Optimizer
- Muon
  - weight_decay: null
  - momentum: 0.95
  - other_params: {"matrix_lr":0.03,"scalar_lr":0.03,"prp_lr":0.06,"tied_embed_lr":0.03}
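The four learning rates in `other_params` imply four optimizer parameter groups (matrix weights, scalar/vector parameters, PRP controls, and the tied embedding). A minimal sketch of assembling such groups follows; the name-based classification rules are illustrative assumptions, not the repo's logic.

```python
import numpy as np

def build_param_groups(named_params, cfg):
    """Split parameters into the four LR groups implied by the config.
    The `in name` / ndim predicates here are illustrative assumptions."""
    groups = {"matrix": [], "scalar": [], "prp": [], "tied_embed": []}
    for name, p in named_params:
        if "prp" in name:                     # PRP low-dimensional controls
            groups["prp"].append(p)
        elif "embed" in name:                 # tied embedding/unembedding matrix
            groups["tied_embed"].append(p)
        elif getattr(p, "ndim", 0) >= 2:      # weight matrices (Muon-style updates)
            groups["matrix"].append(p)
        else:                                 # gains, biases, other scalars/vectors
            groups["scalar"].append(p)
    return [
        {"params": groups["matrix"],     "lr": cfg["matrix_lr"],     "momentum": 0.95},
        {"params": groups["scalar"],     "lr": cfg["scalar_lr"],     "momentum": 0.95},
        {"params": groups["prp"],        "lr": cfg["prp_lr"],        "momentum": 0.95},
        {"params": groups["tied_embed"], "lr": cfg["tied_embed_lr"], "momentum": 0.95},
    ]

cfg = {"matrix_lr": 0.03, "scalar_lr": 0.03, "prp_lr": 0.06, "tied_embed_lr": 0.03}
params = [("embed.weight", np.zeros((8, 4))),
          ("mlp.prp_u",    np.zeros((4, 2))),
          ("norm.gain",    np.zeros(4)),
          ("attn.wq",      np.zeros((4, 4)))]
pg = build_param_groups(params, cfg)
print([g["lr"] for g in pg])  # [0.03, 0.03, 0.06, 0.03]
```

Note the PRP controls get 2x the learning rate of everything else (0.06 vs. 0.03), matching the separate-optimizer-group contribution listed below.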
LR Schedule
- cosine decay (parameters: {"warmup_steps":2,"main_scale":0.5,"min_scale":0.05,"gamma":0.8})
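A minimal sketch of the schedule, assuming a linear warmup followed by a cosine decay from `main_scale` down to `min_scale`. The config's `gamma` is not modeled, since its role (e.g. a per-cycle or exponential factor) is not specified here.

```python
import math

def lr_scale(step, total_steps, warmup_steps=2, main_scale=0.5, min_scale=0.05):
    """Linear warmup, then cosine decay from main_scale to min_scale.
    (`gamma` from the config is intentionally omitted; its role is unspecified.)"""
    if step < warmup_steps:
        return main_scale * (step + 1) / warmup_steps          # linear warmup
    t = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    return min_scale + (main_scale - min_scale) * 0.5 * (1 + math.cos(math.pi * t))

print(lr_scale(0, 100))              # 0.25  (mid-warmup)
print(lr_scale(2, 100))              # 0.5   (peak, start of cosine)
print(round(lr_scale(99, 100), 4))   # 0.05  (floor at the end)
```

The multiplier would scale each group's base LR, so e.g. the matrix group peaks at 0.03 * 0.5.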
Sequence Length
- train_length: 2048
- eval_length: null
Compression
- zlib (level: null)
Novel Contributions
- 21-layer Transformer with shared attention across adjacent interior layer pairs
- PRP-based SwiGLU MLP with low-dimensional trainable controls
- 8k SentencePiece tokenizer and 2048-token training context
- Separate optimizer group for PRP vector controls
- Shared-aware int8 export that deduplicates repeated storage before counting bytes
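The last bullet can be sketched as follows: arrays referenced by more than one layer (shared attention pairs, tied embeddings) are stored once, keyed by object identity, and bytes are counted after zlib compression (a null `level` presumably meaning zlib's default). The helper below is an illustration, not the repo's exporter, and the quantization rule is a naive placeholder.

```python
import zlib
import numpy as np

def export_int8(named_tensors):
    """Shared-aware export sketch: identical (shared) arrays are stored once.
    Dedup is by object identity, so two layers pointing at the same array
    contribute a single compressed payload; later names become aliases."""
    seen = {}       # id(array) -> name of first occurrence
    blobs = {}      # name -> zlib-compressed int8 bytes
    aliases = {}    # name -> name of the payload it shares
    for name, arr in named_tensors:
        key = seen.get(id(arr))
        if key is None:
            seen[id(arr)] = name
            q = np.clip(np.round(arr * 127), -127, 127).astype(np.int8)  # naive int8 quant
            blobs[name] = zlib.compress(q.tobytes())   # default compression level
        else:
            aliases[name] = key
    total = sum(len(b) for b in blobs.values())        # count bytes AFTER dedup
    return blobs, aliases, total

rng = np.random.default_rng(0)
shared = rng.standard_normal((64, 64))
tensors = [("layer0.attn", shared),                    # same array object twice
           ("layer1.attn", shared),
           ("layer2.attn", rng.standard_normal((64, 64)))]
blobs, aliases, nbytes = export_int8(tensors)
print(len(blobs), aliases)  # 2 {'layer1.attn': 'layer0.attn'}
```

Counting bytes after deduplication is what makes the shared-attention pairs and tied embeddings "free" in the reported 17.7 MB artifact size.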