PR #1499

open

SP8192 + Depth Recurrence + Parallel Residuals (14.09MB)

by dippatel1994
val_bpb
1.6323
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.09MB

Training Techniques

Architecture
BigramHash
Hash-based bigram embedding table
parameters: {"size":10240,"dim":128}
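A hashed bigram table maps each (previous token, current token) pair into a fixed number of rows, trading collisions for a bounded parameter count. A minimal sketch, assuming the listed 10240×128 table; the hash constants and `bigram_hash`/`bigram_embed` names are illustrative, not from the PR:

```python
import numpy as np

TABLE_SIZE, DIM = 10240, 128  # from the entry's parameters
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32)

def bigram_hash(prev_id: int, cur_id: int) -> int:
    # Mix the two token ids into a table row (hash constants are illustrative).
    return ((prev_id * 1000003) ^ cur_id) % TABLE_SIZE

def bigram_embed(token_ids):
    # One 128-dim vector per position, looked up from the hashed (prev, cur) pair;
    # position 0 pairs with a padding id of 0.
    rows = [bigram_hash(p, c) for p, c in zip([0] + token_ids[:-1], token_ids)]
    return bigram_table[rows]

emb = bigram_embed([5, 17, 42])  # emb.shape == (3, 128)
```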
SmearGate
Learned gate blending adjacent token embeddings
parameters: null
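One way to read "learned gate blending adjacent token embeddings" is mixing each embedding with its left neighbor through a sigmoid gate. A minimal sketch under that assumption; the scalar gate and zero-padding of the first position are simplifications:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear(emb, gate_logit):
    # emb: (T, D) token embeddings; blend each position with its left neighbor.
    prev = np.roll(emb, 1, axis=0)
    prev[0] = 0.0                 # the first token has no left neighbor
    g = sigmoid(gate_logit)       # learned gate in (0, 1)
    return emb + g * prev

emb = np.ones((4, 8), dtype=np.float32)
out = smear(emb, gate_logit=0.0)  # g = 0.5, so rows 1+ become 1.5
```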
LeakyReLU
Squared LeakyReLU activation in MLP
parameters: null
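Reading the entry as "apply LeakyReLU, then square the result" (the exact composition and slope are assumptions), the activation is:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.01):
    # LeakyReLU followed by squaring; the slope value here is illustrative.
    y = np.where(x > 0, x, negative_slope * x)
    return y * y

y = squared_leaky_relu(np.array([3.0, -2.0]))
# 3.0 -> 9.0; -2.0 -> (-0.02)^2 = 0.0004
```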
GQA
Grouped-query attention with fewer KV heads than query heads
parameters: {"query_heads":8,"kv_heads":4}
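With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads, halving KV-cache size. A numpy sketch of the attention math (causal masking omitted for brevity):

```python
import numpy as np

Q_HEADS, KV_HEADS, HEAD_DIM, T = 8, 4, 32, 16
GROUP = Q_HEADS // KV_HEADS      # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((Q_HEADS, T, HEAD_DIM))
k = rng.standard_normal((KV_HEADS, T, HEAD_DIM))
v = rng.standard_normal((KV_HEADS, T, HEAD_DIM))

# Broadcast each KV head across its group of query heads.
k_full = np.repeat(k, GROUP, axis=0)   # (8, T, HEAD_DIM)
v_full = np.repeat(v, GROUP, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full                 # (8, T, HEAD_DIM)
```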
Partial RoPE
Rotary position embeddings applied to only part of the head dimension
parameters: {"dimensions":16}
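Partial RoPE rotates only the first 16 channels of each head and passes the remainder through unchanged. A sketch assuming the usual split-halves rotation and a base of 10000 (the base is an assumption):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    # x: (T, head_dim); rotate only the first rope_dims channels.
    T, D = x.shape
    rot, rest = x[:, :rope_dims], x[:, rope_dims:]
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)        # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, rest], axis=1)

y = partial_rope(np.ones((4, 64)))
# position 0 is unrotated, and channels 16: are untouched at every position
```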
Value Residual
ResFormer-style blending of current and initial value projections
parameters: {"alpha":0.95}
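The ResFormer-style blend keeps 95% of the current layer's value projection and mixes in 5% of layer 0's, per the listed alpha. The blend itself is one line:

```python
import numpy as np

ALPHA = 0.95  # from the entry's parameters

def blend_values(v_current, v_initial, alpha=ALPHA):
    # Mix this layer's value projection with the first layer's.
    return alpha * v_current + (1.0 - alpha) * v_initial

out = blend_values(np.ones((4, 8)), np.zeros((4, 8)))
# every entry is 0.95
```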
XSA
Extended self-attention on the last 4 layers
parameters: {"layers":4}
Depth Recurrence
Layers 3-5 are executed a second time, yielding 13 effective layers from 10 physical layers
parameters: {"layers":[3,4,5],"repeats":2}
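The recurrence can be expressed as an execution schedule over the 10 physical layer indices: after the 3-4-5 block runs once, it is replayed before continuing, giving 13 layer applications. A sketch (the schedule-building helper is illustrative):

```python
# Expand a 10-layer physical stack into a 13-step execution schedule by
# running layers 3-5 twice (indices follow the entry's parameters).
def build_schedule(num_layers=10, recur=(3, 4, 5), repeats=2):
    schedule = []
    for i in range(num_layers):
        schedule.append(i)
        if i == recur[-1]:
            # after finishing the block once, replay it (repeats - 1) more times
            schedule.extend(list(recur) * (repeats - 1))
    return schedule

sched = build_schedule()
# [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9] -> 13 effective layers
```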
U-Net skip connections
5 encoder and 5 decoder layers with learned skip weights
parameters: {"encoder_layers":5,"decoder_layers":5}
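With 5 encoder and 5 decoder layers, each decoder layer adds a weighted copy of a saved encoder activation. A sketch assuming the usual mirrored pairing (first decoder layer reads the last encoder output); the pairing order, scalar skip weights, and stand-in `layer` function are all assumptions:

```python
import numpy as np

ENC, DEC = 5, 5
skip_w = np.full(DEC, 0.5)   # learned scalar skip weights (init is illustrative)

def layer(x, i):
    # stand-in for a transformer block
    return x + 0.01 * i

def unet_forward(x):
    enc_outs = []
    for i in range(ENC):                            # encoder half: save activations
        x = layer(x, i)
        enc_outs.append(x)
    for j in range(DEC):                            # decoder half: add matched skip
        x = x + skip_w[j] * enc_outs[ENC - 1 - j]   # mirrored pairing
        x = layer(x, ENC + j)
    return x

out = unet_forward(np.zeros((2, 4)))
```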
Parallel Residuals
Attention and MLP run in parallel from the same normalized input on layers 7+
parameters: {"start_layer":7}
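In a parallel-residual block, attention and MLP both read the same normalized input and their outputs are summed into the residual stream, instead of the MLP consuming the attention output. A sketch with stand-in sublayers (the real attention/MLP and the RMSNorm choice are assumptions):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def attn(x):   # stand-in for the attention sublayer
    return 0.5 * x

def mlp(x):    # stand-in for the MLP sublayer
    return 0.25 * x

def parallel_block(x):
    # Both sublayers read the same normalized input; outputs are summed.
    h = rmsnorm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x):
    # Conventional form, for contrast: MLP sees the post-attention stream.
    x = x + attn(rmsnorm(x))
    return x + mlp(rmsnorm(x))

y = parallel_block(np.ones((2, 4)))   # 1 + 0.5 + 0.25 = 1.75 per entry
```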
Weight Averaging
EMA
parameters: {"decay":0.997}
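EMA keeps a shadow copy of the parameters updated as `shadow = decay * shadow + (1 - decay) * weights` each step, and evaluates the shadow copy. With the listed decay of 0.997:

```python
DECAY = 0.997  # from the entry's parameters

def ema_update(shadow, weights, decay=DECAY):
    # Exponential moving average of parameters, evaluated instead of raw weights.
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0]
for _ in range(3):
    shadow = ema_update(shadow, [1.0])
# after n steps toward a constant weight of 1, shadow = 1 - 0.997**n
```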
Initialization
OrthoInit
Orthogonal initialization with 1/sqrt(2*num_layers) scaling on output projections
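The description pins down both pieces: an orthogonal matrix, scaled by 1/sqrt(2 * num_layers) on output projections. A numpy sketch using QR decomposition (the sign fix on the diagonal of R is a standard detail to make the distribution uniform over orthogonal matrices):

```python
import numpy as np

def ortho_init(out_dim, in_dim, num_layers, seed=0):
    # Orthogonal matrix via QR, scaled by 1/sqrt(2 * num_layers).
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((out_dim, in_dim))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))            # fix QR's column-sign ambiguity
    return q * (1.0 / np.sqrt(2 * num_layers))

w = ortho_init(64, 64, num_layers=10)
# columns are orthogonal; each has squared norm 1/20
```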
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Quantization
GPTQ
bits: 5
scope: MLP
GPTQ
bits: 6
scope: attention
late QAT
bits: null
scope: all
Regularization
weight decay
parameters: {"value":0.04}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
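A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final 3500 steps. A sketch (the base LR and total step count here are placeholders, not from the PR):

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    # Hold base_lr, then decay linearly to 0 over the final warmdown_steps.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

# warmdown_lr(0, 10000) == 1.0 and warmdown_lr(10000, 10000) == 0.0
```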
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
zlib
level: null

Novel Contributions

  • SP8192 tokenizer to reduce tokens per byte
  • Depth recurrence over layers 3-5 for increased effective depth without increasing parameter count
  • Parallel residuals on deeper layers
  • U-Net style skip connections with learned skip weights
  • Full-Hessian GPTQ with percentile-search scale selection
  • 14.09MB compressed artifact under the competition limit
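The "percentile-search scale selection" bullet suggests choosing each quantization scale by trying several percentiles of the weight magnitudes and keeping the one with the lowest reconstruction error. This is a sketch of that search step only, not of full-Hessian GPTQ; the percentile grid and symmetric integer grid are assumptions:

```python
import numpy as np

def pick_scale(w, bits, percentiles=(99.0, 99.5, 99.9, 100.0)):
    # Try a few percentiles of |w| as the clipping point; keep the scale
    # that minimizes round-trip MSE on this weight group.
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, np.inf
    for p in percentiles:
        clip = np.percentile(np.abs(w), p)
        scale = clip / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.mean((q * scale - w) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

w = np.random.default_rng(0).standard_normal(1024)
s = pick_scale(w, bits=5)   # 5-bit MLP setting from the quantization table
```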