PR #1392

open

Add record: SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + Brotli (1.1020 BPB)

by Its-Just-Crump
val_bpb: 1.1020
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.88 MB

Training Techniques

Architecture
SentencePiece 4096
Uses a 4096-token SentencePiece tokenizer instead of the baseline SP1024.
parameters: {"vocab_size":4096}
MLP4x
Widens the MLP to a 4x multiplier and uses a squared LeakyReLU activation.
parameters: {"multiplier":4}
depth recurrence
Re-executes layers 4 and 5 starting at step 3000 to add logical depth without adding parameters.
parameters: {"layers":[4,5],"start_step":3000}
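The recurrence schedule can be sketched as a function that returns the layer execution order for a given step. This is a minimal illustration, assuming an 11-layer stack with zero-indexed layers, that recurrence is active from step 3000 inclusive, and that layers 4 and 5 are replayed as a block right after their first pass. The name `layer_schedule` is hypothetical and the PR's actual ordering may differ.

```python
def layer_schedule(num_layers, recur_layers, start_step, step):
    """Return the order in which layer indices run at a given training step."""
    order = list(range(num_layers))
    if step < start_step:
        # Before the switch-over, every layer runs exactly once.
        return order
    # Assumption: the recurrent block is replayed right after its first pass.
    insert_at = order.index(max(recur_layers)) + 1
    return order[:insert_at] + list(recur_layers) + order[insert_at:]


print(layer_schedule(11, [4, 5], 3000, 2999))
print(layer_schedule(11, [4, 5], 3000, 3000))
```

Because the replayed layers share their weights with the first pass, the effective depth grows from 11 to 13 layer applications while the parameter count stays the same.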
Parallel Residuals
Merges MLP and attention outputs with learned lane_merge and resid_mix_mlp parameters from layer 7 onward.
parameters: {"start_layer":7}
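To illustrate the difference between a sequential and a parallel residual block, here is a minimal sketch on plain Python lists. How the PR's `lane_merge` and `resid_mix_mlp` parameters enter the computation is not stated here, so the sketch uses two hypothetical scalar weights, `attn_weight` and `mlp_weight`, as a stand-in for a learned merge. The toy `attn` and `mlp` functions are placeholders, not real sublayers.

```python
def sequential_block(x, attn, mlp):
    """Standard block: the MLP reads the output of the attention residual."""
    h = [xi + ai for xi, ai in zip(x, attn(x))]
    return [hi + mi for hi, mi in zip(h, mlp(h))]


def parallel_block(x, attn, mlp, attn_weight, mlp_weight):
    """Parallel block: attention and MLP both read the same input x.

    attn_weight and mlp_weight are hypothetical learned scalars that
    stand in for the learned merge described above.
    """
    a = attn(x)
    m = mlp(x)
    return [xi + attn_weight * ai + mlp_weight * mi
            for xi, ai, mi in zip(x, a, m)]


# Toy stand-ins for the two sublayers (assumptions for illustration only).
toy_attn = lambda v: [0.5 * t for t in v]
toy_mlp = lambda v: [t * t for t in v]

print(sequential_block([1.0, 2.0], toy_attn, toy_mlp))
print(parallel_block([1.0, 2.0], toy_attn, toy_mlp, 1.0, 1.0))
```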
QK-Gain
Initializes query and key projections with 5x scale.
parameters: {"gain":5}
RoPE
Uses partial rotary positional embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
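Partial rotary embeddings rotate only some of each head's dimensions and pass the rest through unchanged. The sketch below rotates the first 16 of 64 dimensions, matching the parameters above. The adjacent-pair layout, the frequency base of 10000, and the function name `partial_rope` are assumptions; the PR may pair dimensions differently.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first rotary_dims entries of vec; leave the rest untouched."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    for i in range(rotary_dims // 2):
        # Each adjacent pair (2i, 2i+1) is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=7)
print(rotated[16:] == head[16:])  # the 48 non-rotary dimensions are unchanged
```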
VE128
Applies VE128 in layers 9-10.
parameters: {"layers":[9,10]}
SmearGate
Uses a position-mixing gate.
parameters: null
U-Net skip connections
Adds encoder-decoder skip connections.
parameters: null
XSA
Uses XSA attention on all 11 layers.
parameters: {"layers":11}
Optimizer
Muon
weight_decay: 0.09
momentum: null
other_params: {"parallel":true,"muon_eq_r":true}
AdamW
weight_decay: 0.09
momentum: null
other_params: null
Quantization
GPTQ
bits: 6
scope: all
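For intuition about what 6-bit weights look like, here is a sketch of plain symmetric round-to-nearest quantization with one scale per row. This is not GPTQ: GPTQ additionally uses calibration data to compensate for the rounding error it introduces. The per-row scale, the symmetric range of -31 to 31, and the function names are assumptions.

```python
QMAX = 31  # symmetric 6-bit range: integer codes in [-31, 31]


def quantize_int6(row):
    """Round-to-nearest quantization of one weight row to 6-bit codes."""
    peak = max(abs(w) for w in row)
    scale = peak / QMAX if peak > 0 else 1.0
    codes = [max(-QMAX, min(QMAX, round(w / scale))) for w in row]
    return codes, scale


def dequantize_int6(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.31, -0.10, 0.0, 0.155]
codes, scale = quantize_int6(row)
restored = dequantize_int6(codes, scale)
print(codes, scale)
print(max(abs(w - r) for w, r in zip(row, restored)))
```

The integer codes are what a general-purpose compressor such as Brotli would then operate on.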
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
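The two averaging schemes can be sketched separately: an exponential moving average updated every step with decay 0.997, and a running mean over snapshots taken every 50 steps. How the PR combines them, and what "Tight" restricts, is not stated here, so the sketch keeps them independent. The names `ema_update` and `RunningAverage` are hypothetical, and the one-element weight list is a toy stand-in for model parameters.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: move the average a small fraction toward the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class RunningAverage:
    """Uniform running mean over weight snapshots (SWA-style)."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def add(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]


ema = [0.0]
swa = RunningAverage()
for step in range(1, 201):
    weights = [float(step)]      # toy "weights" that drift with the step
    ema = ema_update(ema, weights)
    if step % 50 == 0:           # swa_every = 50
        swa.add(weights)

print(swa.count, swa.mean)
```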
Compression
Brotli
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
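Sliding-window evaluation with stride 64 can be sketched as a list of windows in which each window re-reads earlier context but scores only the tokens not yet scored. The context length is not given here, so `context_len` is an assumed parameter, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(num_tokens, context_len, stride=64):
    """Return (start, end, score_from) triples.

    The model reads tokens[start:end] and only tokens[score_from:end]
    contribute to the loss, so every token is scored exactly once.
    """
    windows = []
    scored_upto = 0
    end = min(context_len, num_tokens)
    while True:
        start = max(0, end - context_len)
        windows.append((start, end, scored_upto))
        scored_upto = end
        if end >= num_tokens:
            break
        end = min(end + stride, num_tokens)
    return windows


for window in sliding_windows(num_tokens=300, context_len=128):
    print(window)
```

A smaller stride gives each scored token more left context on average, at the cost of more forward passes.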
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
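A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The sketch below assumes a linear decay to zero over the last 4000 steps; the decay shape, the function name `lr_scale`, and the total of 10000 steps in the usage lines are assumptions, since the total step count is not given here.

```python
def lr_scale(step, total_steps, warmdown_steps=4000):
    """Multiplier applied to the base learning rate at a given step."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    # Assumption: linear decay from 1.0 to 0.0 across the warmdown window.
    return max(0.0, (total_steps - step) / warmdown_steps)


for step in (0, 6000, 8000, 10000):
    print(step, lr_scale(step, total_steps=10000))
```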
Regularization
weight decay
parameters: {"adam_wd":0.09,"muon_wd":0.09}
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
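The scale rule 1/sqrt(layer+1) shrinks the factor as depth increases. The sketch below just evaluates it per layer, assuming zero-indexed layers and an 11-layer stack; which tensor the factor multiplies is not stated here, and the name `ln_scale` is hypothetical.

```python
import math


def ln_scale(layer):
    """Depth-dependent scale factor for a zero-indexed layer."""
    return 1.0 / math.sqrt(layer + 1)


for layer in range(11):
    print(layer, round(ln_scale(layer), 4))
```

Under these assumptions the first layer keeps a factor of 1.0, layer 3 gets 0.5, and the last of 11 layers gets roughly 0.30.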

Novel Contributions

  • SP4096-native architecture replacing SP1024 + BigramHash
  • Depth recurrence on layers 4-5 starting at step 3000
  • Parallel residuals with learned merge from layer 7 onward
  • QK-Gain 5.0 initialization
  • MuonEq-R optimizer variant
  • GPTQ calibration with 128 batches
  • Brotli compression for SP4096 int6 weights
  • Removal of BigramHash, TrigramHash, and TTT from the prior stack