PR #1969 (open)

SP8192 CaseOps + WiderGate32 + GPTQ-int6 — val_bpb 1.08037 (3-seed mean)

by bsisduck

val_bpb: 1.0804
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB

Training Techniques

Architecture
LeakyReLU
MLP uses a squared LeakyReLU activation with a 4x expansion of the 2048-dim hidden size (2048 → 8192 → 2048)
parameters: {"mlp_multiplier":4,"hidden_size":2048}
GQA
Grouped-query attention with 2:1 query-to-KV head grouping
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Partial rotary position embeddings applied to 16 of the 64 head dimensions
parameters: {"dimensions":16,"base":10000,"total_dimensions":64}
U-Net skip connections
Encoder-decoder skip connections with skip gates
parameters: null
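With parameters unspecified, one plausible minimal form is a single learned scalar gate per encoder/decoder pair:

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(()))  # one scalar per skip

    def forward(self, x, enc_activation):
        # decoder stream plus gated encoder activation
        return x + torch.sigmoid(self.gate) * enc_activation
```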
depth recurrence
Layers 3-5 are looped twice per forward pass for virtual depth expansion
parameters: {"loop_layers":[3,5],"num_loops":2,"virtual_layers":17}
SmearGate
Position-mixing gate widened to 32 dimensions
parameters: {"width":32}
Gated Attention
Per-head attention output gating widened from 12 to 32 input dimensions
parameters: {"gate_width":32}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"variant":"Polar-Express","ns_steps":5,"minimax_tuples":true}
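For reference, the stock Muon Newton-Schulz iteration with ns_steps=5; the Polar-Express variant replaces the fixed (a, b, c) tuple with per-step minimax-optimal tuples, whose values are not reproduced here:

```python
import torch

def newton_schulz(G, steps=5, abc=(3.4445, -4.7750, 2.0315)):
    a, b, c = abc
    X = G.float() / (G.norm() + 1e-7)
    flip = X.size(0) > X.size(1)              # iterate on the short side
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic NS update
    return X.T if flip else X
```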
Regularization
logit softcap
parameters: {"value":30}
Quantization
GPTQ
bits: 6
scope: all weights
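A minimal sketch of the GPTQ inner loop on a per-row symmetric int6 grid, without blocking or activation ordering; the PR's actual implementation details are not given:

```python
import torch

def gptq_quantize(W, H, bits=6, damp=0.01):
    # W: (rows, cols) weight; H: (cols, cols) Hessian from calibration data.
    # Quantize column by column, spreading each rounding error onto the
    # not-yet-quantized columns via the inverse Hessian.
    W, H = W.clone().float(), H.clone().float()
    H += damp * H.diagonal().mean() * torch.eye(H.size(0))
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6
    scale = W.abs().amax(dim=1) / qmax          # per-row symmetric scale
    Q = torch.empty_like(W)
    for i in range(W.size(1)):
        q = torch.clamp(torch.round(W[:, i] / scale), -qmax - 1, qmax)
        Q[:, i] = q
        err = (W[:, i] - q * scale) / Hinv[i, i]
        W[:, i:] -= err[:, None] * Hinv[i, i:][None, :]
    return Q, scale
```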
Compression
brotli
level: 11
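Packing the int6 codes into bytes is repo-specific; the compression step itself is the brotli Python binding at maximum quality:

```python
import brotli

packed = b"\x00" * 1024                     # placeholder for packed int6 codes
blob = brotli.compress(packed, quality=11)  # quality 11 = max compression
assert brotli.decompress(blob) == packed
```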
Test-Time Training
LoRA TTT
parameters: {"rank":96,"phases":1,"prefix_docs":2000}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
LR Schedule
warmdown
parameters: {"min_lr":0.1}

Novel Contributions

  • Wider attention output gates with GATE_WIDTH=32
  • Widened SmearGate to width 32
  • SP8192 CaseOps tokenizer with bijective case markers (see the sketch after this list)
  • GPTQ int6 quantization of all weights with brotli compression
  • Polar-Express Muon optimization setup
  • TTT with LoRA rank-96 on 2000 prefix docs
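
The CaseOps scheme is only named in this PR; below is a hypothetical illustration of bijective case marking, with the marker glyph and scheme invented here and the assumption that the marker never occurs in the corpus:

```python
UP = "\u2191"  # hypothetical marker: next character was uppercase

def encode_case(text: str) -> str:
    return "".join(UP + c.lower() if c.isupper() else c for c in text)

def decode_case(marked: str) -> str:
    out, up = [], False
    for c in marked:
        if c == UP:
            up = True
        else:
            out.append(c.upper() if up else c)
            up = False
    return "".join(out)

assert decode_case(encode_case("Hello World")) == "Hello World"  # round-trips
```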