PR #400

Status: open

Record: 11L CANON-AC(last5)+DeltaGate Report (Humble Record Attempt, val_bpb: 1.1296)

by chanwoo-park-official
val_bpb: 1.1296
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,581,348 bytes

Training Techniques

Quantization
int6
bits: 6
scope: mlp, attn
QAT
bits: 6
scope: mlp, attn
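The int6 quantization with QAT listed above can be sketched as a fake-quantize step: weights are rounded to a signed 6-bit grid and dequantized back to float for the forward pass. This is a minimal sketch assuming symmetric per-tensor scaling; the PR's exact scheme (per-channel scales, the straight-through backward pass) may differ.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization: round weights to a signed
    int grid (int6: 64 levels), then dequantize back to float. In QAT the
    forward pass uses these values while gradients flow through unchanged
    via a straight-through estimator (not shown here)."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale
```

Per the report's scope, this would be applied to the mlp and attn weight matrices only.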
Architecture
BigramHash
Uses BigramHash as part of the leaderboard-aligned stack.
parameters: null
SmearGate
Uses SmearGate as part of the leaderboard-aligned stack.
parameters: null
XSA
Enables XSA only on the last 4 transformer blocks.
parameters: {"layers":4}
Partial RoPE
Applies RoPE to only 16 of the head dimensions; the remaining dimensions are left unrotated.
parameters: {"dimensions":16}
MLP3x
Uses a 3x MLP expansion.
parameters: {"multiplier":3}
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
CANON
Adds a CANON convolutional path (kernel 3) scoped to the last 5 layers, with delta gating on the residual contribution.
parameters: {"kernel":3,"last_n":5,"delta_gate":1}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
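The listed momentum warmup (0.92 to 0.99 over 1500 steps) can be sketched as a simple schedule; linear interpolation is an assumption here, and `muon_momentum` is a hypothetical helper, not the PR's code.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup sketch: ramp Muon's momentum from `start` to `end`
    over `warmup_steps` optimizer steps, then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```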
Weight Averaging
SWA
parameters: {"enabled":true,"tight_swa":true,"every":50,"start_lrmul":0.2,"max_checkpoints":12}
Evaluation
sliding window eval
parameters: {"stride":64}
Initialization
CANON delta gate near-identity init
Initializes CANON delta gate with g=-4.0 so the path starts near identity.
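Assuming the delta gate is a sigmoid of g (not stated explicitly in the report), the near-identity claim checks out numerically:

```python
import math

# With g = -4.0 at init, a sigmoid delta gate opens the CANON path to only
# ~1.8% of its full strength, so each gated block starts near identity.
gate_at_init = 1.0 / (1.0 + math.exp(4.0))   # sigmoid(-4.0)
```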
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
Regularization
weight decay
parameters: {"matrix":0.04,"adam":0.04}
layerwise LN scale
parameters: {"enabled":true}

Novel Contributions

  • Scoped CANON placement on the last 5 layers (AC(last5))
  • CANON delta gate to modulate the residual CANON path
  • Tight SWA schedule under a 600-second wallclock cap
  • Combination of AC(last5)+delta with leaderboard-aligned components to improve val_bpb
  • Int6 quantization with QAT on MLP and attention weights