PR #400

Status: open

Record: 11L CANON-AC(last5)+DeltaGate Report (Humble Record Attempt, val_bpb: 1.1296)

by chanwoo-park-official
val_bpb: 1.1296
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,581,348 bytes

Training Techniques

Quantization
int6
bits: 6
scope: mlp, attn
QAT
bits: 6
scope: mlp, attn
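The int6 quantization with QAT listed above can be sketched as a fake-quantize step: weights are rounded to a signed 6-bit grid and dequantized back to float for the forward pass. This is a minimal sketch assuming symmetric per-tensor scaling; the PR's exact scheme (per-channel scales, the straight-through backward pass) may differ.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization: round weights to a signed
    int grid (int6: 64 levels), then dequantize back to float. In QAT the
    forward pass uses these values while gradients flow through unchanged
    via a straight-through estimator (not shown here)."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale
```

Per the report's scope, this would be applied to the mlp and attn weight matrices only.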
Architecture
BigramHash
Uses BigramHash as part of the leaderboard-aligned stack.
parameters: null
SmearGate
Uses SmearGate as part of the leaderboard-aligned stack.
parameters: null
XSA
Enables XSA only on the last 4 transformer blocks.
parameters: {"layers":4}
Partial RoPE
Applies RoPE to only 16 of the head dimensions; the remaining dimensions are left unrotated.
parameters: {"dimensions":16}
MLP3x
Uses a 3x MLP expansion.
parameters: {"multiplier":3}
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
CANON
Adds a CANON convolutional path (kernel 3) scoped to the last 5 layers, with delta gating on the residual contribution.
parameters: {"kernel":3,"last_n":5,"delta_gate":1}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
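The listed momentum warmup (0.92 to 0.99 over 1500 steps) can be sketched as a simple schedule; linear interpolation is an assumption here, and `muon_momentum` is a hypothetical helper, not the PR's code.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup sketch: ramp Muon's momentum from `start` to `end`
    over `warmup_steps` optimizer steps, then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```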
Weight Averaging
SWA
parameters: {"enabled":true,"tight_swa":true,"every":50,"start_lrmul":0.2,"max_checkpoints":12}
Evaluation
sliding window eval
parameters: {"stride":64}
Initialization
CANON delta gate near-identity init
Initializes CANON delta gate with g=-4.0 so the path starts near identity.
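Assuming the delta gate is a sigmoid of g (not stated explicitly in the report), the near-identity claim checks out numerically:

```python
import math

# With g = -4.0 at init, a sigmoid delta gate opens the CANON path to only
# ~1.8% of its full strength, so each gated block starts near identity.
gate_at_init = 1.0 / (1.0 + math.exp(4.0))   # sigmoid(-4.0)
```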
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
Regularization
weight decay
parameters: {"matrix":0.04,"adam":0.04}
layerwise LN scale
parameters: {"enabled":true}

Novel Contributions

  • Scoped CANON placement on the last 5 layers (AC(last5))
  • CANON delta gate to modulate the residual CANON path
  • Tight SWA schedule under a 600-second wallclock cap
  • Combination of AC(last5)+delta with leaderboard-aligned components to improve val_bpb
  • Int6 quantization with QAT on MLP and attention weights