PR #312 (open)

Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)

by chanwoo-park-official

val_bpb: 1.1668
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,267,347 bytes

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: int6 for MLP and attention; int8 for other large tensors
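The mixed int6/int8 scheme can be sketched as plain symmetric per-tensor quantization, where only the bit width differs between tensor groups. This is a minimal illustration, not the PR's actual packing code; function names and the use of a per-tensor (rather than per-channel) scale are my assumptions.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    Returns (codes, scale) such that w ~= codes * scale."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax
    codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

# int6 for MLP/attention weights, int8 for the other large tensors
w = np.random.randn(64, 64).astype(np.float32)
codes6, s6 = quantize_symmetric(w, bits=6)
codes8, s8 = quantize_symmetric(w, bits=8)
```

The int6 codes span only [-31, 31], so the artifact stores six usable bits per weight; int8 keeps a finer grid for the tensors that are more sensitive to rounding.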
Architecture
Canon ACD
Canon layers placed before attention (Canon-A), before the MLP (Canon-C), and in the widened MLP hidden stream (Canon-D), avoiding the expensive QKV placement (Canon-B).
parameters: {"set":"ACD","kernel":3}
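As I understand them, Canon layers are causal depthwise 1-D convolutions over the sequence axis, added residually to the activation stream; the A/C/D placements apply the same operation at three points in the block. A minimal NumPy sketch with kernel size 3 (the function name and residual form are my assumptions):

```python
import numpy as np

def canon_layer(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Causal depthwise conv added residually.
    x: (T, d) activations; w: (K, d) per-channel kernel, K=3 here.
    out[t] = x[t] + sum_k w[k] * x[t - (K-1) + k], with left zero-padding."""
    T, d = x.shape
    K = w.shape[0]
    pad = np.concatenate([np.zeros((K - 1, d), x.dtype), x], axis=0)
    conv = sum(w[k] * pad[k : k + T] for k in range(K))
    return x + conv

x = np.random.randn(10, 8).astype(np.float32)
w = np.random.randn(3, 8).astype(np.float32) * 0.1
y = canon_layer(x, w)
```

Because the padding is strictly on the left, position t only mixes in positions t-1 and t-2, preserving causality; skipping the QKV placement (Canon-B) avoids running this conv three extra times per attention block.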
BigramHash
Bigram hash embedding added as a context extra.
parameters: {"bigram_vocab_size":2048,"bigram_dim":128}
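A bigram hash embedding hashes each (previous, current) token pair into a small bucket table (2048 buckets, 128-dim here, matching the PR's parameters) and looks up a learned vector to add to the context. This sketch uses an illustrative multiplicative hash and a BOS id of 0; both are my choices, not the PR's.

```python
import numpy as np

def bigram_hash_embed(tokens: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Hash each (prev, cur) token pair into len(table) buckets and look up
    a learned embedding. Position 0 is paired with an assumed BOS id of 0."""
    prev = np.concatenate([[0], tokens[:-1]])
    idx = (prev * 1000003 + tokens) % table.shape[0]   # cheap illustrative hash
    return table[idx]                                   # (T, bigram_dim)

rng = np.random.default_rng(0)
table = rng.normal(size=(2048, 128)).astype(np.float32)  # bigram_vocab_size x bigram_dim
tokens = rng.integers(0, 50257, size=16)
e = bigram_hash_embed(tokens, table)
```

Hash collisions are accepted by design: 2048 buckets is far smaller than vocab², but the table is cheap and the model learns around collisions.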
SmearGate
SmearGate context component, used alongside the bigram hash embeddings.
parameters: null
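The PR leaves SmearGate's parameters null, so the exact form is unclear; the common "smear" trick is to blend a fraction of the previous token's embedding into the current one via a learned sigmoid gate. The following is a guess at that shape, with all names and the gating form being my assumptions:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x: np.ndarray, gate_w: np.ndarray) -> np.ndarray:
    """Blend each token embedding with its predecessor:
    out[t] = x[t] + sigmoid(x[t] @ gate_w) * x[t-1]; position 0 is unchanged.
    The gating form is assumed, not taken from the PR."""
    g = sigmoid(x @ gate_w)          # (T, 1) per-token gate in (0, 1)
    out = x.copy()
    out[1:] = x[1:] + g[1:] * x[:-1]
    return out

x = np.random.randn(8, 16).astype(np.float32)
gate_w = np.random.randn(16, 1).astype(np.float32) * 0.1
y = smear_gate(x, gate_w)
```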
KV head count
Grouped-query attention: uses fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
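With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads, halving the KV cache. The sharing reduces to repeating KV tensors along the head axis before standard attention; a minimal sketch (function name is mine):

```python
import numpy as np

def gqa_expand_kv(kv: np.ndarray, num_heads: int) -> np.ndarray:
    """Share each KV head across num_heads // num_kv_heads query heads by
    repeating along the head axis. kv: (num_kv_heads, T, head_dim)."""
    num_kv = kv.shape[0]
    assert num_heads % num_kv == 0
    return np.repeat(kv, num_heads // num_kv, axis=0)  # (num_heads, T, head_dim)

k = np.random.randn(4, 10, 32).astype(np.float32)   # num_kv_heads=4
k_full = gqa_expand_kv(k, num_heads=8)              # expanded for 8 query heads
```

`np.repeat` duplicates each KV head consecutively, so query heads 0 and 1 attend over KV head 0, heads 2 and 3 over KV head 1, and so on.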
MLP3x
Transformer MLP widened with multiplier 3.0.
parameters: {"mlp_mult":3}
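The 3x multiplier sets the MLP hidden width to 3 * d_model rather than the conventional 4x. A minimal sketch; the squared-ReLU activation is an illustrative choice, not confirmed by the PR:

```python
import numpy as np

def mlp3x(x: np.ndarray, d_model: int, mlp_mult: int = 3) -> np.ndarray:
    """Two-layer transformer MLP with hidden width mlp_mult * d_model.
    Weights are randomly initialized here purely for illustration."""
    rng = np.random.default_rng(0)
    w_in = rng.normal(size=(d_model, mlp_mult * d_model)).astype(np.float32) * 0.02
    w_out = rng.normal(size=(mlp_mult * d_model, d_model)).astype(np.float32) * 0.02
    h = np.maximum(x @ w_in, 0.0) ** 2   # squared ReLU (assumed activation)
    return h @ w_out

x = np.random.randn(4, 64).astype(np.float32)
y = mlp3x(x, d_model=64)
```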
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"adam_weight_decay":0.04}
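The momentum warmup ramps Muon's momentum from 0.92 to its final 0.99 over the first 1500 steps. The endpoints and step count are from the PR's parameters; the linear ramp shape is my assumption.

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Linearly ramp momentum from `start` to `end` over `warmup_steps`,
    then hold at `end` for the rest of training."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Starting with lower momentum keeps early updates from being dominated by noisy initial gradients; once statistics stabilize, the higher momentum smooths the trajectory.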
Weight Averaging
SWA
parameters: {"enabled":1,"every":200,"start_lrmul":0.5,"averaged_checkpoints":8}
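The SWA config (every=200, averaged_checkpoints=8) reads as: snapshot the weights every 200 steps and average the last 8 snapshots. The rolling-window interpretation and the class below are my assumptions about that config, not the PR's code.

```python
from collections import deque
import numpy as np

class SWA:
    """Average the last `window` weight snapshots, taken every `every` steps."""
    def __init__(self, every: int = 200, window: int = 8):
        self.every = every
        self.snaps = deque(maxlen=window)   # oldest snapshots are evicted

    def maybe_snapshot(self, step: int, params):
        if step % self.every == 0:
            self.snaps.append([p.copy() for p in params])

    def averaged(self):
        n = len(self.snaps)
        return [sum(s[i] for s in self.snaps) / n
                for i in range(len(self.snaps[0]))]

# toy run: one scalar-ish parameter that increments each "checkpoint"
swa = SWA(every=200, window=8)
p = [np.zeros(3)]
for step in range(0, 2001, 200):
    p = [p[0] + 1.0]
    swa.maybe_snapshot(step, p)
avg = swa.averaged()
```

The `start_lrmul: 0.5` parameter suggests the averaging window runs at a reduced learning-rate multiplier, consistent with applying SWA near the end of training.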
Evaluation
sliding window eval
parameters: {"stride":64}
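Sliding-window eval scores each token with (near-)full left context: the window advances by `stride` tokens at a time, and loss is only counted on the newly uncovered tokens. A sketch of the span bookkeeping with toy sizes (the PR uses stride 64 with eval_length 2048; helper name is mine):

```python
def sliding_eval_positions(seq_len: int, window: int, stride: int):
    """Return (chunk_start, score_start, score_end) triples such that the
    score regions tile [0, seq_len) and each scored chunk fits in `window`."""
    spans, pos = [], 0
    while pos < seq_len:
        start = max(0, pos + stride - window)   # left context for this chunk
        spans.append((start, pos, min(pos + stride, seq_len)))
        pos += stride
    return spans

spans = sliding_eval_positions(seq_len=256, window=128, stride=64)
```

A smaller stride means more forward passes but less context truncation per scored token, which generally lowers the measured bpb relative to naive non-overlapping chunking; that is why the stride belongs in the reported eval configuration.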
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
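The schedule is trapezoidal: a 20-step linear warmup, a constant plateau, then a linear warmdown to zero over the final 3000 iterations. The step counts are from the PR's parameters; the linear shapes are my assumption.

```python
def lr_mult(step: int, total_steps: int, warmup_steps: int = 20,
            warmdown_iters: int = 3000) -> float:
    """Learning-rate multiplier: linear warmup, constant plateau,
    then linear warmdown to 0 over the last `warmdown_iters` steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```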
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adam_weight_decay":0.04}

Novel Contributions

  • Mixed int6 quantization for MLP and attention with int8 for other large tensors
  • Canon ACD placement with kernel size 3 to retain Canon benefits while avoiding QKV cost
  • Bigram hash embedding and SmearGate context extras
  • Muon + Adam mixed optimization with momentum warmup and LR warmdown
  • SWA near the end of training
  • Sliding-window evaluation with stride 64 as the main comparison metric