PR #586
open11L + Hadamard Rotation + VE128 + cuDNN SDPA (val_bpb: 1.1365, 3-seed mean)
by EaCognitiveView on GitHub
val_bpb
1.1365
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
~15.6 MB
Training Techniques
Quantization
int6 per-row with Hadamard rotation
bits: 6
scope: all weights
Architecture
XSA
Exclusive Self-Attention on last 4 layers with GQA-aware design
parameters: {"layers":4}
SmearGate
Gating mechanism integrated into the architecture
parameters: null
BigramHash
Bigram hashing with 2048 buckets and inner dimension 128
parameters: {"buckets":2048,"inner_dim":128}
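A minimal sketch of the bigram bucketing step: each consecutive token pair is hashed into one of 2048 buckets, and each bucket would index a learned 128-dim embedding. The mixing constants below are illustrative, not the PR's actual hash function.

```python
# Hypothetical bigram-hash bucketing sketch. Only the bucketing is shown;
# the 128-dim embedding lookup the buckets feed is assumed, not the PR's code.
N_BUCKETS = 2048

def bigram_bucket(prev_token: int, token: int) -> int:
    # Combine the pair with a 64-bit multiplicative mix, then fold
    # into the bucket range. The constant is an illustrative choice.
    h = (prev_token * 0x9E3779B97F4A7C15 + token) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 29
    return h % N_BUCKETS

def bigram_buckets(tokens: list[int]) -> list[int]:
    # One bucket id per position; the first position pairs with a pad of 0.
    return [bigram_bucket(p, t) for p, t in zip([0] + tokens[:-1], tokens)]
```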
Partial RoPE
Rotary positional embeddings applied partially (16/64 dims)
parameters: {"dimensions":16}
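Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the rest through unchanged. A pure-Python sketch, assuming the standard RoPE frequency scheme (base 10000):

```python
import math

def partial_rope(x: list[float], pos: int,
                 rot_dims: int = 16, base: float = 10000.0) -> list[float]:
    # Rotate only the first `rot_dims` entries of a head vector (pairwise
    # 2D rotations); remaining dims pass through unchanged. Frequency
    # layout follows the usual RoPE convention, assumed here.
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```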
MLP3x
MLP with 3x expansion and relu-squared activation
parameters: {"expansion":3}
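The MLP block uses a 3x hidden expansion with a ReLU-squared activation. A minimal dense sketch (the weight layout is hypothetical; only the expansion factor and activation come from the listing):

```python
def mlp3x(x: list[float], w_in: list[list[float]],
          w_out: list[list[float]]) -> list[float]:
    # Hidden dim = 3 * model dim; activation = max(h, 0)**2 (relu-squared).
    d = len(x)
    h = [sum(x[i] * w_in[i][j] for i in range(d)) for j in range(3 * d)]
    h = [max(v, 0.0) ** 2 for v in h]
    return [sum(h[j] * w_out[j][k] for j in range(3 * d)) for k in range(d)]
```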
Shared Value Embeddings (VE128)
Shared value embeddings of dimension 128 on layers 9 and 10 with per-layer learned scales
parameters: {"dim":128,"layers":[9,10]}
Layer Norm Scale
Layer norm scale factor 1/sqrt(layer_idx+1)
parameters: null
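The layer-norm scale factor is given directly by the listing:

```python
import math

def ln_scale(layer_idx: int) -> float:
    # Scale applied per layer: 1 / sqrt(layer_idx + 1), as listed above.
    return 1.0 / math.sqrt(layer_idx + 1)
```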
U-Net skip connections
5 encoder and 6 decoder skip connections
parameters: {"encoder":5,"decoder":6}
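A sketch of U-Net-style skips in a transformer stack: encoder-half activations are stacked, then added back in the decoder half. The listing gives 5 encoder and 6 decoder connections; the exact pairing (and any learned mixing weights) is not specified here, so this sketch pairs the last `len(enc)` decoder layers with encoder outputs in reverse order and uses unit weights.

```python
def unet_forward(x, enc, dec):
    # Push each encoder-layer output onto a stack, then pop and add
    # into the later decoder layers. Pairing scheme is an assumption.
    skips = []
    for layer in enc:
        x = layer(x)
        skips.append(x)
    n = len(skips)
    for i, layer in enumerate(dec):
        if i >= len(dec) - n:
            x = x + skips.pop()
        x = layer(x)
    return x
```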
cuDNN SDPA
cuDNN scaled dot-product attention backend with FlashAttention 3 conditional fallback
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92,"momentum_warmup_end":0.99}
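The Muon momentum warmup above ramps from 0.92 to 0.99 over 1500 steps. A sketch assuming linear interpolation (the listing gives only the endpoints and step count, not the ramp shape):

```python
def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, end: float = 0.99) -> float:
    # Linearly ramp momentum over the warmup, then hold at the end value.
    # The linear shape is an assumption.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```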
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
Weight Averaging
EMA
parameters: {"decay":0.997}
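EMA weight averaging with decay 0.997 reduces to a one-line update per parameter:

```python
def ema_update(avg: list[float], new: list[float],
               decay: float = 0.997) -> list[float]:
    # Exponential moving average of weights: avg <- decay*avg + (1-decay)*new.
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]
```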
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
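Sliding-window evaluation with stride 64: each window advances by the stride, and only tokens not yet scored by an earlier window are counted, so every token is scored exactly once. The convention of scoring the trailing tokens per window is the common one; the PR's exact variant is assumed.

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    # Yield (start, end, score_from) triples: the window covers
    # [start, end) and only tokens in [score_from, end) are scored.
    out = []
    start, scored_to = 0, 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        out.append((start, end, scored_to))
        scored_to = end
        start += stride
    return out
```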
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"schedule":"cosine"}
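The LR schedule above names a cosine warmdown over the final 3500 steps. A sketch assuming a constant LR before the warmdown and a cosine decay to zero (both assumptions; the listing gives only the schedule type and step count):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3500) -> float:
    # Constant LR, then cosine warmdown to zero over the final steps.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps  # 0 -> 1 across the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```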
Regularization
weight decay
parameters: {"weight_decay":0.04}
Initialization
Orthogonal initialization
Orthogonal init with projection scaling by 1/sqrt(2*num_layers)
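A pure-Python sketch of orthogonal init with the stated 1/sqrt(2*num_layers) projection scaling. Gram-Schmidt on a random Gaussian matrix stands in for the usual QR-based routine; the seed and matrix size are illustrative.

```python
import random

def orthogonal_init(n: int, num_layers: int, seed: int = 0) -> list[list[float]]:
    # Build an orthonormal basis via Gram-Schmidt on Gaussian rows,
    # then scale by 1/sqrt(2*num_layers) as described for projections.
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for r in rows:
        for b in basis:
            d = sum(x * y for x, y in zip(r, b))
            r = [x - d * y for x, y in zip(r, b)]
        norm = sum(x * x for x in r) ** 0.5
        basis.append([x / norm for x in r])
    s = (2 * num_layers) ** -0.5
    return [[s * x for x in row] for row in basis]
```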
Other
other
Hadamard rotation is applied to weight matrices before int6 quantization to spread outlier values uniformly across each row, which improves zstd compressibility and narrows the quantization gap
parameters: null
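The rotate-quantize-rotate-back pipeline can be sketched end to end in pure Python: a normalized fast Walsh-Hadamard transform (orthonormal, hence self-inverse and data-free), symmetric per-row int6 quantization, then the inverse rotation. Quantizer details (symmetric levels in [-31, 31], one scale per row) are a plausible reading of "int6 per-row", not confirmed specifics of the PR.

```python
def hadamard(vec: list[float]) -> list[float]:
    # Normalized fast Walsh-Hadamard transform (length must be a power
    # of two). Orthonormal, so applying it twice recovers the input.
    v = list(vec)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [x * scale for x in v]

def quant_int6_row(row: list[float]) -> tuple[list[int], float]:
    # Symmetric per-row int6: one scale per row, levels in [-31, 31]
    # (an assumed quantizer layout).
    s = max(abs(x) for x in row) / 31.0 or 1.0
    q = [max(-31, min(31, round(x / s))) for x in row]
    return q, s

def quant_with_hadamard(row: list[float]) -> list[float]:
    # Sketch of the pipeline: rotate, quantize/dequantize, rotate back.
    # The rotation needs no calibration data, matching the PR's claim.
    q, s = quant_int6_row(hadamard(row))
    return hadamard([qi * s for qi in q])
```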
Novel Contributions
- First application of Walsh-Hadamard rotation for int6 per-row quantization in this competition
- Hadamard rotation improves zstd compression from 1.70x to 1.76x and reduces quantization gap from 0.0093 to 0.0084 BPB
- Hadamard rotation is data-free and deterministic, requiring no calibration or training data access at evaluation
- Hadamard rotation and GPTQ are substitutes at int6 precision; GPTQ adds no benefit when Hadamard rotation is used
- The compression improvement recovers 530 KB of artifact headroom, enabling Shared Value Embeddings (VE128) on layers 9-10
- CPU parameter probe guided hyperparameter selection across 9.5M configurations, reducing GPU compute by ~84%
- Identification and removal of dead QAT code improved throughput by 7%
- Quantizing BigramHash projection to int6 improves compression with negligible noise
- Use of cuDNN SDPA backend with FlashAttention 3 conditional fallback