PR #586
open11L + Hadamard Rotation + VE128 + cuDNN SDPA (val_bpb: 1.1365, 3-seed mean)
by EaCognitiveView on GitHub
val_bpb
1.1365
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
~15.6 MB
Training Techniques
Quantization
int6 per-row with Hadamard rotation
bits: 6
scope: all weights
Architecture
XSA
Exclusive Self-Attention on last 4 layers with GQA-aware design
parameters: {"layers":4}
SmearGate
Gating mechanism integrated into the architecture
parameters: null
BigramHash
Bigram hashing with 2048 buckets and inner dimension 128
parameters: {"buckets":2048,"inner_dim":128}
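A minimal sketch of the bigram bucketing step: each consecutive token pair is hashed into one of 2048 buckets, and each bucket would index a learned 128-dim embedding. The mixing constants below are illustrative, not the PR's actual hash function.

```python
# Hypothetical bigram-hash bucketing sketch. Only the bucketing is shown;
# the 128-dim embedding lookup the buckets feed is assumed, not the PR's code.
N_BUCKETS = 2048

def bigram_bucket(prev_token: int, token: int) -> int:
    # Combine the pair with a 64-bit multiplicative mix, then fold
    # into the bucket range. The constant is an illustrative choice.
    h = (prev_token * 0x9E3779B97F4A7C15 + token) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 29
    return h % N_BUCKETS

def bigram_buckets(tokens: list[int]) -> list[int]:
    # One bucket id per position; the first position pairs with a pad of 0.
    return [bigram_bucket(p, t) for p, t in zip([0] + tokens[:-1], tokens)]
```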
Partial RoPE
Rotary positional embeddings applied partially (16/64 dims)
parameters: {"dimensions":16}
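Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the rest through unchanged. A pure-Python sketch, assuming the standard RoPE frequency scheme (base 10000):

```python
import math

def partial_rope(x: list[float], pos: int,
                 rot_dims: int = 16, base: float = 10000.0) -> list[float]:
    # Rotate only the first `rot_dims` entries of a head vector (pairwise
    # 2D rotations); remaining dims pass through unchanged. Frequency
    # layout follows the usual RoPE convention, assumed here.
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```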
MLP3x
MLP with 3x expansion and relu-squared activation
parameters: {"expansion":3}
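The MLP block uses a 3x hidden expansion with a ReLU-squared activation. A minimal dense sketch (the weight layout is hypothetical; only the expansion factor and activation come from the listing):

```python
def mlp3x(x: list[float], w_in: list[list[float]],
          w_out: list[list[float]]) -> list[float]:
    # Hidden dim = 3 * model dim; activation = max(h, 0)**2 (relu-squared).
    d = len(x)
    h = [sum(x[i] * w_in[i][j] for i in range(d)) for j in range(3 * d)]
    h = [max(v, 0.0) ** 2 for v in h]
    return [sum(h[j] * w_out[j][k] for j in range(3 * d)) for k in range(d)]
```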
Shared Value Embeddings (VE128)
Shared value embeddings of dimension 128 on layers 9 and 10 with per-layer learned scales
parameters: {"dim":128,"layers":[9,10]}
Layer Norm Scale
Layer norm scale factor 1/sqrt(layer_idx+1)
parameters: null
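The layer-norm scale factor is given directly by the listing:

```python
import math

def ln_scale(layer_idx: int) -> float:
    # Scale applied per layer: 1 / sqrt(layer_idx + 1), as listed above.
    return 1.0 / math.sqrt(layer_idx + 1)
```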
U-Net skip connections
5 encoder and 6 decoder skip connections
parameters: {"encoder":5,"decoder":6}
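A sketch of U-Net-style skips in a transformer stack: encoder-half activations are stacked, then added back in the decoder half. The listing gives 5 encoder and 6 decoder connections; the exact pairing (and any learned mixing weights) is not specified here, so this sketch pairs the last `len(enc)` decoder layers with encoder outputs in reverse order and uses unit weights.

```python
def unet_forward(x, enc, dec):
    # Push each encoder-layer output onto a stack, then pop and add
    # into the later decoder layers. Pairing scheme is an assumption.
    skips = []
    for layer in enc:
        x = layer(x)
        skips.append(x)
    n = len(skips)
    for i, layer in enumerate(dec):
        if i >= len(dec) - n:
            x = x + skips.pop()
        x = layer(x)
    return x
```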
cuDNN SDPA
cuDNN scaled dot-product attention backend with FlashAttention 3 conditional fallback
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92,"momentum_warmup_end":0.99}
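The Muon momentum warmup above ramps from 0.92 to 0.99 over 1500 steps. A sketch assuming linear interpolation (the listing gives only the endpoints and step count, not the ramp shape):

```python
def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, end: float = 0.99) -> float:
    # Linearly ramp momentum over the warmup, then hold at the end value.
    # The linear shape is an assumption.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```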
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
Weight Averaging
EMA
parameters: {"decay":0.997}
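EMA weight averaging with decay 0.997 reduces to a one-line update per parameter:

```python
def ema_update(avg: list[float], new: list[float],
               decay: float = 0.997) -> list[float]:
    # Exponential moving average of weights: avg <- decay*avg + (1-decay)*new.
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]
```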
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
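Sliding-window evaluation with stride 64: each window advances by the stride, and only tokens not yet scored by an earlier window are counted, so every token is scored exactly once. The convention of scoring the trailing tokens per window is the common one; the PR's exact variant is assumed.

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    # Yield (start, end, score_from) triples: the window covers
    # [start, end) and only tokens in [score_from, end) are scored.
    out = []
    start, scored_to = 0, 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        out.append((start, end, scored_to))
        scored_to = end
        start += stride
    return out
```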
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"schedule":"cosine"}
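The LR schedule above names a cosine warmdown over the final 3500 steps. A sketch assuming a constant LR before the warmdown and a cosine decay to zero (both assumptions; the listing gives only the schedule type and step count):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3500) -> float:
    # Constant LR, then cosine warmdown to zero over the final steps.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps  # 0 -> 1 across the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```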
Regularization
weight decay
parameters: {"weight_decay":0.04}
Initialization
Orthogonal initialization
Orthogonal init with projection scaling by 1/sqrt(2*num_layers)
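A pure-Python sketch of orthogonal init with the stated 1/sqrt(2*num_layers) projection scaling. Gram-Schmidt on a random Gaussian matrix stands in for the usual QR-based routine; the seed and matrix size are illustrative.

```python
import random

def orthogonal_init(n: int, num_layers: int, seed: int = 0) -> list[list[float]]:
    # Build an orthonormal basis via Gram-Schmidt on Gaussian rows,
    # then scale by 1/sqrt(2*num_layers) as described for projections.
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for r in rows:
        for b in basis:
            d = sum(x * y for x, y in zip(r, b))
            r = [x - d * y for x, y in zip(r, b)]
        norm = sum(x * x for x in r) ** 0.5
        basis.append([x / norm for x in r])
    s = (2 * num_layers) ** -0.5
    return [[s * x for x in row] for row in basis]
```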
Other
other
Hadamard rotation is applied to weight matrices before int6 quantization to spread outlier values uniformly across each row, which improves zstd compressibility and narrows the quantization gap
parameters: null
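The rotate-quantize-rotate-back pipeline can be sketched end to end in pure Python: a normalized fast Walsh-Hadamard transform (orthonormal, hence self-inverse and data-free), symmetric per-row int6 quantization, then the inverse rotation. Quantizer details (symmetric levels in [-31, 31], one scale per row) are a plausible reading of "int6 per-row", not confirmed specifics of the PR.

```python
def hadamard(vec: list[float]) -> list[float]:
    # Normalized fast Walsh-Hadamard transform (length must be a power
    # of two). Orthonormal, so applying it twice recovers the input.
    v = list(vec)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [x * scale for x in v]

def quant_int6_row(row: list[float]) -> tuple[list[int], float]:
    # Symmetric per-row int6: one scale per row, levels in [-31, 31]
    # (an assumed quantizer layout).
    s = max(abs(x) for x in row) / 31.0 or 1.0
    q = [max(-31, min(31, round(x / s))) for x in row]
    return q, s

def quant_with_hadamard(row: list[float]) -> list[float]:
    # Sketch of the pipeline: rotate, quantize/dequantize, rotate back.
    # The rotation needs no calibration data, matching the PR's claim.
    q, s = quant_int6_row(hadamard(row))
    return hadamard([qi * s for qi in q])
```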
Novel Contributions
- First application of Walsh-Hadamard rotation for int6 per-row quantization in this competition
- Hadamard rotation improves zstd compression from 1.70x to 1.76x and reduces quantization gap from 0.0093 to 0.0084 BPB
- Hadamard rotation is data-free and deterministic, requiring no calibration or training data access at evaluation
- Hadamard rotation and GPTQ are substitutes at int6 precision; GPTQ adds no benefit when Hadamard rotation is used
- The compression improvement recovers 530 KB of artifact headroom, enabling Shared Value Embeddings (VE128) on layers 9-10
- CPU parameter probe guided hyperparameter selection across 9.5M configurations, reducing GPU compute by ~84%
- Identification and removal of dead QAT code improved throughput by 7%
- Quantizing BigramHash projection to int6 improves compression with negligible noise
- Use of cuDNN SDPA backend with FlashAttention 3 conditional fallback