PR #1512 (open)

Record: Bank QAT + seq4096 + SWA w=256 + QK-Gain 2.5 + PKO — val_bpb 1.1117 (3-seed mean)

by Itssshikhar
val_bpb: 1.1117
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16,116,257 bytes

Training Techniques

Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Quantization
STE QAT
bits: 6
scope: all F.linear params
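A minimal sketch of the 6-bit straight-through-estimator (STE) fake quantization applied to the linear-layer weights. Function name and the plain-Python representation are illustrative; in the actual PyTorch run this would operate on tensors.

```python
def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: snap each weight to a
    (2**bits - 1)-level grid spanning [-max|w|, +max|w|].
    In PyTorch the STE trick is `w + (fake_quant(w) - w).detach()`, so
    the forward pass sees quantized weights while gradients flow through
    as if quantization were the identity."""
    qmax = 2 ** (bits - 1) - 1                  # 31 levels per side for 6 bits
    scale = max(abs(x) for x in w) / qmax or 1.0  # guard against all-zero w
    return [round(x / scale) * scale for x in w]
```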
GPTQ
bits: 6
scope: all weights
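A toy sketch of the error-compensation idea behind GPTQ: quantize one column at a time and push each column's rounding error onto the not-yet-quantized columns using the calibration Hessian. The real algorithm works with the inverse Hessian via a Cholesky factorization and lazy batched updates; this simplified stand-in uses raw Hessian rows, and all names are hypothetical.

```python
def gptq_quantize_row(w, H, scale):
    """Quantize one weight row column-by-column. After rounding column j,
    distribute its error onto columns k > j in proportion to H[j][k],
    where H = X^T X is the Hessian of the calibration inputs X
    (simplified: no Cholesky / inverse-Hessian machinery as in full GPTQ)."""
    w = list(w)
    q = [0.0] * len(w)
    for j in range(len(w)):
        q[j] = round(w[j] / scale) * scale
        err = (w[j] - q[j]) / H[j][j]
        for k in range(j + 1, len(w)):
            w[k] -= err * H[j][k]
    return q
```

The "full Hessian" in the contributions list refers to calibrating H over the entire calibration set rather than a subsample.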
Architecture
QK-Gain
Higher initial gain for QK-normalization
parameters: {"gain":2.5}
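A sketch of QK-normalization with the higher initial gain, assuming the usual RMS-norm-then-scale form; the function name and vector representation are illustrative.

```python
import math

def qk_norm(v, gain=2.5, eps=1e-6):
    """RMS-normalize a query/key vector, then multiply by a learnable
    gain. Initializing the gain at 2.5 (per the record) instead of the
    conventional 1.0 makes attention logits sharper from the start."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [gain * x / rms for x in v]
```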
Partial RoPE
Shift stationary RoPE dimensions of K forward by 1 position
parameters: {"offset":1}
SWA
Sliding window attention on lower layers, full attention on upper layers
parameters: {"window_size":256,"full_attention_layers":5}
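A sketch of the layer-dependent attention mask this configuration implies: causal everywhere, windowed to 256 tokens on the lower layers, full attention on the top 5. The mask-builder form is illustrative; a real implementation would use a fused attention kernel rather than a dense boolean mask.

```python
def attention_mask(seq_len, layer, n_layers, window=256, n_full=5):
    """Causal attention mask: the top `n_full` layers attend to all
    previous positions; lower layers only to the last `window` tokens.
    Entry [i][j] is True when position i may attend to position j."""
    full = layer >= n_layers - n_full
    return [[(j <= i) and (full or i - j < window)
             for j in range(seq_len)] for i in range(seq_len)]
```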
weight tying
Tied embeddings
parameters: null
BigramHash
Bigram hash embedding
parameters: {"buckets":3072,"dim":112}
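A sketch of the bigram hash lookup: the (previous, current) token pair is hashed into one of 3072 buckets, each holding a 112-dim embedding that augments the token embedding. The mixing constant is illustrative, not the exact scheme used in the run.

```python
def bigram_bucket(prev_tok, tok, n_buckets=3072):
    """Map a (previous, current) token-id pair to one of `n_buckets`
    rows of a hash-embedding table (each row 112-dim in this record).
    The multiplier is an arbitrary prime for illustration."""
    return (prev_tok * 1000003 + tok) % n_buckets
```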
U-Net skip connections
Encoder-decoder skip connections
parameters: {"encoder":5,"decoder":6}
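A sketch of the U-Net-style wiring with 5 encoder and 6 decoder blocks: encoder outputs are stashed and added back to the decoder blocks in reverse order. The exact pairing (which decoder layer goes skip-less, add vs. concat) is an assumption; here the first decoder layer gets no skip.

```python
def unet_forward(x, encoder_layers, decoder_layers):
    """Run encoder blocks, stash each output, then add the stashed
    activations back to the decoder blocks last-in-first-out. With 5
    encoder and 6 decoder layers (as in the record), the first decoder
    layer receives no skip connection."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    extra = len(decoder_layers) - len(encoder_layers)
    for i, layer in enumerate(decoder_layers):
        if i >= extra:
            x = x + skips[len(decoder_layers) - 1 - i]
        x = layer(x)
    return x
```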
SmearGate
SmearGate module
parameters: null
ReLU²
ReLU-squared activation
parameters: null
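The ReLU² activation, shown here element-wise in plain Python for clarity:

```python
def relu_squared(x):
    """ReLU-squared MLP activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2
```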
Compression
lzma
level: 9
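Compressing the serialized artifact with Python's standard-library `lzma` at the highest preset (function name and usage are illustrative):

```python
import lzma

def compress_checkpoint(raw: bytes) -> bytes:
    """Compress serialized model weights with LZMA at preset 9,
    the maximum compression level, as used for the submitted artifact."""
    return lzma.compress(raw, preset=9)
```

Combining with `lzma.PRESET_EXTREME` (`preset=9 | lzma.PRESET_EXTREME`) trades extra CPU time for a slightly smaller output.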
Weight Averaging
SWA
parameters: {"window_size":256}
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start_momentum":0.92}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.025}
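The Muon entry lists a final momentum of 0.99 with `warmup_start_momentum: 0.92`, suggesting a momentum warmup. A linear ramp is sketched below; the warmup length is illustrative, as the record only gives the two endpoints.

```python
def muon_momentum(step, warmup_steps=300, start=0.92, end=0.99):
    """Linearly ramp Muon's momentum from `start` (0.92 per the record)
    to `end` (0.99) over the first `warmup_steps` optimizer steps
    (warmup length is an assumption, not from the record)."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```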
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections
Regularization
logit softcap
parameters: {"value":30}
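Logit softcapping with value 30 is conventionally the scaled-tanh bound shown below, which is near-identity for small logits and saturates at ±30:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)
```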
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
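A sketch of the warmdown schedule: constant learning rate, then a linear decay to zero over the final 4,000 steps. The total step count below is a placeholder, not from the record.

```python
def lr_schedule(step, total_steps, base_lr, warmdown_steps=4000):
    """Constant LR, then a linear 'warmdown' to zero over the final
    `warmdown_steps` training steps."""
    into_warmdown = step - (total_steps - warmdown_steps)
    if into_warmdown <= 0:
        return base_lr
    return base_lr * (1.0 - into_warmdown / warmdown_steps)
```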

Novel Contributions

  • seq4096 training
  • bank-weight QAT on all F.linear parameters
  • QK-Gain 2.5
  • partial key offset for stationary RoPE dimensions
  • SWA with window size 256
  • full Hessian GPTQ calibration
  • lzma compression