PR #156
feat(record): Int6 STE + NorMuon + SWA + Sliding Window (val_bpb=1.16019)
by dexhunter
val_bpb
1.1602
Architecture
Transformer
Optimizer
NorMuon
Artifact Size
15,045,740 bytes
Training Techniques
Quantization
int6
bits: 6
scope: per-row weights; embeddings kept fp16
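The per-row int6 scheme can be sketched as follows. This is a minimal illustrative implementation, not the PR's actual code: each row gets its own symmetric scale so that its largest-magnitude entry maps to ±31, and weights are stored as small integers plus one scale per row.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Per-row symmetric int6 quantization (illustrative sketch).

    Each row gets its own scale so the largest-magnitude entry in that
    row maps to +/-31, the symmetric int6 range used in this PR."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid div-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([[0.5, -1.0, 0.25],
              [2.0,  0.1, -2.0]], dtype=np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)  # reconstruction error is at most half a scale step per row
```

Per-row (rather than per-tensor) scales keep rows with small weights from being crushed by a single large outlier elsewhere in the matrix.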
Architecture
MLP3x
3x wider MLP hidden layer to increase capacity within the artifact budget
parameters: {"dimensions":1536}
tied embeddings
The embedding tensor is tied with the output head and stored in fp16; it is never quantized
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"layers":9,"model_dim":512,"attention_heads":8,"kv_heads":4}
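A hedged sketch of the attention-score computation under this config (8 query heads sharing 4 KV heads, so two query heads per KV head; head_dim = 512 / 8 = 64). The function name and shapes are assumptions for illustration:

```python
import numpy as np

model_dim, n_heads, n_kv_heads = 512, 8, 4   # from the listed config
head_dim = model_dim // n_heads              # 64
group = n_heads // n_kv_heads                # 2 query heads per KV head

def gqa_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """q: (n_heads, T, head_dim); k: (n_kv_heads, T, head_dim).
    Each KV head is repeated so that `group` query heads share it,
    halving the KV cache relative to full multi-head attention."""
    k_rep = np.repeat(k, group, axis=0)      # (n_heads, T, head_dim)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)

rng = np.random.default_rng(0)
T = 4
q = rng.standard_normal((n_heads, T, head_dim))
k = rng.standard_normal((n_kv_heads, T, head_dim))
scores = gqa_scores(q, k)                    # (8, T, T)
```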
RoPE
Uses RoPE positional encoding with learnable Q gain
parameters: {"q_gain_init":1.5}
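A minimal RoPE sketch with the learnable query gain applied after rotation. The placement of the gain (a scalar multiplier on the rotated queries, initialized to 1.5) is an assumption; the PR only specifies `q_gain_init`:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (T, head_dim).
    Pairs dimension i with dimension i + head_dim/2 and rotates each
    pair by a position- and frequency-dependent angle."""
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)         # (half,)
    angles = np.arange(T)[:, None] * freqs[None, :]   # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q_gain = 1.5                                  # learnable scalar, init per the config
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))              # (T, head_dim)
q_rot = rope(q) * q_gain                      # hypothetical gain placement
```

Rotation preserves vector norms, so the gain is the only source of magnitude change on the query side.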
Optimizer
NorMuon
weight_decay: null
momentum: 0.99
other_params: {"beta2":0.95,"matrix_lr":0.02,"warmdown_iters":3000,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
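The update can be sketched as Muon-style Newton-Schulz orthogonalization followed by a neuron-wise (per-row) second-moment normalization, which is the distinguishing feature of NorMuon. This is a simplified sketch, not the PR's implementation; the Newton-Schulz coefficients are the commonly used Muon constants, and the exact normalization details are assumptions:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration (as in Muon) that approximately
    orthogonalizes G, pushing its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def normuon_step(w, grad, m, v, lr=0.02, momentum=0.99, beta2=0.95, eps=1e-8):
    """One sketched NorMuon step: momentum accumulation, orthogonalized
    update, then per-row second-moment normalization (buffer `v`)."""
    m = momentum * m + grad
    u = newton_schulz(m)
    v = beta2 * v + (1 - beta2) * (u ** 2).mean(axis=1)
    u = u / (np.sqrt(v)[:, None] + eps)
    return w - lr * u, m, v
```

Per the listed hyperparameters, momentum itself is warmed up from 0.92 to 0.99 over the first 1500 steps; that schedule is omitted here for brevity.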
Weight Averaging
SWA
parameters: {"checkpoints":7,"interval_steps":200}
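Stochastic weight averaging here is an equal-weight average of 7 checkpoints taken every 200 steps (1400 steps, which fits inside the 3000-step warmdown). A minimal sketch, assuming checkpoints are dicts of parameter arrays:

```python
import numpy as np

def swa_average(checkpoints: list[dict]) -> dict:
    """Equal-weight average of checkpoint dicts (name -> array)."""
    n = len(checkpoints)
    return {k: sum(ck[k] for ck in checkpoints) / n for k in checkpoints[0]}

# Toy usage: 7 checkpoints, saved every 200 steps during warmdown.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)
```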
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":960}
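The window plan can be sketched in the style of standard stride-based perplexity evaluation: each window covers up to 960 tokens, windows advance by 64 tokens, and only tokens not scored by a previous window are scored, so most tokens see up to 896 tokens of preceding context. The helper name is an assumption:

```python
def sliding_windows(n_tokens: int, context_length: int = 960, stride: int = 64):
    """Return (start, end, n_scored) spans. Each window advances by `stride`;
    only tokens beyond the previous window's end are scored, so every token
    is evaluated exactly once."""
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + context_length, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

plan = sliding_windows(2000)   # e.g. a 2000-token eval document
```

The trade-off is compute: a small stride means many overlapping forward passes, but each scored token gets close to the full context window behind it.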
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
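A minimal sketch of the warmdown schedule: constant LR, then a linear decay to zero over the final 3000 steps. The function name and the linear shape of the decay are assumptions:

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
    """LR multiplier: 1.0 until the warmdown begins, then linear decay
    to 0.0 over the last `warmdown_steps` steps."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```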
Other
other
Straight-through estimator with fake int6 per-row quantization applied on every forward pass during training
parameters: {"range":[-31,31]}
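The straight-through estimator can be sketched as: the forward pass sees quantize-then-dequantize values, while the backward pass treats the rounding as the identity so gradients reach the full-precision master weights unchanged. A minimal numpy sketch (in an autodiff framework the same effect is commonly obtained with `w + (fake_quant(w) - w).detach()`):

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Forward-pass fake quantization: per-row quantize to [-31, 31],
    then dequantize; the network computes with these snapped values."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(w / scale), -31, 31) * scale

def ste_backward(grad_out: np.ndarray) -> np.ndarray:
    """Straight-through estimator: the non-differentiable round/clip is
    treated as identity, so the upstream gradient passes through as-is."""
    return grad_out
```

This is what makes training with int6 weights possible: rounding has zero gradient almost everywhere, so without the straight-through bypass the master weights would never receive a learning signal.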
other
U-Net style skip connections with learnable per-layer per-dimension skip weights
parameters: null
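The skip pattern can be sketched for the 9-layer stack above: the first four blocks push their outputs onto a stack, the last four pop them and mix them back in through a learnable elementwise gate, and the middle block has no skip. The gate initialization and exact pairing are assumptions:

```python
import numpy as np

n_layers, d = 9, 512   # matches the 9-layer, 512-dim config listed above

# Hypothetical learnable per-layer, per-dimension skip gates (zero-init here).
skip_gates = [np.zeros(d) for _ in range(n_layers // 2)]

def forward(x: np.ndarray, blocks: list) -> np.ndarray:
    """U-Net pattern: encoder half pushes activations, decoder half pops
    them and adds them back through an elementwise learnable gate."""
    stack = []
    for i, block in enumerate(blocks):
        if i < n_layers // 2:          # first 4 blocks: push
            x = block(x)
            stack.append(x)
        elif i == n_layers // 2:       # middle block: no skip
            x = block(x)
        else:                          # last 4 blocks: pop and gate
            gate = skip_gates[i - n_layers // 2 - 1]
            x = block(x + gate * stack.pop())
    return x

blocks = [lambda t: t + 1.0 for _ in range(n_layers)]
out = forward(np.zeros(d), blocks)    # with zero gates: a plain 9-block stack
```

With zero-initialized gates the network starts as a plain residual stack and learns, per layer and per dimension, how much of the early activations to reinject.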
Novel Contributions
- Int6 STE fake quantization during training with straight-through gradient bypass
- NorMuon optimizer with row-normalized Newton-Schulz updates
- 3x wider MLP enabled by int6 compression savings
- FP16 tied embedding passthrough to protect quantization-sensitive weights
- Sliding window evaluation with stride 64 for longer effective context
- SWA over 7 checkpoints during warmdown
- Zstd-22 artifact compression
- U-Net skip connections with learnable skip weights