PR #206

open

Record: Int6 STE + SmearGate + Seq2048 + OrthoInit + RoPE50K + SWA/100 (mean val_bpb=1.1507)

by dexhunter

val_bpb

1.1507

Architecture

Transformer

Optimizer

NorMuon

Artifact Size

14.79MB

Training Techniques

Quantization

int6 STE QAT

bits: 6

scope: all weights except fp16 tied embedding
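
To make the quantization entry concrete, here is a minimal sketch of int6 fake quantization with a straight-through estimator. The symmetric range [-31, 31], the per-row scale, and the function names are assumptions for illustration; the PR summary only states 6 bits and the scope.

```python
def fake_quant_int6(row):
    """Quantize one weight row to signed 6-bit codes and map it back to floats."""
    qmax = 31  # assumed symmetric range [-31, 31], which fits in 6 signed bits
    scale = max(abs(w) for w in row) / qmax or 1.0  # assumed per-row scale; 1.0 guards an all-zero row
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return [c * scale for c in codes], codes


def ste_backward(grad_wrt_quantized):
    """Straight-through estimator: treat round() as identity in the backward pass."""
    return list(grad_wrt_quantized)


weights = [0.50, -0.13, 0.02, -0.41]
dequantized, codes = fake_quant_int6(weights)
```

During training the forward pass would use `dequantized`, while the gradient from `ste_backward` updates the full-precision weights unchanged.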

Architecture

SmearGate

Learned gate blends token embeddings with predecessor representations.

parameters: {"params":512}
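
One plausible reading of the SmearGate description is a per-dimension sigmoid gate that mixes each token's embedding with the previous token's. The sketch below assumes that form. The function name, the sigmoid, and the handling of the first token are illustrative choices, not taken from the PR.

```python
import math


def smear_gate(embeddings, gate_logits):
    """Blend each token's embedding with the previous token's, per dimension."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]  # sigmoid keeps gates in (0, 1)
    out = []
    for t, x in enumerate(embeddings):
        prev = embeddings[t - 1] if t > 0 else x  # first token has no predecessor
        out.append([(1.0 - g) * xi + g * pi for g, xi, pi in zip(gates, x, prev)])
    return out


# Toy example: 3 tokens, 2 dimensions (the PR lists 512 gate parameters).
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
smeared = smear_gate(tokens, gate_logits=[0.0, 0.0])  # logit 0 gives an equal blend
```

With 512 gate parameters, this form would correspond to one gate per embedding dimension if the model width is 512, which the summary does not state directly.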

MLP3x

Wider MLP layers enabled by int6 compression savings.

parameters: {"hidden_size":1536}
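
A rough size calculation shows why 6-bit weights leave room for a wider MLP. It assumes a model width of 512 (inferred from 1536 / 3, not stated in the summary), two bias-free projection matrices per MLP, and a hypothetical int8 2x-MLP baseline for comparison.

```python
def mlp_weight_bytes(model_dim, hidden_size, bits):
    """Raw bytes for the two MLP projection matrices, before zstd compression."""
    n_weights = 2 * model_dim * hidden_size  # up-projection plus down-projection, no biases
    return n_weights * bits // 8


MODEL_DIM = 512  # assumption: 1536 / 3, not stated in the PR summary

wide_int6 = mlp_weight_bytes(MODEL_DIM, 1536, bits=6)
narrow_int8 = mlp_weight_bytes(MODEL_DIM, 1024, bits=8)  # hypothetical 2x MLP baseline
print(wide_int6, narrow_int8)
```

Under these assumptions a 3x MLP at 6 bits costs about 1.18 MB per layer before compression, close to the 1.05 MB of a 2x MLP at 8 bits.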

RoPE

Rotary position embeddings with adjusted base frequency for longer context.

parameters: {"base":50000}
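
The effect of the base can be seen in a small sketch of rotary embeddings. It uses the interleaved-pair convention; many implementations split the vector into halves instead, and the PR summary does not say which is used. The function names are illustrative.

```python
import math


def rope_rotate(vec, position, base=50000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # frequency falls as the pair index grows
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def longest_wavelength(dim, base):
    """Positions needed for the slowest pair to complete one full rotation."""
    return 2 * math.pi * base ** ((dim - 2) / dim)


q = [1.0, 0.0, 0.0, 1.0]
rotated = rope_rotate(q, position=7)
```

A larger base stretches the slowest wavelengths, so distant positions within a 2048-token window remain distinguishable in the low-frequency pairs.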

tied embeddings

Input and output embeddings are tied; embedding tensor is kept in fp16 and not quantized.

parameters: null

KV head count

Grouped-query attention with fewer KV heads than attention heads.

parameters: {"heads":8,"kv_heads":4}
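
With 8 query heads and 4 KV heads, each key/value head serves two query heads. The sketch below assumes the common contiguous grouping. The helper name is illustrative.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Index of the key/value head shared by a given query head."""
    assert heads % kv_heads == 0, "query heads must divide evenly among KV heads"
    group_size = heads // kv_heads
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```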

Optimizer

NorMuon

weight_decay: 0.02

momentum: 0.99

other_params: {"beta2":0.95,"warmup_start":0.92,"matrix_lr":0.021,"scalar_lr":0.02,"tied_embed_lr":0.03}

Weight Averaging

SWA

parameters: {"every_steps":100,"start_fraction_of_warmdown":0.5}
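
The SWA parameters can be read as "start averaging halfway through the warmdown, then add a snapshot every 100 steps". The sketch below assumes that reading and a 3000-step warmdown at the end of training. The total step count and the function names are hypothetical.

```python
def swa_should_snapshot(step, total_steps, warmdown_iters=3000, every_steps=100, start_fraction=0.5):
    """True on steps where a checkpoint joins the average."""
    swa_start = total_steps - warmdown_iters + int(start_fraction * warmdown_iters)
    return step >= swa_start and (step - swa_start) % every_steps == 0


def update_average(avg, weights, n_seen):
    """Incremental mean over snapshots: avg += (w - avg) / (n_seen + 1)."""
    return [a + (w - a) / (n_seen + 1) for a, w in zip(avg, weights)]


TOTAL_STEPS = 10000  # hypothetical; the PR summary does not give a step count
snapshot_steps = [s for s in range(TOTAL_STEPS + 1) if swa_should_snapshot(s, TOTAL_STEPS)]
```

The incremental form keeps only one extra copy of the weights in memory, regardless of how many snapshots are averaged.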

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"stride":64,"context_length":1984}
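
A stride of 64 with a context length of 1984 suggests that each window scores only its last 64 tokens, with up to 1984 preceding tokens as context (2048 in total). The sketch below generates window boundaries under that assumption. The function name and the tuple layout are illustrative.

```python
def sliding_windows(n_tokens, stride=64, context_length=1984):
    """Yield (window_start, score_start, score_end) index triples."""
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        window_start = max(0, score_start - context_length)  # early windows have less context
        yield window_start, score_start, score_end


windows = list(sliding_windows(n_tokens=5000))
```

Every token is scored exactly once, and all but the earliest tokens are scored with nearly full context, at the cost of roughly 32 forward passes per 2048 tokens.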

Initialization

OrthoInit

Orthogonal initialization applied to all non-zero-init linear layers.
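
For illustration, an orthogonal weight matrix can be built by orthonormalizing random Gaussian rows. The sketch below uses modified Gram-Schmidt in plain Python; a PyTorch training script would more likely call `torch.nn.init.orthogonal_`, and the PR summary does not say which routine is used.

```python
import math
import random


def orthogonal_rows(n_rows, n_cols, seed=0):
    """Return n_rows mutually orthogonal unit vectors of length n_cols (n_rows <= n_cols)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for u in rows:  # remove components along rows already chosen
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


W = orthogonal_rows(4, 6)
```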

Sequence Length

sequence_length

train_length: 2048

eval_length: 2048

LR Schedule

warmdown

parameters: {"warmdown_iters":3000,"warmup_steps":20}
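
A warmdown schedule typically holds the base learning rate and then decays it over the final steps. The sketch below assumes a linear decay to zero over the last 3000 steps. The total step count is hypothetical, and `warmup_steps` is left out because the summary does not describe its role.

```python
def lr_multiplier(step, total_steps, warmdown_iters=3000):
    """Scale factor applied to each base learning rate at a given step."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0  # hold the base rate before the warmdown begins
    return max(0.0, (total_steps - step) / warmdown_iters)  # linear decay to zero


TOTAL_STEPS = 10000  # hypothetical; not given in the PR summary
print(lr_multiplier(8500, TOTAL_STEPS))  # 0.5
```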

Regularization

weight decay

parameters: {"value":0.02}

Other

other

U-Net style skip connections with learnable per-layer per-dimension skip weights.

parameters: null
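
The skip-connection description can be sketched as follows: activations from the first half of the blocks are saved and added, scaled by a learned per-dimension weight, to the inputs of the mirrored blocks in the second half. The pairing order and all names here are assumptions for illustration.

```python
def unet_forward(x, blocks, skip_weights):
    """Run blocks with first-half outputs added back into the second half."""
    half = len(blocks) // 2
    saved = []
    for block in blocks[:half]:
        x = block(x)
        saved.append(x)  # remember encoder-side activations
    for block, w in zip(blocks[half:], skip_weights):
        skip = saved.pop()  # last saved pairs with first decoder block
        x = [xi + wi * si for xi, wi, si in zip(x, w, skip)]  # per-dimension learned weight
        x = block(x)
    return x


# Toy blocks that add a constant, so the data flow is easy to follow.
blocks = [lambda v, k=k: [a + k for a in v] for k in (1.0, 2.0, 3.0, 4.0)]
skip_weights = [[0.5, 0.0], [1.0, 1.0]]  # one weight vector per decoder block
out = unet_forward([0.0, 0.0], blocks, skip_weights)
```

In a real model the weight vectors would be trainable parameters, one per second-half layer with one entry per hidden dimension.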

Novel Contributions

Int6 straight-through estimator quantization during training
SmearGate token-to-predecessor embedding blending
Wider 3x MLP enabled by quantization savings
Orthogonal initialization across non-zero-init linear layers
Longer 2048-token training context with RoPE base 50K
Frequent SWA checkpoint averaging every 100 steps
Sliding-window evaluation with stride 64
U-Net skip connections in the model