PR #206

open

Record: Int6 STE + SmearGate + Seq2048 + OrthoInit + RoPE50K + SWA/100 (mean val_bpb=1.1507)

by dexhunter
val_bpb
1.1507
Architecture
Transformer
Optimizer
NorMuon
Artifact Size
14.79MB

Training Techniques

Quantization
int6 STE QAT
bits: 6
scope: all weights except fp16 tied embedding
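As a rough illustration of int6 quantization-aware training with a straight-through estimator, the sketch below fake-quantizes one weight row in the forward pass and passes gradients through unchanged in the backward pass. The symmetric range [-31, 31], the per-row scale, and the function names are assumptions for illustration; the PR summary does not specify them.

```python
def fake_quant_int6(row):
    """Forward pass: quantize one weight row to 6-bit codes, then dequantize."""
    qmax = 31  # assumed symmetric int6 range [-31, 31]
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # assumed per-row scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return [c * scale for c in codes], codes


def ste_backward(grad_output):
    """Straight-through estimator: treat rounding as identity for gradients."""
    return list(grad_output)


weights = [0.31, -0.12, 0.05, -0.31]
dequantized, codes = fake_quant_int6(weights)
grads = ste_backward([0.1, -0.2, 0.3, 0.0])
```

The model trains on `dequantized` values, so it sees the rounding error it will face after export, while the optimizer still updates the full-precision weights.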
Architecture
SmearGate
Learned gate blends token embeddings with predecessor representations.
parameters: {"params":512}
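One plausible reading of SmearGate is a per-dimension sigmoid gate that mixes each token's embedding with the previous token's. The sketch below assumes that form; the exact gating formula, the treatment of the first token, and the guess that the 512 parameters are one gate per embedding dimension are not confirmed by the summary.

```python
import math


def smear_gate(embeddings, gate_logits):
    """Blend each token embedding with its predecessor using per-dimension gates."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]  # sigmoid
    out = []
    for t, x in enumerate(embeddings):
        # Assumption: the first token has no predecessor and is left unchanged.
        prev = embeddings[t - 1] if t > 0 else x
        out.append([(1.0 - g) * xi + g * pi for g, xi, pi in zip(gates, x, prev)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
blended = smear_gate(tokens, gate_logits=[0.0, 0.0])  # logit 0 gives gate 0.5
```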
MLP3x
Wider MLP layers enabled by int6 compression savings.
parameters: {"hidden_size":1536}
RoPE
Rotary position embeddings with adjusted base frequency for longer context.
parameters: {"base":50000}
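For reference, a minimal rotary embedding sketch with base 50000 follows. A larger base lowers the rotation frequencies, so angles grow more slowly across a 2048-token context. The interleaved pairing of dimensions and the function names are assumptions; implementations differ on the pairing convention.

```python
import math


def rope_inv_freq(head_dim, base=50000.0):
    """One inverse frequency per pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, base=50000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position * frequency."""
    out = []
    for i, freq in enumerate(rope_inv_freq(len(vec), base)):
        angle = position * freq
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * cos_a - y * sin_a, x * sin_a + y * cos_a]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Attention scores depend only on the relative offset (here 2 in both cases).
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 2), apply_rope(k, 0))
```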
tied embeddings
Input and output embeddings are tied; embedding tensor is kept in fp16 and not quantized.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
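With 8 query heads and 4 KV heads, each KV head is shared by two query heads. The mapping below assumes the common contiguous grouping (heads 0 and 1 share KV head 0, and so on); the helper name is hypothetical.

```python
def kv_head_for_query_head(q_head, num_heads=8, num_kv_heads=4):
    """Map a query head index to the KV head it reads from."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
```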
Optimizer
NorMuon
weight_decay: 0.02
momentum: 0.99
other_params: {"beta2":0.95,"warmup_start":0.92,"matrix_lr":0.021,"scalar_lr":0.02,"tied_embed_lr":0.03}
Weight Averaging
SWA
parameters: {"every_steps":100,"start_fraction_of_warmdown":0.5}
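The SWA settings can be read as: start averaging halfway through the warmdown and take a snapshot every 100 steps. The sketch below assumes that reading, an equal-weight average, and a hypothetical `total_steps` of 10000 (the summary does not give the run length); `warmdown_iters=3000` comes from the LR schedule entry.

```python
def swa_snapshot_steps(total_steps, warmdown_iters=3000,
                       every_steps=100, start_fraction=0.5):
    """Steps at which a checkpoint is folded into the average."""
    warmdown_start = total_steps - warmdown_iters
    swa_start = warmdown_start + int(start_fraction * warmdown_iters)
    return [s for s in range(swa_start, total_steps + 1) if s % every_steps == 0]


def running_average(avg, new, count):
    """Fold one more snapshot into an equal-weight mean of `count` snapshots."""
    return [(a * count + w) / (count + 1) for a, w in zip(avg, new)]


steps = swa_snapshot_steps(total_steps=10000)  # hypothetical run length
avg = running_average([1.0, 4.0], [3.0, 0.0], count=1)
```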
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1984}
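The stride and context length are consistent with a 2048-token window that advances 64 tokens at a time, since 2048 - 64 = 1984. The sketch below lists the windows under that assumption, with the first window scoring all of its tokens and each later window scoring only its newest tokens. The function name and tuple layout are illustrative.

```python
def sliding_windows(num_tokens, window=2048, stride=64):
    """Return (start, end, score_from): tokens score_from..end-1 are scored."""
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(num_tokens=2048 + 128)
```

Every token is scored exactly once, and every token after the first window has at least 1984 tokens of preceding context.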
Initialization
OrthoInit
Orthogonal initialization applied to all non-zero-init linear layers.
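To show what orthogonal initialization produces, here is a dependency-free sketch that orthonormalizes random Gaussian rows with modified Gram-Schmidt. It is illustrative only: frameworks usually use a QR decomposition (PyTorch provides `torch.nn.init.orthogonal_`), and the summary does not say which routine the PR calls.

```python
import random


def orthogonal_init(rows, cols, seed=0):
    """Return a rows x cols matrix with orthonormal rows (needs rows <= cols)."""
    if rows > cols:
        raise ValueError("this sketch requires rows <= cols")
    rng = random.Random(seed)
    basis = []
    while len(basis) < rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        for b in basis:  # remove the component along each earlier row
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:  # skip the very unlikely degenerate draw
            basis.append([x / norm for x in v])
    return basis


W = orthogonal_init(4, 8)
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in W] for r1 in W]
```

Here `gram` is approximately the identity matrix, which is the property that keeps activation norms stable at the start of training.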
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
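A minimal sketch of a warmup plus warmdown schedule with these parameters follows. It assumes linear ramps, a constant plateau in between, and a hypothetical `total_steps` of 10000; the summary gives only `warmup_steps` and `warmdown_iters`, so the shapes are guesses.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # assumed linear warmup
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        # assumed linear decay to zero over the final warmdown_iters steps
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0


total = 10000  # hypothetical run length
matrix_lr_midway = 0.021 * lr_scale(8500, total)  # half of matrix_lr
```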
Regularization
weight decay
parameters: {"value":0.02}
Other
other
U-Net style skip connections with learnable per-layer per-dimension skip weights.
parameters: null
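The skip connections can be pictured as follows: the first half of the layers save their outputs, and the second half add them back in reverse order, each scaled by a learned per-dimension weight vector. The pairing order, the point where the skip is added, and all names below are assumptions for illustration.

```python
def unet_forward(x, blocks, skip_weights):
    """Run blocks, adding saved encoder outputs into mirrored decoder layers."""
    half = len(blocks) // 2
    saved = []
    for block in blocks[:half]:
        x = block(x)
        saved.append(x)
    for i, block in enumerate(blocks[half:]):
        if saved:
            skip = saved.pop()  # last encoder output pairs with first decoder layer
            x = [xi + w * si for xi, w, si in zip(x, skip_weights[i], skip)]
        x = block(x)
    return x


def add_one(v):
    """Stand-in for a transformer block."""
    return [e + 1.0 for e in v]


# Two decoder layers, each with its own per-dimension skip weights.
result = unet_forward([0.0, 0.0], [add_one] * 4, [[1.0, 0.0], [0.5, 0.5]])
```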

Novel Contributions

  • Int6 straight-through estimator quantization during training
  • SmearGate token-to-predecessor embedding blending
  • Wider 3x MLP enabled by quantization savings
  • Orthogonal initialization across non-zero-init linear layers
  • Longer 2048-token training context with RoPE base 50K
  • Frequent SWA checkpoint averaging every 100 steps
  • Sliding-window evaluation with stride 64
  • U-Net style skip connections with learnable per-layer, per-dimension skip weights