PR #326

open

Non-Record: QAT + NTK-4096 Eval + Cosine Warmdown + Aggressive SWA (val_bpb=1.2890, 1xH100)

by crony-io
val_bpb
1.2890
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
MLP3x
Increased model capacity to 10 layers with 3x MLP expansion (hidden=1536).
parameters: {"layers":10,"hidden":1536}
SmearGate
Learned gate to blend consecutive token embeddings for better local context.
parameters: null
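A sketch of how a smear gate could blend each token with its predecessor; since no parameters are recorded, the per-token sigmoid gate is an assumption:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each token's embedding with the previous token's via a learned gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); shift right to get previous-token embeddings
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate(x))  # (batch, seq, 1), per-token mix weight
        return (1 - g) * x + g * prev
```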
BigramHash
Added a bigram hash embedding for local token-pair context.
parameters: {"buckets":10240,"dimensions":128}
RoPE
NTK-aware RoPE rescaling for longer evaluation context.
parameters: {"eval_length":4096}
skip connections
Introduced learnable U-Net style skip connections.
parameters: null
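One common way to wire learnable U-Net style skips in a depth-10 stack; the push/pop pairing and zero-initialized scalar weights are assumptions:

```python
import torch
import torch.nn as nn

class UNetSkips(nn.Module):
    """First half of blocks push activations; second half pop and mix them back."""
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        assert len(blocks) % 2 == 0
        self.blocks = blocks
        self.skip_w = nn.Parameter(torch.zeros(len(blocks) // 2))  # learnable mixes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half, stack = len(self.blocks) // 2, []
        for i, block in enumerate(self.blocks):
            if i < half:
                stack.append(x)
            else:
                x = x + self.skip_w[i - half] * stack.pop()
            x = block(x)
        return x
```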
Initialization
Orthogonal init
Orthogonal weight initialization with gain-scaled projections; the sigmoid residual mix is initialized at its phase-transition point.
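Roughly what this init amounts to; the parameter-name matching, the 0.5 projection gain, and the 0.05 starting mix value are assumptions layered on the description above:

```python
import math
import torch
import torch.nn as nn

@torch.no_grad()
def init_orthogonal(model: nn.Module, proj_gain: float = 0.5) -> None:
    """Orthogonal init on all matrices, quieter gains on projections, gates start low."""
    for name, p in model.named_parameters():
        if p.dim() >= 2:
            gain = proj_gain if ("proj" in name or "down" in name) else 1.0
            nn.init.orthogonal_(p, gain=gain)
        elif "mix" in name or "gate" in name:
            # sigmoid(bias) == 0.05: the residual mix starts near the identity side
            nn.init.constant_(p, math.log(0.05 / 0.95))
```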
LR Schedule
cosine warmdown
parameters: null
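No parameters are recorded, so here is a generic cosine warmdown: hold the peak LR flat, then cosine-decay to zero over the final stretch of training (the 50% warmdown fraction is an assumption):

```python
import math

def cosine_warmdown(step: int, total_steps: int, max_lr: float,
                    warmdown_frac: float = 0.5) -> float:
    """Flat LR, then a cosine decay to zero over the last warmdown_frac of steps."""
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return max_lr
    progress = (step - start) / max(total_steps - start, 1)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```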
Weight Averaging
SWA
parameters: {"start_fraction":0.35,"every_steps":25,"checkpoints_averaged":48}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"gradient_clipping":0.3,"momentum_warmup":{"start":0.92,"end":0.99,"steps":1500}}
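The momentum warmup in other_params is straightforward to reproduce; the linear ramp is an assumption (any monotone schedule fits the recorded endpoints):

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Ramp Muon's momentum from 0.92 to 0.99 over 1500 steps, then hold."""
    t = min(step / warmup_steps, 1.0)
    return start + (end - start) * t

# Per step, alongside the recorded gradient clipping:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 0.3)
#   for group in optimizer.param_groups:
#       group["momentum"] = muon_momentum(step)
```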
Quantization
STE QAT
bits: 5
scope: MLPs and Attention
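The core of STE fake quantization at 5 bits; per-tensor symmetric scaling is an assumption (the mixed int5/int6 scheme listed under Novel Contributions would just vary `bits` per layer type):

```python
import torch

def fake_quant_ste(w: torch.Tensor, bits: int = 5) -> torch.Tensor:
    """Forward sees quantized weights; backward passes gradients straight through."""
    qmax = 2 ** (bits - 1) - 1                              # 15 for int5
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax   # per-tensor symmetric
    q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()  # straight-through estimator
```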
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
Compression
lzma
level: PRESET_EXTREME
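Stdlib usage, with PRESET_EXTREME as listed under Novel Contributions (file paths are placeholders):

```python
import lzma

def compress_artifact(src: str, dst: str) -> None:
    """xz-compress the model artifact at maximum effort."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME))
```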
Regularization
weight decay
parameters: {"weight_decay":0.04}
magnitude pruning
parameters: {"pruned_fraction":0.05}

Novel Contributions

  • Quantization-aware training with straight-through estimator fake quantization instead of post-training quantization
  • Mixed int5/int6 quantization scheme for different layer types
  • NTK-aware RoPE rescaling for 4096-length evaluation
  • Sliding-window evaluation with stride 64
  • Cosine warmdown learning rate schedule
  • Aggressive stochastic weight averaging starting at 35% of warmdown
  • SmearGate local context mixing
  • BigramHash embedding for token-pair context
  • Learnable U-Net style skip connections
  • Orthogonal initialization with phase-transition residual mix
  • 5% magnitude pruning
  • lzma compression with PRESET_EXTREME