PR #324 (closed)

[Non-Record] QAT + NTK-4096 Eval + Cosine Warmdown + Aggressive SWA

by crony-io
val_bpb: 1.1702
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,875,110 bytes

Training Techniques

Architecture
MLP3x
Increased model capacity to 10 layers with 3x MLP expansion (hidden=1536).
parameters: {"layers":10,"hidden":1536,"mlp_expansion":3}
SmearGate
Learned gate to blend consecutive token embeddings for better local context.
parameters: null
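A minimal numpy sketch of the SmearGate idea. The exact gating form isn't given in the PR, so the sigmoid gate and the convex blend below are assumptions; `w_gate` stands in for the learned gate parameters.

```python
import numpy as np

def smear_gate(x, w_gate):
    """Blend each token's embedding with its predecessor's, weighted by a
    learned per-position gate. x: (seq, dim), w_gate: (dim,)."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))       # sigmoid gate in [0, 1]
    prev = np.concatenate([x[:1], x[:-1]], axis=0)   # previous token (position 0 reuses itself)
    return (1.0 - gate)[:, None] * x + gate[:, None] * prev
```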
BigramHash
Added a 10240-bucket bigram hash embedding with dimension 128.
parameters: {"buckets":10240,"dimensions":128}
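A sketch of how a bigram hash embedding with the PR's parameters (10240 buckets, dimension 128) could work. The hash mixing function itself is illustrative; the PR does not specify it.

```python
import numpy as np

BUCKETS, DIM = 10240, 128  # parameters from the PR

def bigram_hash_embedding(token_ids, table):
    """Map each (previous, current) token pair to one of BUCKETS buckets and
    look up a learned DIM-dimensional embedding for it."""
    ids = np.asarray(token_ids)
    prev = np.concatenate([[0], ids[:-1]])        # pad position 0 with token 0
    h = (prev * 1_000_003 + ids) % BUCKETS        # cheap multiplicative mix (assumed)
    return table[h]                               # (seq, DIM), added to the token embeddings
```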
skip connections
Introduced learnable U-Net style skip connections.
parameters: null
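One plausible reading of "learnable U-Net style skip connections", sketched in numpy: a later layer's activations receive the matching earlier layer's, scaled by a learned scalar. The sigmoid parameterization is an assumption.

```python
import numpy as np

def unet_skip_mix(deep, shallow, lam):
    """Add an earlier layer's activations into a later layer's, with the mix
    weight sigmoid(lam) learned per skip so it stays in (0, 1)."""
    g = 1.0 / (1.0 + np.exp(-lam))
    return deep + g * shallow
```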
RoPE
NTK-aware RoPE frequency rescaling for longer evaluation context.
parameters: {"context_length":4096}
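A sketch of the standard NTK-aware rescaling: the RoPE base is raised so the lowest frequencies stretch to cover the longer eval context while the highest stay near their trained values. The PR only states `context_length: 4096`; the training context of 1024 below is an assumption.

```python
import numpy as np

def ntk_rope_inv_freqs(head_dim, base=10000.0, train_ctx=1024, eval_ctx=4096):
    """Inverse frequencies for NTK-aware RoPE at a longer eval context.
    train_ctx=1024 is assumed; the PR only gives eval context_length=4096."""
    scale = eval_ctx / train_ctx
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return new_base ** (-np.arange(0, head_dim, 2) / head_dim)
```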
Initialization
OrthoInit
Orthogonal weight initialization with gain-scaled projections and phase-transition sigmoid init for residual mix.
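The orthogonal-init part can be sketched as QR of a Gaussian matrix with a gain rescale; the PR's phase-transition sigmoid init for the residual mix is a separate detail not shown here.

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal init: QR-decompose a Gaussian matrix and keep Q, with
    `gain` rescaling the projection. Assumes shape[0] >= shape[1]."""
    a = np.random.default_rng(seed).standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # fix column signs for a deterministic result
    return gain * q
```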
LR Schedule
cosine warmdown
parameters: {"formula":"0.5 * (1 + cos(pi*t))"}
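The listed formula, as a learning-rate multiplier with t running from 0 to 1 over training:

```python
import math

def cosine_warmdown(step, total_steps, base_lr):
    """LR from the PR's formula 0.5 * (1 + cos(pi*t)), t = step / total_steps:
    decays smoothly from base_lr at t=0 to 0 at t=1."""
    t = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```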
Weight Averaging
SWA
parameters: {"start_fraction":0.35,"every_steps":25,"checkpoints_averaged_best_run":48}
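The SWA update is a running average over checkpoints; with start_fraction=0.35 and every_steps=25, this run folded in 48 checkpoints. A minimal sketch:

```python
import numpy as np

def swa_update(avg, new, n_averaged):
    """Fold one more checkpoint into the running weight average.
    avg, new: lists of parameter tensors; n_averaged: checkpoints so far."""
    return [(a * n_averaged + p) / (n_averaged + 1) for a, p in zip(avg, new)]
```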
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"gradient_clipping":0.3,"momentum_warmup_end":0.99,"momentum_warmup_steps":1500}
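A sketch of the momentum warmup implied by `other_params`, assuming the listed momentum (0.92) is the starting value, `momentum_warmup_end` (0.99) the final one, and the ramp linear over 1500 steps; none of those readings are confirmed by the PR.

```python
def muon_momentum(step, base=0.92, warmup_end=0.99, warmup_steps=1500):
    """Linear momentum warmup from base to warmup_end over warmup_steps,
    then held constant (parameterization assumed)."""
    t = min(step / warmup_steps, 1.0)
    return base + t * (warmup_end - base)
```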
Quantization
STE QAT
bits: 5
scope: MLPs
STE QAT
bits: 6
scope: Attention
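The forward pass of STE fake quantization can be sketched as symmetric per-tensor round-to-grid (bits=5 for MLPs, 6 for attention in this run). The straight-through part lives in the backward pass, where gradients are copied through `round()` unchanged; numpy can only show the forward side.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor fake quantization: snap weights to a
    2**bits-level integer grid and return to float."""
    qmax = 2 ** (bits - 1) - 1                       # 15 for int5, 31 for int6
    m = np.abs(w).max()
    scale = m / qmax if m > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```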
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
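With stride 64 and context 4096, sliding-window eval scores each token exactly once while giving it close to a full window of preceding context. A sketch of the window planning:

```python
def sliding_window_spans(n_tokens, context_length=4096, stride=64):
    """Return (window_start, score_start, window_end) spans: each window
    scores only tokens [score_start, window_end), advancing stride tokens
    at a time, so every token is predicted with near-full left context."""
    spans, pos = [], 0
    while pos < n_tokens:
        begin = max(0, pos + stride - context_length)  # context start
        end = min(pos + stride, n_tokens)
        spans.append((begin, pos, end))
        pos = end
    return spans
```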
Compression
lzma
level: null
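The level is unspecified here, but the Novel Contributions list names PRESET_EXTREME; assuming the strongest stdlib setting, the artifact compression reduces to:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # Preset 9 | PRESET_EXTREME is assumed; the submission only names
    # "lzma PRESET_EXTREME" and leaves the level null.
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
```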
Regularization
magnitude pruning
parameters: {"prune_fraction":0.05}
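A per-tensor sketch of 5% magnitude pruning: zero the smallest-magnitude weights. Whether the run pruned per tensor or globally is not stated; ties at the threshold may zero slightly more than the exact fraction.

```python
import numpy as np

def magnitude_prune(w, prune_fraction=0.05):
    """Zero the smallest-magnitude prune_fraction of entries in w."""
    k = int(prune_fraction * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```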

Novel Contributions

  • Quantization-aware training with straight-through estimator fake quantization
  • Mixed int5/int6 quantization scheme
  • Cosine warmdown schedule
  • Aggressive SWA with frequent checkpoint averaging
  • SmearGate local-context blending
  • BigramHash embedding
  • Learnable U-Net style skip connections
  • NTK-aware RoPE rescaling for 4096-token evaluation
  • Sliding-window evaluation with stride 64
  • 5% magnitude pruning
  • lzma PRESET_EXTREME compression