PR #2159

open

SwiGLU Gating, QAT, Residual and Attention Scaling, EMA, Sliding Window: Optimizations for 10min/16MB Track

by visin109
val_bpb: 1.5990
Architecture: Transformer
Optimizer: Muon
Artifact Size: 7.3 MB

Training Techniques

Quantization
QAT
bits: 8
scope: all linear weights
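
The QAT entry above, together with the STE-based training mentioned under Novel Contributions, can be illustrated with a toy scalar example. This is a minimal sketch, not the PR's code: the helper names `fake_quant` and `ste_step`, the fixed scale, and the squared-error loss are all assumptions made for illustration. The forward pass uses the int8-rounded weight, while the backward pass treats the rounding as the identity (the straight-through estimator), so the full-precision weight keeps receiving gradient.

```python
def fake_quant(w, scale):
    # Symmetric int8 fake quantization: round to the grid, clamp, dequantize.
    q = max(-127, min(127, round(w / scale)))
    return q * scale


def ste_step(w, x, y, scale, lr):
    # Forward pass sees the quantized weight.
    w_q = fake_quant(w, scale)
    err = w_q * x - y
    grad_wq = 2.0 * err * x  # d(loss)/d(w_q) for loss = err ** 2
    # Straight-through estimator: pretend d(w_q)/d(w) == 1.
    grad_w = grad_wq
    return w - lr * grad_w


# Hypothetical toy problem: learn w so that fake_quant(w) * 1.0 == 0.5.
w = 0.0
for _ in range(200):
    w = ste_step(w, x=1.0, y=0.5, scale=0.1, lr=0.05)
```

In an autograd framework the same effect is commonly written as `w + (fake_quant(w) - w).detach()`; whether the PR uses that exact form is not stated in the listing.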
Architecture
SwiGLU
Replaced standard ReLU² MLP with a gated SwiGLU-style MLP.
parameters: null
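
To make the difference between the two MLP variants concrete, here is a small pure-Python sketch. The weight names (`W_gate`, `W_up`, `W_down`), the toy shapes, and the absence of biases are assumptions; the listing does not give the PR's hidden width or exact layout.

```python
import math


def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu2_mlp(x, W_up, W_down):
    # Baseline named in the listing: ReLU squared between two projections.
    hidden = [max(0.0, h) ** 2 for h in matvec(W_up, x)]
    return matvec(W_down, hidden)


def swiglu_mlp(x, W_gate, W_up, W_down):
    # Gated variant: SiLU(gate projection) multiplies the up projection.
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


# Hypothetical toy weights: model width 2, hidden width 3.
x = [1.0, -2.0]
W_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
y = swiglu_mlp(x, W_gate, W_up, W_down)
```

Note that the gated form needs three weight matrices instead of two, so at a fixed parameter budget the hidden width is usually reduced to compensate.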
residual mixing
Learnable blending between current activation and skip input.
parameters: null
attention scaling
Learnable per-head query scaling factor q_gain.
parameters: {"num_heads":4}
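
A rough sketch of what a per-head query gain does to attention weights follows. The function names, the initial gain of 1.0, and applying the gain before the usual 1/sqrt(d) scaling are assumptions; only the name `q_gain` and the head count of 4 come from the listing.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention_weights(q, keys, gain):
    # gain is the learnable scalar for this head; it rescales the query
    # before the scaled dot product, which changes the softmax temperature.
    d = len(q)
    scaled_q = [gain * qi for qi in q]
    logits = [sum(a * b for a, b in zip(scaled_q, k)) / math.sqrt(d) for k in keys]
    return softmax(logits)


q_gain = [1.0, 1.0, 1.0, 1.0]  # one scalar per head (num_heads = 4); init assumed
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
base = head_attention_weights(q, keys, q_gain[0])
sharp = head_attention_weights(q, keys, 4.0)  # a larger gain sharpens attention
```

A gain above 1 concentrates probability on the best-matching key, and a gain below 1 flattens the distribution, so each head can learn its own sharpness.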
per-channel residual scaling
Learnable per-channel scaling for attention and MLP outputs.
parameters: null
GQA
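Attention uses the grouped-query formulation; with num_kv_heads equal to num_heads (4 each), this is equivalent to standard multi-head attention.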
parameters: {"num_heads":4,"num_kv_heads":4}
RoPE
Applied rotary positional embeddings to queries and keys.
parameters: {"base":10000}
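
As a reference for the rotary embedding entry, the sketch below rotates consecutive pairs of a query or key vector by position-dependent angles with base 10000. The interleaved pairing convention is an assumption; some implementations pair the first half of the vector with the second half instead.

```python
import math


def rope(vec, pos, base=10000.0):
    # Rotate each pair (vec[i], vec[i + 1]) by the angle pos * base ** (-i / d),
    # where i runs over even indices and d is the (even) vector length.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# The attention logit depends only on the relative offset (here 2 in both cases).
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 7), rope(k, 5))
```

Because each pair is only rotated, vector norms are unchanged, and the query-key dot product depends on the position difference rather than on absolute positions.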
U-Net skip connections
Encoder-decoder style skip reuse across layers.
parameters: {"layers":8,"encoder_layers":4,"decoder_layers":4}
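
One plausible reading of the encoder-decoder skip reuse is a stack: each encoder layer pushes its output, and each decoder layer pops the most recent one and adds it to its input. The sketch below assumes exactly that, with unweighted addition and toy scalar "layers"; the PR may scale the skips or combine them differently.

```python
def skip_pairs(num_layers=8, num_encoder=4):
    # Which encoder output feeds which decoder layer under last-in, first-out reuse.
    decoders = range(num_encoder, num_layers)
    encoders = reversed(range(num_encoder))
    return list(zip(decoders, encoders))


def unet_forward(x, blocks, num_encoder):
    skips = []
    for block in blocks[:num_encoder]:
        x = block(x)
        skips.append(x)  # remember the encoder output
    for block in blocks[num_encoder:]:
        x = block(x + skips.pop())  # add the matching encoder output
    return x


# Toy stand-in for 8 transformer layers: each "layer" just adds 1.
blocks = [lambda v: v + 1 for _ in range(8)]
out = unet_forward(0, blocks, num_encoder=4)
pairs = skip_pairs()
```

With 4 encoder and 4 decoder layers, the first decoder layer receives the last encoder output and the last decoder layer receives the first.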
Weight Averaging
EMA
parameters: {"decay":0.99995}
SWA
parameters: {"start_step":840}
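
The two averaging schemes can be sketched as follows. The listing gives only the EMA decay and the SWA start step; how the PR combines the two averages, and how often each is updated, is not stated, so the update cadence here is an assumption.

```python
def ema_update(avg, params, decay=0.99995):
    # Exponential moving average of the parameters.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]


class SWA:
    """Uniform running mean of parameters from start_step onward."""

    def __init__(self, start_step=840):
        self.start_step = start_step
        self.count = 0
        self.avg = None

    def update(self, step, params):
        if step < self.start_step:
            return  # averaging has not started yet
        if self.avg is None:
            self.avg = list(params)
        else:
            n = self.count + 1
            self.avg = [a + (p - a) / n for a, p in zip(self.avg, params)]
        self.count += 1


ema = ema_update([0.0], [1.0])
swa = SWA(start_step=840)
for step, value in [(839, 10.0), (840, 1.0), (841, 3.0)]:
    swa.update(step, [value])
```

A decay of 0.99995 corresponds to an averaging horizon of roughly 1 / (1 - 0.99995) = 20,000 updates, so the EMA moves very slowly relative to the raw weights.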
Optimizer
Muon
weight_decay: null
momentum: 0.85
other_params: {"momentum_end":0.95,"warmup_steps":500}
Adam
weight_decay: null
momentum: null
other_params: {"parameter_groups":["embeddings","scalar/vector params","LM head"]}
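
The optimizer settings above suggest two mechanisms: a Muon momentum ramp from 0.85 to 0.95 over 500 steps, and a split of parameters between Muon and Adam. The sketch below assumes a linear ramp and a name-and-shape routing rule; the parameter names are hypothetical and the PR's actual grouping logic may differ.

```python
def muon_momentum(step, start=0.85, end=0.95, warmup_steps=500):
    # Linear interpolation from start to end, then held constant (assumed shape).
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end


def assign_optimizer(name, shape):
    # Adam handles embeddings, scalar/vector parameters, and the LM head,
    # matching the groups in the listing; remaining matrices go to Muon.
    if "embed" in name or "lm_head" in name or len(shape) < 2:
        return "adam"
    return "muon"


# Hypothetical parameter names and shapes, for illustration only.
params = {
    "tok_embed.weight": (1024, 64),
    "blocks.0.attn.q_gain": (4,),
    "blocks.0.mlp.w_gate": (256, 64),
    "lm_head.weight": (1024, 64),
}
groups = {name: assign_optimizer(name, shape) for name, shape in params.items()}
```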
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride_fraction":0.25}
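
Sliding-window evaluation with a stride fraction of 0.25 can be sketched as below. The window length of 128 is an assumption borrowed from the training length, since eval_length is null in the listing. Each window after the first scores only its newest tokens, so every scored token sees at least window minus stride tokens of context. The conversion to bits per byte is the standard one from summed negative log-likelihood in nats.

```python
import math


def sliding_windows(num_tokens, window=128, stride_fraction=0.25):
    # Returns (ctx_start, ctx_end, score_start, score_end) tuples, end-exclusive.
    stride = max(1, int(window * stride_fraction))
    spans, start, scored_upto = [], 0, 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto, end))
        scored_upto = end
        start += stride
    return spans


def bits_per_byte(total_nll_nats, total_bytes):
    # Convert summed negative log-likelihood (nats) to bits per byte of text.
    return total_nll_nats / (math.log(2) * total_bytes)


spans = sliding_windows(200)
```

Smaller stride fractions give each token more context at the cost of more forward passes: with a fraction of 0.25, every token is processed about four times.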
LR Schedule
warmdown
parameters: {"warmup_iters":20,"warmdown_iters":700}
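
A schedule with 20 warmup iterations and 700 warmdown iterations might look like the sketch below. The linear shapes, the decay to zero, and the step-based `total_steps` argument are assumptions; a time-limited run could just as well trigger the warmdown from elapsed wall-clock time.

```python
def lr_scale(step, total_steps, warmup_iters=20, warmdown_iters=700):
    # Multiplier applied to the base learning rate.
    if step < warmup_iters:
        return (step + 1) / warmup_iters  # linear warmup
    if step >= total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)  # linear warmdown
    return 1.0  # constant plateau


# Hypothetical run length of 1000 steps, for illustration only.
schedule = [lr_scale(s, total_steps=1000) for s in (0, 19, 100, 650, 1000)]
```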
Sequence Length
sequence_length
train_length: 128
eval_length: null
Regularization
logit softcap
parameters: {"value":30}
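
Logit softcapping is commonly implemented as a scaled tanh, which keeps small logits nearly unchanged while bounding large ones. The sketch below assumes that form with the listed value of 30; the listing does not say where in the model the cap is applied.

```python
import math


def softcap(logits, cap=30.0):
    # Smoothly squash logits so their magnitude never exceeds cap.
    return [cap * math.tanh(z / cap) for z in logits]


capped = softcap([0.0, 1.0, -1000.0, 1000.0])
```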

Novel Contributions

  • STE-based quantization-aware training for int8 export
  • SwiGLU gated MLP replacement
  • Learnable residual mixing and per-channel residual scaling
  • Learnable per-head attention query scaling
  • EMA plus SWA stabilization
  • Muon and Adam multi-optimizer parameter grouping
  • Encoder-decoder style skip connection reuse
  • Sliding-window validation for BPB estimation
  • Per-row/per-tensor int8 export with percentile clipping and small-tensor FP16 passthrough
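
The export path described in the last bullet, combined with the zlib compression listed above, can be sketched with the standard library alone. Everything specific here is an assumption: the 99.9 percentile, the 8-element passthrough threshold, the per-row layout (an fp32 scale followed by int8 values), the simplified percentile rule, and the default zlib level are illustrative choices, not values from the PR.

```python
import struct
import zlib


def percentile_abs(values, pct):
    # Simplified percentile of absolute values (index-based, no interpolation).
    ordered = sorted(abs(v) for v in values)
    idx = min(len(ordered) - 1, int(pct / 100.0 * len(ordered)))
    return ordered[idx]


def quantize_row(row, pct=99.9):
    # Clip at the chosen percentile so rare outliers do not inflate the scale.
    clip = percentile_abs(row, pct) or 1.0
    scale = clip / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def export_tensor(rows, pct=99.9, small_numel=8):
    numel = sum(len(r) for r in rows)
    if numel <= small_numel:
        # Small tensors skip quantization and are stored as float16.
        return "fp16", b"".join(struct.pack(f"<{len(r)}e", *r) for r in rows)
    parts = []
    for row in rows:
        q, scale = quantize_row(row, pct)
        parts.append(struct.pack("<f", scale) + struct.pack(f"<{len(q)}b", *q))
    return "int8", b"".join(parts)


# A row of values in [-1, 1] plus one large outlier.
row = [((i % 201) - 100) / 100.0 for i in range(1999)] + [1000.0]
q, scale = quantize_row(row)
kind, payload = export_tensor([row])
artifact = zlib.compress(payload)
```

With percentile clipping, the outlier saturates at 127 while the remaining values keep a fine quantization step of about 1/127, instead of sharing a step of about 1000/127.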