PR #2101

open

Record: AWQ-lite + AsymLogit + GradCentral + ... val_bpb=1.05845

by OnlyJundong
val_bpb: 1.0584
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: block weights
GPTQ
bits: 7
scope: embeddings
GPTQ-lite
bits: null
scope: tok_emb.weight
int8
bits: 8
scope: top-K salient groups
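
The quantization entries above combine a low-bit default with int8 for the top-K salient groups. A minimal sketch of that idea follows, under several assumptions that are not taken from the PR: salience is approximated as the sum of |weight| times a per-channel activation scale, plain round-to-nearest quantization stands in for GPTQ, and `mixed_precision_quantize`, `group_size`, and `top_k` are hypothetical names and values.

```python
def quantize_group(values, bits):
    """Symmetric round-to-nearest quantization; returns dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def mixed_precision_quantize(weights, act_scale, group_size, top_k,
                             low_bits=6, high_bits=8):
    """Quantize groups at low_bits, except the top_k most salient at high_bits."""
    n = len(weights)
    groups = [list(range(i, min(i + group_size, n)))
              for i in range(0, n, group_size)]
    # Assumed salience metric: activation-weighted magnitude of each group.
    salience = [sum(abs(weights[j]) * act_scale[j] for j in g) for g in groups]
    ranked = sorted(range(len(groups)), key=lambda i: salience[i], reverse=True)
    salient = set(ranked[:top_k])
    dequantized = []
    for gi, g in enumerate(groups):
        bits = high_bits if gi in salient else low_bits
        dequantized.extend(quantize_group([weights[j] for j in g], bits))
    return dequantized, salient


weights = [0.1, -0.2, 0.3, 0.05, 1.0, -0.9, 0.8, 0.7]
deq, salient = mixed_precision_quantize(weights, [1.0] * 8, group_size=4, top_k=1)
print(sorted(salient))
```

Keeping only a few groups at 8 bits spends the extra bytes where the rounding error would otherwise matter most, which is presumably how the artifact stays near the reported ~15.98 MB.
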
Architecture
SmearGate
Attention gating mechanism with BOS fix
parameters: {"window":12}
XSA
Applied across all layers
parameters: {"layers":11}
depth recurrence
Selected layers are looped multiple times during the forward pass
parameters: {"layers":[3,4,5],"loops":2}
Partial RoPE
Partial rotary position embeddings
parameters: {"dimensions":[16,64]}
LeakyReLU
MLP activation choice
parameters: {"slope":0.5}
weight tying
Tied embeddings
parameters: null
Gated Attention
Sparse/gated attention variants used in the model
parameters: null
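
The depth-recurrence parameters listed above ({"layers":[3,4,5],"loops":2}) can be read as a layer schedule. The sketch below assumes an 11-layer model (taken from the XSA entry's "layers":11), that the looped layers form one contiguous block, and that "loops":2 means the block runs twice in total rather than two extra times. `layer_schedule` is a hypothetical helper, not code from the PR.

```python
def layer_schedule(num_layers, looped, loops):
    """Return the order in which layer indices execute in one forward pass."""
    first, last = looped[0], looped[-1]
    order = list(range(first))               # layers before the looped block
    for _ in range(loops):
        order.extend(looped)                 # the looped block, repeated
    order.extend(range(last + 1, num_layers))  # layers after the block
    return order


print(layer_schedule(num_layers=11, looped=[3, 4, 5], loops=2))
# [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

Under these assumptions the forward pass executes 14 layer applications while storing the weights of only 11 layers.
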
Optimizer
Muon
weight_decay: null
momentum: 0.9
other_params: {"backend_steps":5,"grad_centralize":true}
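
Gradient centralization ("grad_centralize":true above) is commonly described as subtracting from each row of a weight gradient that row's mean. A minimal sketch follows, assuming the centralization runs before the momentum update and that momentum accumulates as `momentum * buf + grad`. Both are assumptions about this PR, and Muon's orthogonalization step (presumably what "backend_steps":5 configures) is left out. `centralize_gradient` and `momentum_step` are hypothetical names.

```python
def centralize_gradient(grad):
    """Subtract each row's mean from a 2-D gradient so every row sums to zero."""
    centered = []
    for row in grad:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


def momentum_step(buf, grad, momentum=0.9, grad_centralize=True):
    """Return the new momentum buffer; optionally centralize the gradient first."""
    if grad_centralize:
        grad = centralize_gradient(grad)
    return [[momentum * b + g for b, g in zip(buf_row, grad_row)]
            for buf_row, grad_row in zip(buf, grad)]


grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
print(centralize_gradient(grad))
# [[-1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```
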
Evaluation
long context eval
parameters: {"prefix_docs":2500,"num_phases":3}
Test-Time Training
score-first TTT
parameters: {"phases":3,"chunk_size":48,"lora_rank":80}
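
"Score-first" test-time training is generally understood to mean that each chunk is scored before the model adapts on it, so the reported loss never benefits from tokens the model has already trained on. The sketch below illustrates only that ordering. It assumes a toy add-one-smoothed unigram counter in place of the PR's LoRA adapters, and it ignores the phases; `score_first_ttt`, `score`, `adapt`, and `VOCAB_SIZE` are hypothetical.

```python
import math

VOCAB_SIZE = 2  # toy vocabulary size, chosen only for illustration


def score(counts, chunk):
    """Bits needed to code the chunk under the current (pre-update) counts."""
    total = sum(counts.values())
    return sum(-math.log2((counts.get(t, 0) + 1) / (total + VOCAB_SIZE))
               for t in chunk)


def adapt(counts, chunk):
    """Return updated counts after 'training' on the chunk."""
    updated = dict(counts)
    for t in chunk:
        updated[t] = updated.get(t, 0) + 1
    return updated


def score_first_ttt(tokens, chunk_size=48):
    """Score each chunk first, then adapt on it; return (bits per token, state)."""
    counts, total_bits = {}, 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_bits += score(counts, chunk)   # 1) score with the old state
        counts = adapt(counts, chunk)        # 2) only then adapt on the chunk
    return total_bits / len(tokens), counts


bits_per_token, _ = score_first_ttt([0] * 96, chunk_size=48)
print(bits_per_token)
```

Swapping the two numbered steps would let the model see a chunk before being graded on it, which is exactly what the score-first ordering rules out.
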
Regularization
label smoothing
parameters: {"label_smooth":0}
logit softcap
parameters: {"softcap":30,"asymmetric_eval":true}
layerwise LN scale
parameters: null
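
The logit softcap entry above, together with the "separate positive and negative softcap parameters" contribution, suggests a tanh cap whose limit depends on the sign of the logit. A minimal sketch, assuming the usual `cap * tanh(x / cap)` form; the name `asym_softcap` and the example negative cap of 20 are hypothetical, and only the value 30 comes from the listing.

```python
import math


def asym_softcap(logit, cap_pos=30.0, cap_neg=30.0):
    """Soft-cap a logit with separate limits for the positive and negative side."""
    cap = cap_pos if logit >= 0 else cap_neg
    return cap * math.tanh(logit / cap)


for x in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(x, asym_softcap(x, cap_pos=30.0, cap_neg=20.0))
```

With equal caps this reduces to the ordinary symmetric softcap. Because the slope at zero is 1 on both sides, the function stays smooth through the origin even when the two caps differ.
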

Novel Contributions

  • AWQ-lite integration on top of PR #1855
  • AsymLogit integration with separate positive and negative softcap parameters
  • Gradient centralization added as an optional Muon optimizer feature
  • Label smoothing added as an optional training regularizer
  • Extended eval-time support: the number of phased-TTT prefix docs and the number of phases are configurable
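
Label smoothing is listed above as an optional regularizer, with "label_smooth":0 in the recorded run. A minimal sketch of the standard formulation follows, which mixes the one-hot target with a uniform distribution over classes. Whether the PR uses exactly this form is an assumption, and `smoothed_cross_entropy` is a hypothetical name.

```python
import math


def smoothed_cross_entropy(logits, target, label_smooth=0.0):
    """Cross-entropy against (1 - eps) * one-hot + eps * uniform targets."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]                     # ordinary cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)   # loss against uniform targets
    return (1.0 - label_smooth) * nll + label_smooth * uniform


logits = [2.0, 0.0, 0.0]
print(smoothed_cross_entropy(logits, target=0, label_smooth=0.0))
print(smoothed_cross_entropy(logits, target=0, label_smooth=0.1))
```

With `label_smooth=0.0` the function reduces to plain cross-entropy, so a setting of 0 leaves the training loss unsmoothed.
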