PR #2097

closed

record: AWQ-lite + AsymLogit + GradCentr + LabSmooth - 1.05846 BPB

by OnlyJundongView on GitHub
val_bpb
1.0585
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB

Training Techniques

Architecture
weight tying
Tied embeddings are used in the base architecture.
parameters: null
SmearGate
SmearGate is used with a windowed gate (window of 12).
parameters: {"window":12}
XSA
XSA is applied across all layers.
parameters: {"layers":11}
depth recurrence
Layers 3-5 are looped twice during the forward pass.
parameters: {"layers":[3,4,5],"loops":2}
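A minimal sketch of the layer order this entry implies. It assumes 0-based layer indices, 11 layers in total (taken from the XSA entry), and that the block 3-5 is repeated as a unit rather than each layer being repeated individually. The helper name `layer_schedule` is hypothetical.

```python
def layer_schedule(num_layers, looped_layers, loops):
    """Return the order in which layer indices run in one forward pass."""
    first, last = min(looped_layers), max(looped_layers)
    order = list(range(first))                 # layers before the looped block
    for _ in range(loops):
        order.extend(range(first, last + 1))   # the looped block, run `loops` times
    order.extend(range(last + 1, num_layers))  # layers after the looped block
    return order


# 11 layers is taken from the XSA entry; 0-based indexing is an assumption.
schedule = layer_schedule(num_layers=11, looped_layers=[3, 4, 5], loops=2)
print(schedule)
```

Under these assumptions the forward pass runs 14 layer applications while storing the weights of only 11 layers.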
Partial RoPE
Partial rotary position embeddings are used.
parameters: {"dimensions":"16/64"}
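One way to read "16/64" is that 16 of the 64 dimensions in each head are rotated and the other 48 pass through unchanged. The sketch below follows that reading. The choice of the first 16 dimensions, the half-split pairing, and the base of 10000 are assumptions, not details from the PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector; leave the rest as-is."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair (i, i + half) is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


head = [float(i % 7) - 3.0 for i in range(64)]
rotated = partial_rope(head, position=5)
print(rotated[16:] == head[16:])  # the 48 non-rotary dimensions are untouched
```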
LeakyReLU
LeakyReLU activation is used in the MLP.
parameters: {"slope":0.5}
ReLU²
Squared ReLU-style MLP activation is used.
parameters: null
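The record lists LeakyReLU (slope 0.5) and a squared-ReLU-style activation as separate entries and does not say how they are combined in the MLP. The sketch below only defines the two scalar functions in their standard form. Any composition of them would be an assumption.

```python
def leaky_relu(x, slope=0.5):
    """Identity for non-negative inputs, `slope * x` for negative inputs."""
    return x if x >= 0.0 else slope * x


def relu_squared(x):
    """Square of the ReLU output: zero for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2


for x in (-2.0, 0.0, 3.0):
    print(x, leaky_relu(x), relu_squared(x))
```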
Gated Attention
Sparse/gated attention mechanisms are used.
parameters: null
Quantization
GPTQ
bits: 6
scope: weights and embeddings
GPTQ-lite
bits: 7
scope: embeddings
mixed int6/int7/int8
bits: null
scope: tok_emb.weight
Optimizer
Muon
weight_decay: null
momentum: 0.9
other_params: {"backend_steps":5}
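For orientation, the sketch below shows what a five-step Newton-Schulz orthogonalization of a momentum buffer can look like. The quintic coefficients are the ones used in the public Muon reference implementation, and assuming this PR uses the same ones is a guess. The plain momentum form and all helper names are likewise hypothetical. Pure-Python lists stand in for tensors.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(grad, steps=5, eps=1e-7):
    """Approximately orthogonalize `grad` with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed: public Muon reference coefficients
    norm = math.sqrt(sum(v * v for row in grad for v in row))
    x = [[v / (norm + eps) for v in row] for row in grad]
    tall = len(x) > len(x[0])
    if tall:  # iterate on the wide orientation so X X^T is the smaller Gram matrix
        x = transpose(x)
    for _ in range(steps):
        gram = matmul(x, transpose(x))  # A = X X^T
        gram2 = matmul(gram, gram)      # A^2
        poly = [[b * g + c * g2 for g, g2 in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)            # (b A + c A^2) X
        x = [[a * v + p for v, p in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


def momentum_update(buf, grad, momentum=0.9):
    """Plain momentum accumulation (an assumption): buf <- momentum * buf + grad."""
    return [[momentum * m + g for m, g in zip(r1, r2)] for r1, r2 in zip(buf, grad)]


buf = momentum_update([[0.0, 0.0], [0.0, 0.0]], [[3.0, 0.0], [0.0, 1.0]])
print(newton_schulz(buf, steps=5))
```

The iteration does not converge to exactly orthogonal output. It pushes all singular values into a band around 1, which is why a small fixed step count such as 5 is workable.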
Regularization
label smoothing
parameters: {"label_smooth":0}
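A small sketch of cross-entropy with label smoothing, to show what the `label_smooth` value controls. It assumes the common convention in which the smoothing mass is spread uniformly over all classes, including the target. With `label_smooth` set to 0, as listed in this record, it reduces to ordinary cross-entropy. Function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def smoothed_cross_entropy(logits, target, label_smooth=0.0):
    """Cross-entropy against a one-hot target mixed with a uniform distribution."""
    log_probs = log_softmax(logits)
    n = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Every class gets label_smooth / n; the target keeps the remaining mass.
        weight = label_smooth / n + ((1.0 - label_smooth) if i == target else 0.0)
        loss -= weight * lp
    return loss


logits = [2.0, 0.5, -1.0]
print(smoothed_cross_entropy(logits, target=0, label_smooth=0.0))
print(smoothed_cross_entropy(logits, target=0, label_smooth=0.1))
```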
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30,"asymmetric_eval":true}
weight decay
parameters: {"ttt_weight_decay":0.5}
Evaluation
long context eval
parameters: {"prefix_docs":2500,"num_phases":3}
Test-Time Training
score-first TTT
parameters: {"phases":3,"chunk_size":48,"lora_rank":80}
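"Score-first" is read here as: each chunk is scored with weights that have not yet seen it, and only afterwards is the model adapted on that chunk. The sketch below illustrates that ordering with a toy count-based model in place of LoRA adapters. The `ToyUnigram` class, the loop structure, and the omission of phases are all simplifying assumptions.

```python
import math


class ToyUnigram:
    """Stand-in for an adaptable model: Laplace-smoothed unigram counts."""

    def __init__(self, vocab_size):
        self.counts = [1] * vocab_size

    def score(self, chunk):
        """Total code length of the chunk in bits under the current counts."""
        total = sum(self.counts)
        return sum(-math.log2(self.counts[t] / total) for t in chunk)

    def adapt(self, chunk):
        """Update the model on tokens that have already been scored."""
        for t in chunk:
            self.counts[t] += 1


def score_first_ttt(model, tokens, chunk_size=48):
    """Return average bits per token, scoring each chunk before adapting on it."""
    total_bits = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_bits += model.score(chunk)  # score first, with no knowledge of this chunk
        model.adapt(chunk)                # then adapt for the benefit of later chunks
    return total_bits / len(tokens)


print(score_first_ttt(ToyUnigram(vocab_size=4), [0] * 96, chunk_size=48))
```

Because adaptation always trails scoring, the reported loss never uses weights trained on the tokens being evaluated.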
Other
AWQ-lite
AWQ-lite protects the most salient column groups at int8 precision.
parameters: {"enabled":true,"group_top_k":1}
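A rough sketch of the idea, under several assumptions: column groups are ranked by summed activation magnitude (a hypothetical saliency score), the top `group_top_k` groups are quantized at 8 bits, and the remaining groups at 6 bits (taken from the GPTQ entry). Round-to-nearest symmetric quantization stands in for the real quantizer, and the number of columns is assumed to be a multiple of `group_size`.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization; returns the dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def awq_lite(weight, act_scale, group_size, group_top_k=1, base_bits=6, protected_bits=8):
    """Quantize column groups, keeping the top-k salient groups at higher precision."""
    n_groups = len(act_scale) // group_size
    # Hypothetical saliency score: summed activation magnitude over the group's columns.
    scores = [sum(act_scale[g * group_size:(g + 1) * group_size]) for g in range(n_groups)]
    ranked = sorted(range(n_groups), key=lambda g: scores[g], reverse=True)
    protected = set(ranked[:group_top_k])
    bits = [protected_bits if g in protected else base_bits for g in range(n_groups)]
    dequant = []
    for row in weight:
        new_row = []
        for g in range(n_groups):
            group = row[g * group_size:(g + 1) * group_size]
            new_row.extend(quantize_symmetric(group, bits[g]))
        dequant.append(new_row)
    return dequant, bits


weight = [[0.1 * (c + 1) * (1 if (r + c) % 2 == 0 else -1) for c in range(8)] for r in range(2)]
act_scale = [0.1] * 4 + [5.0] * 4  # the second group sees much larger activations
dequant, bits = awq_lite(weight, act_scale, group_size=4)
print(bits)
```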
AsymLogit
AsymLogit replaces a single logit softcap with separate positive and negative learnable softcaps on the eval path.
parameters: {"enabled":true}
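To make the difference concrete, the sketch below contrasts a single softcap with an asymmetric one. It assumes the common `cap * tanh(x / cap)` form. In the PR the two caps are described as learnable, whereas here they are plain floats with made-up values.

```python
import math


def softcap(x, cap=30.0):
    """Symmetric softcap: smoothly bounds a logit to (-cap, cap)."""
    return cap * math.tanh(x / cap)


def asym_softcap(x, pos_cap=30.0, neg_cap=30.0):
    """Asymmetric softcap: positive and negative logits get separate caps."""
    cap = pos_cap if x >= 0.0 else neg_cap
    return cap * math.tanh(x / cap)


for logit in (-100.0, -5.0, 5.0, 100.0):
    # The caps of 20 and 40 are illustrative values, not taken from the PR.
    print(logit, softcap(logit), asym_softcap(logit, pos_cap=20.0, neg_cap=40.0))
```

With equal caps the asymmetric form reduces to the single softcap, so it can only match or extend the baseline behaviour.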
Gradient centralization
Gradient centralization subtracts the row mean from gradients inside Muon before the Newton-Schulz step.
parameters: {"enabled":false}
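The row-mean subtraction described above can be sketched in a few lines. This is only the centralization step for a 2-D gradient stored as a list of rows. The function name is hypothetical, and the record lists the feature as disabled ("enabled": false) for this run.

```python
def centralize_rows(grad):
    """Subtract each row's mean from that row of a 2-D gradient."""
    out = []
    for row in grad:
        mean = sum(row) / len(row)
        out.append([v - mean for v in row])
    return out


grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
print(centralize_rows(grad))
```

After this step every row sums to zero, and the centralized matrix would then be passed to the Newton-Schulz orthogonalization.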

Novel Contributions

  • AWQ-lite integration
  • AsymLogit integration
  • Gradient centralization support in Muon
  • Label smoothing support in training
  • Support for longer evaluation runs via a configurable number of phased-TTT prefix docs and phases