PR #2101
open
Record: AWQ-lite + AsymLogit + GradCentral + ... val_bpb=1.05845
by OnlyJundongView on GitHub
val_bpb
1.0584
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: block weights
GPTQ
bits: 7
scope: embeddings
GPTQ-lite
bits: null
scope: tok_emb.weight
int8
bits: 8
scope: top-K salient groups
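The "top-K salient groups" entry can be illustrated with a minimal sketch. It assumes salience is scored as activation-weighted weight magnitude per group (an AWQ-style heuristic) and uses plain round-to-nearest quantization in place of GPTQ. The function names, the group size, and the scoring rule are hypothetical and are not taken from the PR.

```python
def select_salient_groups(weights, act_scale, group_size, top_k):
    """Rank groups by sum(|w| * activation scale) and return the top_k indices."""
    n_groups = len(weights) // group_size
    scores = []
    for g in range(n_groups):
        lo, hi = g * group_size, (g + 1) * group_size
        scores.append(sum(abs(w) * s for w, s in zip(weights[lo:hi], act_scale[lo:hi])))
    ranked = sorted(range(n_groups), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:top_k])


def quantize_group(values, bits):
    """Symmetric round-to-nearest quantization; returns the dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mixed_precision_quantize(weights, act_scale, group_size=4, top_k=1,
                             low_bits=6, high_bits=8):
    """Keep the top_k salient groups at high_bits, everything else at low_bits."""
    salient = select_salient_groups(weights, act_scale, group_size, top_k)
    out = []
    for g in range(len(weights) // group_size):
        lo, hi = g * group_size, (g + 1) * group_size
        bits = high_bits if g in salient else low_bits
        out.extend(quantize_group(weights[lo:hi], bits))
    return out, salient


# Toy data: the second group has the larger activation-weighted magnitude.
weights = [0.1, 0.2, -0.1, 0.05, 1.0, -0.9, 0.8, 0.7]
act_scale = [1.0] * 8
dequantized, salient = mixed_precision_quantize(weights, act_scale)
```

In this toy run the second group is selected for int8 and the first stays at 6 bits, so the larger weights get the finer quantization step.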
Architecture
SmearGate
Attention gating mechanism with BOS fix
parameters: {"window":12}
XSA
Applied across all layers
parameters: {"layers":11}
depth recurrence
Selected layers are looped multiple times during the forward pass
parameters: {"layers":[3,4,5],"loops":2}
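One possible reading of the depth-recurrence parameters is sketched below. It assumes the looped layers form a contiguous block, that "loops" counts the total number of passes through that block, and that the model has 11 layers (the count listed under XSA). The helper name `layer_schedule` is hypothetical.

```python
def layer_schedule(num_layers, looped, loops):
    """Return the order in which layer indices are executed in one forward pass."""
    first, last = looped[0], looped[-1]
    order = list(range(first))               # layers before the looped block
    for _ in range(loops):                   # repeat the block `loops` times
        order.extend(looped)
    order.extend(range(last + 1, num_layers))  # layers after the looped block
    return order


schedule = layer_schedule(num_layers=11, looped=[3, 4, 5], loops=2)
print(schedule)
```

Under these assumptions the pass executes 14 layer applications while storing only 11 layers of weights, which is the usual motivation for looping when artifact size is constrained.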
Partial RoPE
Partial rotary position embeddings
parameters: {"dimensions":[16,64]}
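The following sketch shows what partial rotary embeddings typically look like. It assumes `[16,64]` means that the first 16 of 64 head dimensions are rotated and the rest pass through unchanged. The interleaved pairing and the base of 10000 are also assumptions, since the PR summary does not state them.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched."""
    out = list(vec)
    for i in range(rotary_dims // 2):
        # Each (even, odd) pair gets its own rotation frequency.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=3)
```

Because each pair is rotated, the vector norm is preserved, and dimensions 16 to 63 carry no positional signal.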
LeakyReLU
MLP activation choice
parameters: {"slope":0.5}
weight tying
Tied embeddings
parameters: null
Gated Attention
Sparse/gated attention variants used in the model
parameters: null
Optimizer
Muon
weight_decay: null
momentum: 0.9
other_params: {"backend_steps":5,"grad_centralize":true}
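The `grad_centralize` flag presumably refers to gradient centralization, where the mean of each output row of a weight gradient is subtracted before the optimizer update. A minimal sketch on nested lists follows. Where exactly this step sits relative to Muon's orthogonalization in the PR is not stated, so the sketch shows only the centering itself.

```python
def centralize_gradient(grad):
    """Subtract the per-row mean from a 2D gradient (rows = output units)."""
    centered = []
    for row in grad:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
centered = centralize_gradient(grad)
```

After centering, every row sums to zero, so a row whose gradient is constant contributes no update.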
Evaluation
long context eval
parameters: {"prefix_docs":2500,"num_phases":3}
Test-Time Training
score-first TTT
parameters: {"phases":3,"chunk_size":48,"lora_rank":80}
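The ordering implied by "score-first" can be sketched as follows: each chunk is scored with parameters that have not yet seen it, and only afterwards is the model adapted on that chunk. This is a toy, assuming a single scalar parameter and squared error in place of the real model and its LoRA adapters. The `phases` setting is not modeled, and `score_first_ttt` is a hypothetical name.

```python
def score_first_ttt(values, chunk_size=48, lr=0.5):
    """Score each chunk before adapting on it; return (mean loss, final parameter)."""
    mu = 0.0            # stand-in for the adapted parameters
    total_loss = 0.0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        # 1) Score first: the loss uses parameters that never saw this chunk.
        total_loss += sum((x - mu) ** 2 for x in chunk)
        # 2) Then adapt: one gradient step on the chunk just scored.
        grad = sum(2.0 * (mu - x) for x in chunk) / len(chunk)
        mu -= lr * grad
    return total_loss / len(values), mu


mean_loss, final_mu = score_first_ttt([1.0] * 96)
```

In this run the first chunk is scored at full loss and the second at zero, because the adaptation from chunk one only benefits tokens that come after it.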
Regularization
label smoothing
parameters: {"label_smooth":0}
logit softcap
parameters: {"softcap":30,"asymmetric_eval":true}
layerwise LN scale
parameters: null
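The logit softcap entry above can be written out as a small sketch. It assumes the common `cap * tanh(x / cap)` form and that the asymmetric variant simply uses one cap for positive logits and another for negative ones. The parameter names and the example negative cap of 15 are hypothetical; only the value 30 comes from the summary.

```python
import math


def asym_softcap(logit, pos_cap=30.0, neg_cap=30.0):
    """Smoothly bound a logit to (-neg_cap, pos_cap) using tanh."""
    cap = pos_cap if logit >= 0 else neg_cap
    return cap * math.tanh(logit / cap)


capped_high = asym_softcap(1000.0, pos_cap=30.0, neg_cap=15.0)
capped_low = asym_softcap(-1000.0, pos_cap=30.0, neg_cap=15.0)
near_zero = asym_softcap(0.3)
```

With equal caps this reduces to the ordinary symmetric softcap, and for small logits the function is close to the identity.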
Novel Contributions
- AWQ-lite integration on top of PR #1855
- AsymLogit integration with separate positive and negative softcap parameters
- Gradient centralization added as an optional Muon optimizer feature
- Label smoothing added as an optional training regularizer
- Extended eval-time support: the number of prefix docs for phased TTT and the number of phases are now configurable
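For the optional label-smoothing regularizer listed above, a minimal sketch of the usual formulation is given here: the loss mixes the standard negative log-likelihood with the mean negative log-probability over all classes. The function name is hypothetical, and the PR's exact implementation is not shown in the summary. With `label_smooth=0`, as recorded under Regularization, the loss is plain cross-entropy.

```python
import math


def smoothed_cross_entropy(logits, target, label_smooth=0.0):
    """Cross-entropy with uniform label smoothing over all classes."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_z for l in logits]
    nll = -log_probs[target]
    uniform = -sum(log_probs) / len(log_probs)
    return (1.0 - label_smooth) * nll + label_smooth * uniform


plain = smoothed_cross_entropy([2.0, 0.0], target=0, label_smooth=0.0)
smoothed = smoothed_cross_entropy([2.0, 0.0], target=0, label_smooth=0.1)
```

Smoothing raises the loss on confident correct predictions, which discourages the model from pushing logits to extremes.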