PR #2101

open

Record: AWQ-lite + AsymLogit + GradCentral + ... val_bpb=1.05845

by OnlyJundongView on GitHub

val_bpb

1.05845

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.98 MB

Training Techniques

Quantization

GPTQ

bits: 6

scope: block weights

GPTQ

bits: 7

scope: embeddings

GPTQ-lite

bits: null

scope: tok_emb.weight

int8

bits: 8

scope: top-K salient groups
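
The mixed-precision idea in the entries above (most groups at low bit width, the top-K salient groups kept at int8) can be sketched as follows. This is only an illustration under stated assumptions: it uses plain round-to-nearest quantization rather than GPTQ's error-compensated procedure, and the function names, the `saliency` scores, and the sample weights are hypothetical rather than taken from the PR.

```python
def quantize_group(values, bits):
    """Symmetric round-to-nearest quantization of one weight group.

    Returns the dequantized values so the rounding error is easy to inspect.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]


def mixed_precision(groups, saliency, top_k, low_bits=6, high_bits=8):
    """Give the `top_k` most salient groups `high_bits`, the rest `low_bits`.

    Assumption: `saliency` holds one score per group (for example an
    activation-magnitude statistic, as in AWQ-style methods).
    """
    ranked = sorted(range(len(groups)), key=lambda i: saliency[i], reverse=True)
    keep_high = set(ranked[:top_k])
    return [
        quantize_group(g, high_bits if i in keep_high else low_bits)
        for i, g in enumerate(groups)
    ]


# Hypothetical weights: the second group is marked as the salient one.
groups = [[0.10, -0.37, 0.91], [0.05, 0.40, -0.83]]
saliency = [0.2, 5.0]
out = mixed_precision(groups, saliency, top_k=1)
```

The salient group is rounded on a 127-level grid per sign instead of a 31-level one, so its worst-case rounding error is roughly four times smaller for the same weight range.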

Architecture

SmearGate

Attention gating mechanism with BOS fix

parameters: {"window":12}

XSA

Applied across all layers

parameters: {"layers":11}

depth recurrence

Selected layers are executed more than once during the forward pass

parameters: {"layers":[3,4,5],"loops":2}
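
A minimal sketch of the layer execution order these parameters suggest. It assumes an 11-layer stack (the layer count listed under XSA), that layers 3 to 5 form one contiguous block, and that `loops` counts total passes through that block; `layer_schedule` is a hypothetical helper, not a function from the PR.

```python
def layer_schedule(num_layers, looped, loops):
    """Return the order in which layer indices are executed.

    Assumption: `looped` is a contiguous run of layer indices that is
    executed `loops` times in total before the remaining layers run.
    """
    first, last = looped[0], looped[-1]
    order = list(range(0, first))              # layers before the looped block
    for _ in range(loops):                     # reuse the same weights each pass
        order.extend(looped)
    order.extend(range(last + 1, num_layers))  # layers after the looped block
    return order


schedule = layer_schedule(num_layers=11, looped=[3, 4, 5], loops=2)
print(schedule)
```

Under these assumptions the model runs 14 layer applications while storing only 11 layers of weights, which is the appeal of depth recurrence when the artifact size is capped.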

Partial RoPE

Partial rotary position embeddings

parameters: {"dimensions":[16,64]}
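
One plausible reading of `[16,64]` is that 16 of the 64 dimensions in each attention head receive a rotary embedding and the other 48 pass through unchanged. The sketch below illustrates that reading only; the pairing layout, the base of 10000, and the name `partial_rope` are assumptions.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; copy the rest unchanged."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair (i, i + half) is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


head = [1.0] * 64                 # one hypothetical 64-dimensional head vector
rotated = partial_rope(head, position=3)
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the untouched dimensions carry position-independent content.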

LeakyReLU

MLP activation choice

parameters: {"slope":0.5}

weight tying

Tied embeddings

parameters: null

Gated Attention

Sparse/gated attention variants used in the model

parameters: null

Optimizer

Muon

weight_decay: null

momentum: 0.9

other_params: {"backend_steps":5,"grad_centralize":true}
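
The `grad_centralize` flag presumably refers to gradient centralization, which removes the mean from each row of a weight gradient before the optimizer consumes it. The sketch below shows that operation on a plain list-of-rows matrix; where exactly it is applied inside the Muon step is not stated in the source, and `centralize_gradient` is a hypothetical name.

```python
def centralize_gradient(grad):
    """Subtract each row's mean so every row of the gradient sums to zero.

    `grad` is a list of rows, one row per output unit.
    """
    centered = []
    for row in grad:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]]
centered = centralize_gradient(grad)
print(centered)
```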

Evaluation

long context eval

parameters: {"prefix_docs":2500,"num_phases":3}

Test-Time Training

score-first TTT

parameters: {"phases":3,"chunk_size":48,"lora_rank":80}
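
"Score-first" is read here as: each chunk is scored with the current weights before the model adapts on it, so no token is evaluated after the model has trained on it. The toy below illustrates only that ordering. A smoothed unigram counter stands in for the LoRA adapter, and every name in it is hypothetical.

```python
import math


def score_first_ttt(tokens, chunk_size, vocab_size):
    """Score each chunk with the current state, then adapt on that chunk."""
    counts = [1] * vocab_size     # toy stand-in for adapter weights
    total_nll, scored = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        denom = sum(counts)
        # 1) Score first: the chunk is evaluated before it is trained on.
        for t in chunk:
            total_nll -= math.log(counts[t] / denom)
            scored += 1
        # 2) Only then adapt on the tokens that were just scored.
        for t in chunk:
            counts[t] += 1
    return total_nll / scored


tokens = [0, 1] * 96              # hypothetical token stream
adapted = score_first_ttt(tokens, chunk_size=48, vocab_size=4)
frozen = score_first_ttt(tokens, chunk_size=len(tokens), vocab_size=4)
```

With a single chunk covering the whole stream, nothing is learned before scoring, so `frozen` is the unadapted baseline; chunked adaptation lowers the average loss on later chunks without ever scoring a token the model has already trained on.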

Regularization

label smoothing

parameters: {"label_smooth":0}
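
A value of 0 leaves the loss as ordinary cross-entropy. For reference, a minimal sketch of label-smoothed cross-entropy, assuming the common convention in which the smoothing mass is spread uniformly over all classes; the function name and sample logits are hypothetical.

```python
import math


def smoothed_cross_entropy(logits, target, label_smooth=0.0):
    """Cross-entropy against a target distribution softened by `label_smooth`."""
    peak = max(logits)            # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]
    n = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # True class: 1 - eps + eps/n. Every other class: eps/n.
        weight = label_smooth / n + (1.0 - label_smooth if i == target else 0.0)
        loss -= weight * lp
    return loss


logits = [2.0, 0.5, -1.0]
plain = smoothed_cross_entropy(logits, target=0, label_smooth=0.0)
smoothed = smoothed_cross_entropy(logits, target=0, label_smooth=0.1)
```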

logit softcap

parameters: {"softcap":30,"asymmetric_eval":true}
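
A logit softcap is commonly implemented as `cap * tanh(logit / cap)`, which is close to the identity for small logits and saturates at the cap. The sketch below extends that to separate positive and negative caps, as the AsymLogit description suggests. The parameter names and the piecewise form are assumptions; only the cap value of 30 comes from the source.

```python
import math


def asym_softcap(logit, pos_cap=30.0, neg_cap=30.0):
    """Squash a logit into (-neg_cap, pos_cap) with a tanh on each side.

    Both branches have slope 1 at zero, so the function stays smooth there
    even when the two caps differ.
    """
    cap = pos_cap if logit >= 0.0 else neg_cap
    return cap * math.tanh(logit / cap)


# Hypothetical asymmetric setting: a tighter bound on negative logits.
high = asym_softcap(1000.0, pos_cap=30.0, neg_cap=20.0)
low = asym_softcap(-1000.0, pos_cap=30.0, neg_cap=20.0)
```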

layerwise LN scale

parameters: null

Novel Contributions

AWQ-lite integration on top of PR #1855
AsymLogit integration with separate positive and negative softcap parameters
Gradient centralization added as an optional Muon optimizer feature
Label smoothing added as an optional training regularizer
Extended eval-time support: the number of phased-TTT prefix docs and the number of phases are now configurable