PR #2097

closed

record: AWQ-lite + AsymLogit + GradCentr + LabSmooth - 1.05846 BPB

by OnlyJundongView on GitHub

val_bpb

1.0585

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.98 MB

Training Techniques

Architecture

weight tying

Tied embeddings are used in the base architecture.

parameters: null

SmearGate

SmearGate attention is used with a windowed gate mechanism (window of 12).

parameters: {"window":12}

XSA

XSA is applied across all 11 layers.

parameters: {"layers":11}

depth recurrence

Layers 3-5 are looped twice during the forward pass.

parameters: {"layers":[3,4,5],"loops":2}
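
The execution order this implies can be sketched as follows. This is a minimal illustration, assuming 0-based layer indices, an 11-layer stack (the count listed under XSA), and that "looped twice" means the block runs two times in total; `layer_schedule` is a hypothetical helper, not a name from the PR.

```python
def layer_schedule(num_layers, loop_layers, loops):
    """Return the order in which layer indices run in one forward pass."""
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(0, first))               # layers before the looped block
    for _ in range(loops):                      # run the block `loops` times
        order.extend(range(first, last + 1))
    order.extend(range(last + 1, num_layers))   # layers after the looped block
    return order

schedule = layer_schedule(num_layers=11, loop_layers=[3, 4, 5], loops=2)
print(schedule)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

Under these assumptions the forward pass has an effective depth of 14 layer applications while storing only 11 layers of weights.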

Partial RoPE

Partial rotary position embeddings are used: 16 of 64 dimensions are rotated.

parameters: {"dimensions":"16/64"}
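
A minimal sketch of partial RoPE on a single vector, assuming that 64 is the per-head dimension, that the first 16 entries are the rotated ones, that dimension i is paired with i + 8, and that the frequency base is 10000. None of these details are stated in the record, and `partial_rope` is a hypothetical name.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; pass the rest through."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        x, y = vec[i], vec[i + half]            # pair dimension i with i + half
        out[i] = x * math.cos(theta) - y * math.sin(theta)
        out[i + half] = x * math.sin(theta) + y * math.cos(theta)
    return out

head = [1.0] * 64
rotated = partial_rope(head, position=5)
```

The remaining 48 dimensions carry no positional signal, so they can act as position-independent content channels.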

LeakyReLU

LeakyReLU activation is used in the MLP.

parameters: {"slope":0.5}

ReLU²

Squared ReLU-style MLP activation is used.

parameters: null
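
The record lists both LeakyReLU (slope 0.5) and ReLU² without saying how they combine. The sketch below shows each activation on a scalar, plus one plausible reading in which the LeakyReLU output is squared; that composition is an assumption, and all three function names are hypothetical.

```python
def leaky_relu(x, slope=0.5):
    """Identity for non-negative inputs, `slope * x` for negative inputs."""
    return x if x >= 0.0 else slope * x

def relu_squared(x):
    """Square of the ReLU output."""
    return max(x, 0.0) ** 2

def leaky_relu_squared(x, slope=0.5):
    """Assumed composition: square the LeakyReLU output."""
    return leaky_relu(x, slope) ** 2

for x in (-2.0, 0.0, 3.0):
    print(x, leaky_relu(x), relu_squared(x), leaky_relu_squared(x))
```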

Gated Attention

Sparse/gated attention mechanisms are used.

parameters: null

Quantization

GPTQ

bits: 6

scope: weights and embeddings

GPTQ-lite

bits: 7

scope: embeddings

mixed int6/int7/int8

bits: null

scope: tok_emb.weight
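
To make the bit widths concrete, here is a sketch of symmetric per-row quantization at 6, 7, and 8 bits. It uses plain round-to-nearest, not GPTQ's error-compensated procedure, and the function names and sample row are hypothetical.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one row to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6, 63 for int7, 127 for int8
    scale = max(abs(v) for v in row) / qmax or 1.0  # fall back to 1.0 for an all-zero row
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]

row = [0.8, -0.31, 0.05, -1.2, 0.47]
for bits in (6, 7, 8):
    codes, scale = quantize_row(row, bits)
    restored = dequantize_row(codes, scale)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(bits, worst)
```

Each extra bit roughly halves the step size, so the worst-case rounding error bound (half a step) halves as well.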

Optimizer

Muon

weight_decay: null

momentum: 0.9

other_params: {"backend_steps":5}
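
A rough sketch of how momentum 0.9 and 5 backend steps fit together, assuming `backend_steps` is the number of Newton-Schulz iterations. For simplicity it uses the classic cubic iteration as a stand-in; Muon implementations typically use a tuned higher-order polynomial and often Nesterov momentum, neither of which is reproduced here. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=5):
    """Approximately orthogonalize `g` via X <- 1.5 X - 0.5 X X^T X."""
    norm = sum(v * v for row in g for v in row) ** 0.5 or 1.0
    x = [[v / norm for v in row] for row in g]  # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

def muon_direction(grad, momentum_buf, momentum=0.9, backend_steps=5):
    """Accumulate momentum in place, then orthogonalize the buffer."""
    for i, row in enumerate(grad):
        for j, gij in enumerate(row):
            momentum_buf[i][j] = momentum * momentum_buf[i][j] + gij
    return newton_schulz(momentum_buf, backend_steps)

buf = [[0.0, 0.0], [0.0, 4.0 * 0.0]]
update = muon_direction([[3.0, 0.0], [0.0, 4.0]], buf)
print(update)  # close to the 2x2 identity for this well-conditioned input
```

The returned matrix is the update direction; a learning rate and any weight decay would be applied by the caller.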

Regularization

label smoothing

parameters: {"label_smooth":0}
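
For reference, a sketch of label-smoothed cross-entropy using the common formulation that mixes the one-hot target with a uniform distribution. Whether the PR uses exactly this formulation is an assumption; the function name is hypothetical. With `label_smooth` set to 0, as recorded here, it reduces to ordinary cross-entropy.

```python
import math

def smoothed_cross_entropy(logits, target, label_smooth=0.0):
    """Cross-entropy in nats against a target smoothed toward the uniform distribution."""
    peak = max(logits)                          # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_probs = [v - log_z for v in logits]
    nll = -log_probs[target]                    # loss against the one-hot target
    uniform = -sum(log_probs) / len(logits)     # loss against the uniform distribution
    return (1.0 - label_smooth) * nll + label_smooth * uniform

logits = [2.0, 0.5, -1.0]
print(smoothed_cross_entropy(logits, target=0, label_smooth=0.0))
print(smoothed_cross_entropy(logits, target=0, label_smooth=0.1))
```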

layerwise LN scale

parameters: null

logit softcap

parameters: {"value":30,"asymmetric_eval":true}
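
A sketch of a softcap with value 30, assuming the common tanh form; the record gives only the cap value, and `softcap` is a hypothetical name.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap] with cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)

for value in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
    print(value, softcap(value))
```

Small logits pass through almost unchanged, while extreme logits saturate near the cap.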

weight decay

parameters: {"ttt_weight_decay":0.5}

Evaluation

long context eval

parameters: {"prefix_docs":2500,"num_phases":3}

Test-Time Training

score-first TTT

parameters: {"phases":3,"chunk_size":48,"lora_rank":80}
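
One reading of "score-first" is that each chunk is scored with the current weights before the model adapts on it, so no token is evaluated by a model that has already trained on it. The sketch below shows that loop on a toy scalar model with the recorded chunk size of 48. The phases and the rank-80 LoRA adapters are omitted, and every name is hypothetical.

```python
def score_first_ttt(tokens, chunk_size, score_fn, adapt_fn, state):
    """Score each chunk with the current state, then adapt on that same chunk."""
    total_loss, count = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_fn(state, chunk)    # 1) score before any update on this chunk
        count += len(chunk)
        state = adapt_fn(state, chunk)          # 2) only then learn from it
    return total_loss / count, state

# Toy model: the state is a single predicted value, the loss is squared error.
def squared_error(state, chunk):
    return sum((v - state) ** 2 for v in chunk)

def fit_mean(state, chunk):
    return sum(chunk) / len(chunk)

mean_loss, final_state = score_first_ttt([1.0] * 96, 48, squared_error, fit_mean, 0.0)
print(mean_loss, final_state)  # 0.5 1.0
```

The first chunk is scored with the untrained state and pays the full loss; only the second chunk benefits from adaptation.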

Other

other

AWQ-lite protects the most salient column groups at int8 precision.

parameters: {"enabled":true,"group_top_k":1}
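
A sketch of the selection step, assuming salience is measured by mean activation magnitude per input-column group (as in AWQ-style methods) and that unprotected groups fall back to int6. The group size, the salience measure, the fallback width, and the function name are all assumptions rather than details from the PR.

```python
def assign_group_bits(act_scale, group_size, group_top_k=1, base_bits=6, protected_bits=8):
    """Pick a bit width per column group, protecting the top-k groups by salience."""
    groups = [act_scale[i:i + group_size] for i in range(0, len(act_scale), group_size)]
    salience = [sum(g) / len(g) for g in groups]    # mean activation magnitude per group
    ranked = sorted(range(len(groups)), key=lambda i: salience[i], reverse=True)
    protected = set(ranked[:group_top_k])
    return [protected_bits if i in protected else base_bits for i in range(len(groups))]

# Hypothetical per-column activation magnitudes for an 8-column weight matrix.
act_scale = [0.1, 0.2, 0.1, 0.3, 4.0, 5.0, 0.2, 0.1]
bits = assign_group_bits(act_scale, group_size=2)
print(bits)  # [6, 6, 8, 6]
```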

other

AsymLogit replaces a single logit softcap with separate positive and negative learnable softcaps on the eval path.

parameters: {"enabled":true}
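
A sketch of the asymmetric variant, assuming a tanh softcap on each side of zero. In the PR the two caps are learnable; here they are plain floats with illustrative values, and `asym_softcap` is a hypothetical name.

```python
import math

def asym_softcap(logit, pos_cap, neg_cap):
    """Tanh softcap with separate bounds for positive and negative logits."""
    cap = pos_cap if logit >= 0.0 else neg_cap
    return cap * math.tanh(logit / cap)

for value in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
    print(value, asym_softcap(value, pos_cap=30.0, neg_cap=20.0))
```

Both branches equal zero with slope one at the origin, so the function stays continuous and smooth where the caps switch.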

other

Gradient centralization subtracts the row mean from gradients inside Muon before the Newton-Schulz step.

parameters: {"enabled":false}
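
The centralization step itself can be sketched as below, on a gradient stored as a list of rows. Note that the record lists this option as disabled. `centralize_rows` is a hypothetical name.

```python
def centralize_rows(grad):
    """Subtract each row's mean so every row of the gradient sums to zero."""
    out = []
    for row in grad:
        mean = sum(row) / len(row)
        out.append([g - mean for g in row])
    return out

grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]]
centered = centralize_rows(grad)
print(centered)  # [[-1.0, 0.0, 1.0], [-2.0, -2.0, 4.0]]
```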

Novel Contributions

AWQ-lite integration
AsymLogit integration
Gradient centralization support in Muon
Label smoothing support in training
Support for longer evaluation via a configurable number of phased-TTT prefix docs and phases