PR #1586
openRecord: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean)
by dexhunter
val_bpb
1.0749
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.93 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: MLP and attention weight matrices
GPTQ
bits: 7
scope: token embeddings
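The quantization entries above pair low-bit weights with per-layer clipping. Below is a minimal numpy sketch of symmetric round-to-nearest quantization where the clip threshold is expressed in units of the weight standard deviation (a `clip_sigma` knob, one value per layer). This is a simplified stand-in: real GPTQ additionally does a Hessian-weighted error-correction pass, which is omitted here, and the exact adaptive clip rule used in the PR is an assumption.

```python
import numpy as np

def quantize_clipped(w, bits, clip_sigma):
    """Symmetric round-to-nearest quantization with a sigma-based clip.

    Simplified stand-in for GPTQ: the Hessian-weighted error-correction
    pass of the real algorithm is omitted.
    """
    clip = clip_sigma * w.std()                 # per-layer clip threshold
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6-bit, 63 for 7-bit
    scale = clip / qmax
    # Use the full two's-complement range [-qmax-1, qmax].
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
# Hypothetical per-layer setting; the PR tunes different clip_sigmas
# for MLP vs. attention layers.
q, s = quantize_clipped(w, bits=6, clip_sigma=3.0)
w_hat = dequantize(q, s)
```

Within the clip range the reconstruction error is at most half a quantization step (`scale / 2`); only the rare weights beyond `clip_sigma` standard deviations are clipped harder.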
Architecture
weight tying
Tied input and output token embedding matrices
parameters: null
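Weight tying reuses one embedding matrix both to look up token vectors and to produce output logits, so the unembedding matrix costs nothing extra in the artifact. A minimal sketch (shapes are illustrative, not taken from the PR):

```python
import numpy as np

vocab, d_model = 1000, 64                       # illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.02, (vocab, d_model))     # the single shared matrix

def embed(token_ids):
    return E[token_ids]                         # (seq, d_model)

def unembed(hidden):
    return hidden @ E.T                         # (seq, vocab) logits
```

With tying, only `E` has to be stored and quantized once, which interacts directly with the size budget.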
LeakyReLU
Uses LeakyReLU activation in the MLP
parameters: {"slope":0.5}
Partial RoPE
Rotary positional embeddings applied to only a subset of each head's dimensions
parameters: {"dimensions":"16/64"}
depth recurrence
Triple recurrence with selected layers looped multiple times
parameters: {"layers":"3-5","loops":2}
U-Net skip connections
Sigmoid-gated U-Net style skip connections
parameters: null
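A U-Net style skip saves an early-layer activation and mixes it back into a matching late layer; here the mix is modulated by a learned sigmoid gate. A minimal sketch assuming a scalar gate per skip (the gate's shape in the PR is not specified):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GatedSkip:
    """Adds a saved early activation into a later layer's input,
    scaled by sigmoid(gate_logit). Scalar gate is an assumption."""
    def __init__(self, gate_logit=0.0):
        self.gate_logit = gate_logit    # learned; 0.0 -> gate of 0.5

    def __call__(self, x_deep, x_skip):
        g = sigmoid(self.gate_logit)
        return x_deep + g * x_skip
```

Initializing the gate logit near zero starts the skip at half strength, letting training open or close each connection.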
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: 0.5
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: {"decay":0.9965}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
Test-Time Training
LoRA TTT
parameters: {"rank":96,"learning_rate":0.0001,"chunk_size":48,"weight_decay":0.5,"score_first":true,"doc_independent":true}
Compression
Brotli
level: 11
Novel Contributions
- Per-layer adaptive GPTQ clipping with different clip_sigmas for MLP and attention layers
- int7 token embeddings to reduce artifact size while preserving quality
- Systematic tuning of MATRIX_LR to 0.026
- Combining GPTQ quantization with doc-independent LoRA test-time training under the size budget