val_bpb: 1.1257
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.99 MB
Training Techniques
Quantization
GPTQ-lite (bits: 6, scope: all weights)
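As described under Novel Contributions, GPTQ-lite searches a per-layer optimal clip percentile during int6 quantization. A minimal pure-Python sketch of that idea (function names and the percentile grid are illustrative, not the submission's code):

```python
import random

def quantize_int6(w, clip):
    """Symmetric int6 fake-quantization of a weight list at a given clip value."""
    qmax = 31  # symmetric int6 grid: {-31, ..., 31}
    scale = clip / qmax
    return [round(max(-clip, min(clip, x)) / scale) * scale for x in w]

def best_clip_percentile(w, percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Pick the clip percentile that minimizes reconstruction MSE for this layer."""
    s = sorted(abs(x) for x in w)
    best = None
    for p in percentiles:
        clip = s[min(len(s) - 1, int(len(s) * p / 100.0))]
        deq = quantize_int6(w, clip)
        mse = sum((a - b) ** 2 for a, b in zip(w, deq)) / len(w)
        if best is None or mse < best[1]:
            best = (p, mse)
    return best

random.seed(0)
layer_w = [random.gauss(0.0, 0.02) for _ in range(4096)]
p, mse = best_clip_percentile(layer_w)
```

Clipping below the absolute max trades a little tail error for a finer quantization step over the bulk of the weights, which is why the optimum is searched per layer.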
Architecture
MLP3x: 3x MLP expansion with relu-squared activation (expansion: 3)
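The MLP3x block can be sketched in a few lines of pure Python; the weights here are toy hand-set values, the real ones are learned and vectorized:

```python
def relu2(x):
    # relu-squared activation: max(0, x)^2
    r = max(0.0, x)
    return r * r

def mlp3x(x, w_in, w_out):
    """MLP block with 3x hidden expansion: d -> 3d -> d, relu^2 in between.
    x: list of length d; w_in: 3d rows of length d; w_out: d rows of length 3d."""
    hidden = [relu2(sum(wi * xi for wi, xi in zip(row, x))) for row in w_in]
    return [sum(wo * hi for wo, hi in zip(row, hidden)) for row in w_out]

d = 4
x = [0.1, -0.2, 0.3, 0.05]
w_in = [[0.1] * d for _ in range(3 * d)]      # toy weights, all 0.1
w_out = [[0.1] * (3 * d) for _ in range(d)]
y = mlp3x(x, w_in, w_out)
```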
XSA: efficient partial XSA applied to the last 4 layers (last_n_layers: 4)
RoPE: partial RoPE with NTK-aware scaling, applied to 16 of 64 head dimensions
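A sketch of partial RoPE over the first 16 of 64 head dimensions. The NTK-aware base stretch shown here, base * alpha^(d/(d-2)), is one common formulation; the submission's exact scaling variant is not specified:

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0, ntk_alpha=1.0):
    """Rotate only the first rot_dims of the head vector; leave the rest untouched.
    NTK-aware scaling stretches the rotary base (formulation assumed)."""
    eff_base = base * ntk_alpha ** (rot_dims / (rot_dims - 2))
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos / (eff_base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = q[i], q[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out

head_dim = 64
q = [0.01 * i for i in range(head_dim)]
q_rot = partial_rope(q, pos=7, rot_dims=16, ntk_alpha=2.0)
```

Leaving the remaining 48 dimensions unrotated gives the model position-free channels, while the NTK stretch keeps the rotated frequencies usable beyond the training context.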
SmearGate: gating mechanism (no parameters)
BigramHash: hashed bigram embeddings (buckets: 2048, dim: 128)
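A plausible reading of BigramHash: hash each (previous token, current token) pair into one of 2048 buckets and look up a 128-dimensional embedding for it. The mixing constants below are illustrative, not the submission's hash:

```python
import random

BUCKETS = 2048
DIM = 128

def bigram_bucket(prev_tok, cur_tok, buckets=BUCKETS):
    # Mix the two token ids into one bucket index (hash function assumed).
    h = (prev_tok * 1000003 + cur_tok) * 2654435761 % (2 ** 32)
    return h % buckets

random.seed(0)
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

tokens = [5, 17, 17, 901]
# Each position t >= 1 gets the embedding row for its (t-1, t) bigram.
bigram_embs = [table[bigram_bucket(tokens[t - 1], tokens[t])]
               for t in range(1, len(tokens))]
```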
KV head count: grouped-query attention with 8 query heads sharing 4 KV heads
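The grouped-query layout maps each of the 8 query heads to one of 4 shared KV heads, halving KV-cache size relative to full multi-head attention:

```python
HEADS = 8
KV_HEADS = 4
GROUP = HEADS // KV_HEADS  # 2 query heads share each KV head

def kv_head_for(q_head):
    """Map a query head index to the KV head whose keys/values it reads."""
    return q_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```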
Value Embedding: shared value embedding used in later layers (dim: 128, layers: [9, 10])
Weight Averaging
SWA (stochastic weight averaging): every_steps: 50, checkpoint_count: 12, scale_threshold: 0.2
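A sketch of the averaging loop: snapshot every 50 steps, keep the last 12 checkpoints, average elementwise. How scale_threshold: 0.2 gates averaging (presumably on the LR scale) is not specified, so it is omitted here:

```python
from collections import deque

class WeightAverager:
    """Keep the last `checkpoint_count` snapshots, taken every `every_steps`
    steps, and expose their elementwise mean (weights as flat lists here)."""
    def __init__(self, every_steps=50, checkpoint_count=12):
        self.every_steps = every_steps
        self.snaps = deque(maxlen=checkpoint_count)

    def maybe_snapshot(self, step, weights):
        if step % self.every_steps == 0:
            self.snaps.append(list(weights))

    def averaged(self):
        n = len(self.snaps)
        return [sum(s[i] for s in self.snaps) / n
                for i in range(len(self.snaps[0]))]

avg = WeightAverager(every_steps=50, checkpoint_count=12)
for step in range(50, 701, 50):                  # 14 snapshot opportunities
    avg.maybe_snapshot(step, [float(step), float(step) * 2])
mean_w = avg.averaged()                          # mean over the last 12 snapshots
```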
Compression
zstd (level: 22)
Evaluation
Sliding-window evaluation (stride: 64)
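Sliding-window evaluation with stride 64 scores only the trailing stride tokens of each window, so every token is evaluated exactly once with long left context. A sketch of the window bookkeeping (the scoring itself is model-dependent):

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Yield (start, end, score_from) triples: each window spans [start, end)
    but only positions [score_from, end) contribute to the loss."""
    windows = []
    pos = 0
    while pos < n_tokens:
        end = min(n_tokens, pos + stride)
        start = max(0, end - context)
        windows.append((start, end, pos))
        pos = end
    return windows

wins = sliding_windows(n_tokens=300, context=128, stride=64)
```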
Test-Time Training
Self-distillation TTT (temperature: 2, freeze_blocks: 4, epochs: 2, learning_rate: 0.001)
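The self-distillation loss can be sketched as a temperature-2 KL divergence between the frozen teacher's and the student's distributions, scaled by T^2 as in standard distillation; whether the submission applies the T^2 factor is an assumption:

```python
import math

def softmax(logits, temperature=1.0):
    z = [l / temperature for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) at temperature T, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

same = distill_loss([1.0, 2.0, 0.5], [1.0, 2.0, 0.5])
diff = distill_loss([1.0, 2.0, 0.5], [2.0, 1.0, 0.5])
```

With freeze_blocks: 4, the first four blocks would stay frozen during the two TTT epochs, consistent with the stated goal of preserving the XSA attention patterns.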
Initialization
Orthogonal init: orthogonal initialization with projection scaling
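Orthogonal initialization via Gram-Schmidt on a random Gaussian matrix; the `gain` argument stands in for the unspecified "projection scaling" factor:

```python
import math
import random

def orthogonal_init(n, gain=1.0, seed=0):
    """Return an n x n matrix with orthonormal rows, scaled by `gain`."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for v in rows:
        # Subtract projections onto the rows already in the basis.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return [[gain * x for x in row] for row in basis]

Q = orthogonal_init(8)
```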
Sequence Length
train_length: 2048, eval_length: null
LR Schedule
Warmdown (warmdown_iters: 3000)
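The warmdown schedule holds the LR constant, then decays it to zero over the final 3000 iterations; the linear decay shape is assumed, and total_iters below is illustrative:

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    """Constant LR, then linear decay to 0 over the last `warmdown_iters` steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)

TOTAL = 10000
schedule = [lr_scale(s, TOTAL) for s in (0, 6999, 7000, 8500, 10000)]
```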
Regularization
Layerwise LN scale (scale_rule: 1/sqrt(layer_idx+1))
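The layerwise LN scale rule is just the stated formula, so deeper layers get progressively smaller LayerNorm gains:

```python
import math

def ln_scale(layer_idx):
    """Per-layer LayerNorm gain: 1/sqrt(layer_idx + 1)."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(4)]
```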
Other
Late QAT with STE int6, enabled once the LR scale drops below 0.1 (lr_scale_threshold: 0.1)
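Late QAT with a straight-through estimator (STE): once the LR scale falls below 0.1, the forward pass sees int6 fake-quantized weights while gradient updates still land on the full-precision master copy. A minimal single-weight sketch (the clip value is illustrative):

```python
def fake_quant_int6(w, clip):
    """Round a scalar weight onto the symmetric int6 grid within [-clip, clip]."""
    qmax = 31
    scale = clip / qmax
    return round(max(-clip, min(clip, w)) / scale) * scale

def late_qat_active(lr_scale, threshold=0.1):
    # QAT switches on only late in training, when the LR scale is small.
    return lr_scale < threshold

def qat_sgd_step(w_fp, grad, lr, clip=0.1):
    """STE step: forward uses the quantized weight, but the gradient is applied
    to the full-precision master weight as if quantization were the identity."""
    w_used = fake_quant_int6(w_fp, clip)   # what the forward pass computes with
    w_fp_new = w_fp - lr * grad            # straight-through update
    return w_used, w_fp_new

w = 0.0312
w_q, w = qat_sgd_step(w, grad=0.5, lr=0.01)
```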
Novel Contributions
- GPTQ-lite: per-layer optimal clip percentile search during int6 quantization
- Self-distillation TTT using a frozen teacher to preserve XSA attention patterns
- Late QAT with STE int6 during training