PR #1226
open
Non-record: 4090 single-GPU ablations on ValCalib GPTQ + XSA stack (partial logs)
by Wolfie8935
val_bpb
1.1428
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,965,978 bytes
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: MLP and attention weights
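The title names GPTQ with validation-set calibration, but as a minimal sketch of the mixed-bit-width idea itself, here is a plain round-to-nearest per-channel symmetric quantizer (not GPTQ's Hessian-corrected updates; the 512-dim shapes are illustrative assumptions):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-to-nearest per-channel symmetric quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                       # 15 for int5, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # guard all-zero channels
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # dequantized weights

rng = np.random.default_rng(0)
mlp_w = rng.standard_normal((512, 1536))
attn_w = rng.standard_normal((512, 512))

mlp_deq = quantize_symmetric(mlp_w, bits=5)    # int5 on MLP weights
attn_deq = quantize_symmetric(attn_w, bits=6)  # int6 on attention weights
```

Spending the extra bit on attention while squeezing the (larger) MLP matrices is what frees the parameter budget cited under Novel Contributions.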
Architecture
BigramHash
Hash-based bigram embedding over consecutive token pairs with learned projection to model dimension.
parameters: {"buckets":10240,"dim":128}
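A minimal sketch of the hashed-bigram embedding, using the bucket/dim values from the listing; the specific hash function and the pad token for the first position are assumptions, not taken from the PR:

```python
import numpy as np

BUCKETS, DIM = 10240, 128   # parameters from this PR

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    """Hash a consecutive token pair into one of `buckets` embedding slots.
    Illustrative multiplicative-mixing hash; the PR's actual hash is unspecified."""
    h = (prev_tok * 1000003 + tok) * 2654435761 % (2 ** 32)
    return h % buckets

rng = np.random.default_rng(0)
bigram_emb = rng.standard_normal((BUCKETS, DIM)) * 0.02   # learned table
proj = rng.standard_normal((DIM, DIM)) * 0.02             # learned projection

def bigram_features(tokens):
    """One hashed-bigram vector per position (first pairs with a pad id of 0)."""
    ids = [bigram_bucket(p, t) for p, t in zip([0] + tokens[:-1], tokens)]
    return bigram_emb[ids] @ proj                          # (len(tokens), DIM)

feats = bigram_features([5, 17, 17, 942])
```

With 10240 buckets, distinct token pairs collide less often than in a smaller table, which is the motivation for the enlargement listed under Novel Contributions.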
weight tying
Tied embeddings between input and output representations.
parameters: null
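Weight tying shares one matrix between the input embedding and the output unembedding, roughly halving embedding parameters. A minimal sketch (vocab/dim sizes assumed for illustration):

```python
import numpy as np

vocab, dim = 1000, 128
embedding = np.random.default_rng(0).standard_normal((vocab, dim)) * 0.02

def embed(token_ids):
    """Input side: look up token embeddings."""
    return embedding[token_ids]

def logits(hidden):
    """Output side: reuse the same matrix, transposed, as the unembedding."""
    return hidden @ embedding.T
```

Any gradient update to `embedding` moves both representations at once, which is the point of tying.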
SmearGate
Learned gate that blends each token's representation with the preceding token's ("smearing" across adjacent positions).
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
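A numpy sketch of grouped query attention with the listed head counts (8 query heads over 4 KV heads, so each pair of query heads reads the same KV head):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (T, 8, d) query heads; k, v: (T, 4, d) shared KV heads.
    Each group of heads/kv_heads query heads attends to one KV head."""
    heads, kv_heads = q.shape[1], k.shape[1]
    group = heads // kv_heads
    k = np.repeat(k, group, axis=1)                  # share KV across each group
    v = np.repeat(v, group, axis=1)
    T, d = q.shape[0], q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    # causal mask: position t may only attend to positions s <= t
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)
```

Halving the KV heads halves the KV cache at eval time while keeping the full set of query heads.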
MLP3x
Three-times expansion MLP.
parameters: {"hidden":1536}
ReLU²
Squared ReLU activation in the MLP.
parameters: null
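The MLP3x and ReLU² entries combine into one small block: a 3x hidden expansion with a squared-ReLU activation. A sketch using hidden=1536 from the listing (which implies a model dim of 512, an inference rather than a stated value):

```python
import numpy as np

def relu2(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """3x-expansion MLP: dim -> 3*dim -> dim with squared-ReLU activation."""
    return relu2(x @ w_in) @ w_out

dim, hidden = 512, 1536                  # hidden = 3 * dim, per this PR
rng = np.random.default_rng(0)
w_in = rng.standard_normal((dim, hidden)) * 0.02
w_out = rng.standard_normal((hidden, dim)) * 0.02
y = mlp3x(rng.standard_normal((4, dim)), w_in, w_out)
```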
U-Net skip connections
Skip connections inspired by U-Net added to the transformer stack.
parameters: null
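One way U-Net-style skips are commonly wired into a transformer stack: save the output of each layer in the first half and add it to the input of its mirror layer in the second half. The exact pairing and any learned skip weighting in this PR are not specified; this is a structural sketch:

```python
import numpy as np

def unet_transformer(x, layers):
    """Run `layers` with U-Net skips: encoder-half outputs are added to the
    inputs of their mirrored decoder-half layers (last saved, first consumed)."""
    n = len(layers)
    skips = []
    for i, layer in enumerate(layers):
        if i >= n // 2 and skips:
            x = x + skips.pop()          # consume the mirrored encoder output
        x = layer(x)
        if i < n // 2:
            skips.append(x)              # save for the mirrored decoder layer
    return x
```

The long skips give later layers a direct path back to early-layer features, shortening gradient paths through the 10-layer stack.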
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
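A numpy sketch of orthogonal initialization via QR of a Gaussian matrix, with a muP-style down-scaled gain on residual output projections. The exact scale factor used by the PR is not stated; 1/sqrt(2 * n_layers) is a common convention assumed here:

```python
import numpy as np

def ortho_init(shape, rng, gain=1.0):
    """Orthogonal init: QR-decompose a Gaussian matrix and keep Q."""
    a = rng.standard_normal(shape)
    tall = shape[0] >= shape[1]
    q, r = np.linalg.qr(a if tall else a.T)
    q = q * np.sign(np.diag(r))          # fix QR's sign ambiguity
    return gain * (q if tall else q.T)

rng = np.random.default_rng(0)
n_layers, dim = 10, 512                  # 10 layers, per this PR
w_attn = ortho_init((dim, dim), rng)
# muP-style shrinkage of residual output projections (assumed factor):
w_proj = ortho_init((dim, dim), rng, gain=1.0 / np.sqrt(2 * n_layers))
```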
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"scope":"embeddings/scalars"}
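The two optimizer entries describe a split: Muon on the matrix weights, AdamW on embeddings and scalars. A sketch of the grouping logic, with hyperparameters copied from the listing; the name-based routing rule itself is an assumption:

```python
import numpy as np

def split_param_groups(named_params):
    """Route 2-D matrix weights to Muon and everything else (embeddings,
    scalars) to AdamW, mirroring this PR's optimizer split."""
    muon, adamw = [], []
    for name, p in named_params:
        (muon if p.ndim == 2 and 'embed' not in name else adamw).append(name)
    return (
        {'opt': 'Muon', 'lr': 0.02, 'momentum': 0.99,
         'weight_decay': 0.04, 'params': muon},
        {'opt': 'AdamW', 'weight_decay': 0.04, 'params': adamw},
    )

groups = split_param_groups([
    ('blocks.0.attn.w_q', np.zeros((512, 512))),
    ('embed.weight',      np.zeros((50304, 512))),
    ('smear_gate.scale',  np.zeros(())),
])
```

Muon's orthogonalized updates only make sense for matrix-shaped parameters, which is why vectors, scalars, and embeddings fall back to AdamW.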
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50,"checkpoints":24}
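Stochastic Weight Averaging with `start_frac=0.4` means only the trailing, most-converged checkpoints enter the average (the PR saves one every 50 steps, 24 in total). A minimal sketch over state dicts:

```python
import numpy as np

def swa_average(checkpoints, start_frac=0.4):
    """Average the trailing checkpoints, skipping the first `start_frac`
    of the run (0.4 here, matching the PR's listing)."""
    tail = checkpoints[int(len(checkpoints) * start_frac):]
    avg = {k: np.zeros_like(v, dtype=np.float64) for k, v in tail[0].items()}
    for ckpt in tail:
        for k, v in ckpt.items():
            avg[k] += v / len(tail)
    return avg
```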
Evaluation
sliding window eval
parameters: {"stride":64}
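Sliding-window evaluation with stride 64 slides the context window forward 64 tokens at a time and scores only the newest 64 tokens of each window, so every scored token sees close to a full window of left context. A sketch, where `per_token_nll` is a hypothetical model call returning one loss per position:

```python
import numpy as np

def sliding_window_eval(per_token_nll, tokens, window=2048, stride=64):
    """Stride-based eval: score the whole first window, then only the
    newest `stride` tokens of each subsequent window."""
    losses = list(per_token_nll(tokens[:window]))      # first window: score all
    for begin in range(stride, len(tokens) - window + 1, stride):
        ctx = tokens[begin:begin + window]
        losses.extend(per_token_nll(ctx)[-stride:])    # score only the new tail
    return float(np.mean(losses))
```

This trades roughly window/stride extra forward passes for a loss measured under near-full context at every position.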
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"warmup_steps":20}
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"sparsity":"3%"}
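Magnitude pruning at 3% sparsity zeroes the smallest-magnitude 3% of weights. A minimal sketch (ties at the threshold may zero slightly more than the target fraction):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    """Zero out the smallest-magnitude `sparsity` fraction of weights
    (3%, per this PR's listing)."""
    k = int(round(w.size * sparsity))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```

Pruned weights quantize and compress better, which dovetails with the int5/int6 scheme and zstd artifact compression above.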
Compression
zstd
level: null
Novel Contributions
- Mixed int5/int6 quantization with int5 applied to MLP weights and int6 to attention weights
- BigramHash enlarged to 10240 buckets to reduce token-pair collisions
- SWA with start_frac=0.4 using only the most converged checkpoints
- 10-layer model enabled by savings from int5 MLP quantization
- Single-GPU 4090 ablation documentation for the ValCalib GPTQ + XSA stack