PR #1361
Status: open
1.1220 bpb: GPTQ + EMA + XSA-all + BigramHash3072 (11L 512dim)
by jorge-asenjo
val_bpb: 1.1220
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.1 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
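The weight-tying entry can be sketched as one embedding matrix serving both the input lookup and the output head (a minimal NumPy sketch with illustrative sizes; the real model uses 512-dim embeddings inside a trained Transformer):

```python
import numpy as np

# One shared table used for both the input embedding and the output head.
rng = np.random.default_rng(0)
vocab, dim = 1024, 512
W_emb = rng.normal(size=(vocab, dim)).astype(np.float32)

def embed(tokens):
    return W_emb[tokens]          # input side: row lookup

def unembed(h):
    return h @ W_emb.T            # output side: reuse of the same matrix

logits = unembed(embed(np.array([1, 2, 3])))
```

Tying halves the embedding parameter count, which matters directly for the 15.1 MB artifact budget.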
BigramHash
Token-pair hash embeddings for richer input representation.
parameters: {"buckets":3072,"dims":112}
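A hedged sketch of the BigramHash idea: hash each (previous, current) token pair into one of 3072 buckets and look up a 112-dim embedding. The hash mixing constants and how the feature is combined with the token embedding are assumptions, not the PR's actual scheme:

```python
import numpy as np

BUCKETS, DIMS = 3072, 112   # from the PR's parameters
rng = np.random.default_rng(0)
bigram_table = rng.normal(size=(BUCKETS, DIMS)).astype(np.float32)

def bigram_features(tokens):
    # Hash each (previous, current) pair into a bucket; constants are illustrative.
    prev = np.concatenate(([0], tokens[:-1]))       # pad position 0
    buckets = (prev * 1000003 + tokens) % BUCKETS
    return bigram_table[buckets]                    # (seq_len, DIMS)

feats = bigram_features(np.array([10, 11, 12]))
```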
SmearGate
Learned token-level blending with previous position.
parameters: null
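SmearGate's "blend with the previous position" can be sketched as a sigmoid-gated mix of each position with its left neighbor; whether the learned gate is a scalar or per-channel is an assumption here:

```python
import numpy as np

def smear_gate(x, gate_logit):
    # Blend each position with the previous one via a learned scalar gate
    # (scalar gating is an assumption; it could be per-channel).
    g = 1.0 / (1.0 + np.exp(-gate_logit))     # sigmoid
    prev = np.concatenate([x[:1], x[:-1]])    # shift right; first row unchanged
    return (1 - g) * x + g * prev

x = np.arange(6, dtype=np.float32).reshape(3, 2)
y = smear_gate(x, gate_logit=0.0)             # g = 0.5: even mix
```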
U-Net skip connections
Encoder-decoder style skip connections across layers.
parameters: {"encoder_layers":5,"decoder_layers":6,"skip_weights":5}
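With 5 encoder layers, 6 decoder layers, and 5 skip weights, a plausible reading is that encoder outputs are stacked and the last five decoder layers each consume one, weighted and in reverse order. A sketch under that assumption, with a trivial stand-in for the transformer block:

```python
import numpy as np

def layer(x, i):
    return x + 0.01 * (i + 1)         # stand-in for a transformer block

def unet_forward(x, skip_w):
    # 5 encoder layers push outputs onto a stack; the last 5 of the 6
    # decoder layers pop one each (weighted, reverse order) -- hence
    # exactly 5 learned skip weights, matching the PR's parameters.
    stack = []
    for i in range(5):                # encoder half
        x = layer(x, i)
        stack.append(x)
    for j in range(6):                # decoder half
        if j >= 1:
            x = x + skip_w[j - 1] * stack.pop()
        x = layer(x, 5 + j)
    return x

out = unet_forward(np.zeros(4), np.full(5, 0.5))
```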
Value Embedding
Shared value embedding table injected into attention values at later layers.
parameters: {"layers":[9,10],"table_shape":"1024x128"}
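One way to read the Value Embedding entry: a shared 1024x128 table, indexed by token id, added into the attention value path only at layers 9 and 10. Indexing modulo the table size and adding into the first 128 value channels are both assumptions:

```python
import numpy as np

TABLE = np.random.default_rng(0).normal(size=(1024, 128)).astype(np.float32)
INJECT_LAYERS = {9, 10}               # later layers only, per the PR

def inject_value_embedding(v, tokens, layer_idx):
    # Add a shared per-token table into the first 128 value channels.
    # The mod-1024 indexing and slice placement are assumptions.
    if layer_idx not in INJECT_LAYERS:
        return v
    v = v.copy()
    v[:, :128] += TABLE[tokens % 1024]
    return v

v = np.zeros((3, 256), dtype=np.float32)
out = inject_value_embedding(v, np.array([5, 6, 7]), layer_idx=9)
```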
XSA
Exclusive Self-Attention applied to all 11 layers.
parameters: {"layers":11}
Partial RoPE
Rotary embeddings applied to a subset of head dimensions.
parameters: {"rotary_dims":16,"head_dims":64}
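Partial RoPE rotates only the first 16 of each 64-dim head and passes the rest through. A minimal sketch (which half-pairing convention the PR uses is an assumption):

```python
import numpy as np

def partial_rope(x, rotary_dims=16):
    # Rotate only the first `rotary_dims` channels of a (seq, 64) head;
    # the remaining 48 dims are untouched.
    seq = x.shape[0]
    half = rotary_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = np.arange(seq)[:, None] * freqs          # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rotary_dims]
    out = x.copy()
    out[:, :half] = x1 * cos - x2 * sin
    out[:, half:rotary_dims] = x1 * sin + x2 * cos
    return out

q = np.ones((4, 64), dtype=np.float32)
q_rot = partial_rope(q)
```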
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
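The GQA entry (8 query heads, 4 KV heads) amounts to each KV head serving two query heads, halving the KV cache. Sketch:

```python
import numpy as np

def gqa_attention(q, k, v):
    # 8 query heads share 4 KV heads: repeat each KV head for 2 query heads.
    rep = q.shape[0] // k.shape[0]               # 8 // 4 = 2
    k = np.repeat(k, rep, axis=0)
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)             # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 64))
k = rng.normal(size=(4, 5, 64))
v = rng.normal(size=(4, 5, 64))
out = gqa_attention(q, k, v)
```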
MLP3x
Three-times expanded MLP with LeakyReLU² activation.
parameters: {"activation":"LeakyReLU²"}
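A sketch of MLP3x, reading LeakyReLU² as the elementwise square of LeakyReLU (that interpretation is an assumption; a sign-preserving square is also possible):

```python
import numpy as np

def mlp3x(x, W1, W2, slope=0.01):
    # 3x hidden expansion; LeakyReLU(x)**2 activation (interpretation assumed).
    h = x @ W1                                   # dim -> 3*dim
    h = np.where(h > 0, h, slope * h) ** 2
    return h @ W2                                # 3*dim -> dim

dim = 512
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(dim, 3 * dim))
W2 = rng.normal(scale=0.02, size=(3 * dim, dim))
y = mlp3x(np.ones((2, dim)), W1, W2)
```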
Weight Averaging
EMA
parameters: {"decay":0.997}
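The EMA entry keeps an exponential moving average of the weights with decay 0.997 and evaluates the averaged copy:

```python
def ema_update(ema_params, params, decay=0.997):
    # Standard EMA step, applied per parameter tensor after each update.
    return {k: decay * ema_params[k] + (1 - decay) * params[k] for k in params}

ema = {"w": 0.0}
for step in range(3):                 # three steps toward a fixed target of 1.0
    ema = ema_update(ema, {"w": 1.0})
```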
Quantization
GPTQ
bits: 6
scope: MLP + attention weights
late QAT
bits: 6
scope: final warmdown phase
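For orientation, here is only the round-to-nearest 6-bit baseline that GPTQ improves on; real GPTQ quantizes columns sequentially and uses second-order (Hessian) information to compensate each column's rounding error, and the late-QAT phase here additionally trains through the quantizer:

```python
import numpy as np

def quantize_6bit(W):
    # Per-row symmetric 6-bit round-to-nearest quantization (the baseline
    # GPTQ improves on; no Hessian-based error compensation here).
    qmax = 2 ** (6 - 1) - 1                      # 31 for signed 6-bit
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

W = np.random.default_rng(0).normal(size=(8, 16)).astype(np.float32)
q, s = quantize_6bit(W)
W_hat = q * s                                    # dequantized reconstruction
```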
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32,"seq_len":2048}
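Sliding-window eval scores each token with up to seq_len-1 tokens of left context, advancing the window stride tokens at a time and counting only the newly covered positions. A sketch with a hypothetical `nll_fn` standing in for the model (small seq_len/stride for illustration; the PR uses 2048/32):

```python
def sliding_window_nll(nll_fn, tokens, seq_len=2048, stride=32):
    # `nll_fn` is a hypothetical model call returning per-token NLLs
    # for a window; each token is counted exactly once.
    losses = []
    pos = 0
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        lo = max(0, end - seq_len)
        per_tok = nll_fn(tokens[lo:end])      # score the whole window
        losses.extend(per_tok[pos - lo:])     # keep only the new tokens
        pos = end
    return sum(losses) / len(losses)

# Dummy "model" whose NLL is 1.0 everywhere.
mean_nll = sliding_window_nll(lambda w: [1.0] * len(w), list(range(100)),
                              seq_len=16, stride=4)
```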
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"ns_steps":5,"lr":0.025}
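Muon's core step orthogonalizes the momentum-accumulated gradient with a quintic Newton–Schulz iteration; ns_steps=5 above is the iteration count. A sketch of just that iteration, with coefficients from the public Muon reference implementation:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration: drives the singular values of the
    # normalized matrix toward 1, approximately orthogonalizing G.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)           # normalize spectral scale
    for _ in range(steps):                       # ns_steps = 5 in this PR
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.random.default_rng(0).normal(size=(32, 64))
O = newton_schulz(G)                             # singular values near 1
```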
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":3200}
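The schedule above is trapezoidal: 20 steps of linear warmup, a flat middle at the base LR, then a linear warmdown over the final 3200 iterations. A sketch (the total step count is hypothetical, not stated in the PR):

```python
def lr_at(step, total_steps, base_lr=0.025, warmup_steps=20, warmdown_iters=3200):
    # Trapezoid: linear warmup -> flat -> linear warmdown to zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr

lrs = [lr_at(s, total_steps=5000) for s in range(5000)]  # total is illustrative
```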
Regularization
LN scale
parameters: {"rule":"1/sqrt(layer_idx+1)"}
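The LN scale rule sets each layer's norm gain to 1/sqrt(layer_idx+1), damping deeper layers' residual contribution; exactly which norms receive the scale (pre-attention vs pre-MLP) is an assumption:

```python
import math

def ln_scale_init(layer_idx):
    # Depth-dependent norm gain per the PR's rule: 1/sqrt(layer_idx + 1).
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale_init(i) for i in range(11)]   # one per layer (11L model)
```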
logit softcap
parameters: {"value":30}
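Logit softcapping with value 30 smoothly bounds the logits to (-30, 30) via a scaled tanh, behaving identically near zero but saturating for extreme values:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly bound logits to (-cap, cap); ~identity for |logits| << cap.
    return cap * np.tanh(logits / cap)

x = np.array([-1000.0, 0.0, 15.0, 1000.0])
y = softcap(x)
```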
Novel Contributions
- XSA applied to all 11 layers
- BigramHash token-pair embeddings with 3072 buckets
- GPTQ-based Hessian quantization with late QAT
- EMA-weighted final model
- Value Embedding injected into later attention layers
- U-Net skip connections combined with partial RoPE and SmearGate