val_bpb: 1.1510
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.1 MB
Training Techniques

Architecture

- BigramHash: hashes each (previous token, current token) pair into a learned bigram embedding table; the looked-up vector is added to the token embedding. Parameters: {"bigram_vocab": 10240, "bigram_dim": 128}
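The card gives only the table sizes, but the BigramHash idea can be sketched roughly as follows. The hash mixing constant, the position-0 handling, and all names are illustrative assumptions; only `bigram_vocab` and `bigram_dim` come from the parameters above.

```python
import numpy as np

BIGRAM_VOCAB = 10240   # size of the hashed bigram table (from the card)
BIGRAM_DIM = 128       # width of each bigram embedding (from the card)

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(BIGRAM_VOCAB, BIGRAM_DIM))

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    """Mix a token pair into one bucket; the odd multiplier is arbitrary."""
    return (prev_tok * 1000003 + cur_tok) % BIGRAM_VOCAB

def bigram_embed(tokens):
    """Return one hashed-bigram vector per position.

    Position 0 is paired with an assumed BOS id of 0; the result would be
    added to the ordinary token embeddings."""
    out = np.zeros((len(tokens), BIGRAM_DIM))
    prev = 0
    for i, t in enumerate(tokens):
        out[i] = bigram_table[bigram_hash(prev, t)]
        prev = t
    return out

emb = bigram_embed([5, 17, 5, 17])
```

Note that hashing makes the table size independent of the true bigram vocabulary: distinct pairs may collide, trading a little accuracy for a fixed parameter budget.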
- SmearGate: a learned per-dimension sigmoid gate blends each token embedding with the previous token's embedding. Parameters: none.
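A minimal sketch of what such a gate could look like. The convex blend form and the choice to leave the first position unchanged are guesses; the card only states that a per-dimension sigmoid gate mixes in the previous embedding.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x, gate_logits):
    """x: (seq, dim) token embeddings; gate_logits: (dim,) learned parameters.

    Blends each position with its predecessor, per dimension; the first
    position has no predecessor and is left as-is (an assumption)."""
    g = sigmoid(gate_logits)                  # per-dimension gate in (0, 1)
    out = x.copy()
    out[1:] = (1.0 - g) * x[1:] + g * x[:-1]
    return out

x = np.arange(8, dtype=float).reshape(4, 2)   # toy (seq=4, dim=2) input
y = smear_gate(x, np.zeros(2))                # zero logits -> gate of 0.5
```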
- MLP3x: wider feed-forward network with 3x the standard MLP width. Parameters: {"mlp_mult": 3}
- GQA: grouped-query attention with fewer KV heads than query heads. Parameters: {"num_heads": 8, "num_kv_heads": 4}
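A shape-level sketch of grouped-query attention with the head counts above (8 query heads sharing 4 KV heads, so 2 query heads per KV head). Causal masking and the linear projections are omitted; the head dimension and all array shapes here are illustrative.

```python
import numpy as np

def gqa(q, k, v, num_heads=8, num_kv_heads=4):
    """q: (seq, num_heads, hd); k, v: (seq, num_kv_heads, hd).

    Each KV head is repeated for its group of query heads, then standard
    scaled dot-product attention runs per head. No causal mask, for brevity."""
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=1)           # (seq, num_heads, hd)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', weights, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8, 16))
k = rng.normal(size=(5, 4, 16))
v = rng.normal(size=(5, 4, 16))
out = gqa(q, k, v)
```

The payoff is a smaller KV cache and fewer KV projection parameters, since K and V are stored for 4 heads instead of 8.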
- RoPE: rotary positional embeddings. Parameters: {"rope_base": 10000}
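A compact numpy sketch of rotary embeddings with base 10000, as listed above. This uses the split-halves channel pairing, which is one of the two common conventions; the card does not say which one the submission uses.

```python
import numpy as np

def rope(x, base=10000.0):
    """x: (seq, dim) with even dim. Rotates channel pair (i, i + dim/2) at
    position p by angle p * base**(-i / (dim/2))."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    angles = np.outer(np.arange(seq), freqs)    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(1).normal(size=(4, 8))
y = rope(x)
```

Because each channel pair is rotated, RoPE preserves vector norms and leaves position 0 untouched, which the assertions below check.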
- U-Net skip connections: skip connections between corresponding blocks in the first and second halves of the block stack, U-Net style. Parameters: none.
Quantization

- STE QAT: quantization-aware training that uses a straight-through estimator to pass gradients through the rounding step. Bits: 6. Scope: all large weight matrices.
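The forward-pass quantizer for 6-bit symmetric fake quantization can be sketched as below. In QAT the straight-through estimator makes the backward pass treat the rounding as identity, which plain numpy cannot express, so only the forward quantize/dequantize step is shown; the per-tensor absmax scaling is an assumption.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Quantize w to symmetric int levels, then dequantize back to float.

    With bits=6 the levels run from -32 to 31, so the round-trip error per
    element is at most half a quantization step."""
    qmax = 2 ** (bits - 1) - 1                # 31 for 6 bits
    scale = np.abs(w).max() / qmax            # per-tensor absmax scale (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.31, -0.8, 0.05, 0.62])
wq = fake_quant(w)
```

Training against `wq` in the forward pass lets the network adapt to the 6-bit grid, so the shipped artifact can store the integer codes losslessly.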
Compression

- zstd: compression level 22.
Weight Averaging

- SWA: stochastic weight averaging over checkpoints from the final training steps. Parameters: {"final_steps": 600, "snapshot_interval": 50}
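With the parameters above this amounts to averaging roughly a dozen snapshots taken every 50 steps over the last 600 steps. A toy sketch of the averaging step; the parameter-dict checkpoint format is an illustrative stand-in.

```python
import numpy as np

def average_snapshots(snapshots):
    """snapshots: list of dicts mapping parameter name -> array.

    Returns the element-wise mean of each parameter across snapshots."""
    n = len(snapshots)
    return {name: sum(s[name] for s in snapshots) / n for name in snapshots[0]}

# Toy example: three snapshots of a single 2-element parameter.
snaps = [{"w": np.array([1.0, 2.0])},
         {"w": np.array([3.0, 4.0])},
         {"w": np.array([5.0, 6.0])}]
avg = average_snapshots(snaps)
```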
Evaluation

- Sliding window eval: evaluation over overlapping windows so most tokens are scored with near-full left context. Parameters: {"stride": 64}
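With stride 64, sliding-window evaluation re-runs the model on overlapping windows and scores only the tokens not covered by an earlier window. A sketch of the span bookkeeping, assuming the window equals the 1024-token training length (the card does not state the eval window explicitly).

```python
def window_spans(n_tokens, window=1024, stride=64):
    """Yield (begin, end, n_scored) spans covering n_tokens tokens.

    Each window of length `window` advances by `stride`; only the tokens past
    the previous window's end are scored, so every token is scored once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = window_spans(2048)
```

The cost is one forward pass per stride rather than per window, which is why a small stride like 64 is affordable only at evaluation time.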
Regularization

- Logit softcap: squashes output logits through a scaled tanh so their magnitude never exceeds the cap. Parameters: {"value": 30}
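The standard softcap form with the cap of 30 listed above: near-identity for small logits, smoothly saturating at the cap.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """cap * tanh(logits / cap): bounded to (-cap, cap), ~identity near zero."""
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -5.0, 0.0, 5.0, 100.0])
y = softcap(x)
```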
Initialization

- OrthoInit: orthogonal weight initialization.
Optimizer

- Muon. Weight decay: 0.04. Momentum: null. Muon is applied to the matrix parameters; embeddings and scalars are optimized with Adam.
LR Schedule

- Warmdown: warmup followed by a late linear warmdown. Parameters: {"warmup_steps": 20, "warmdown_steps": 1200}
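One common shape consistent with the parameters above: linear warmup over the first 20 steps, a flat middle, then linear decay to zero over the final 1200 steps. The total step count and the exact interpolation are assumptions for illustration.

```python
def lr_at(step, total_steps=3000, warmup_steps=20, warmdown_steps=1200, base_lr=1.0):
    """Piecewise-linear schedule: ramp up, hold, then ramp down to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return base_lr * (total_steps - step) / warmdown_steps
    return base_lr
```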
Sequence Length

- train_length: 1024, eval_length: null
Novel Contributions
- BigramHash embedding
- SmearGate
- Int6 QAT with STE
- zstd-22 artifact compression
- SWA over final checkpoints
- sliding window evaluation