PR #465

Status: open

Record: 10L d=512 Int5-MLP Int6-Attn sp1024 (val_bpb=1.1508)

by LoquiAurisView on GitHub
val_bpb: 1.1508
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,680,288 bytes

Training Techniques

Architecture
  • SmearGate: learned blend with the previous token representation. parameters: null
  • BigramHash: bigram hash feature with 4096 buckets projected to model width. parameters: {"buckets":4096,"dim":128}
  • MLP3x: 3x FFN expansion with ReLU² activation. parameters: {"hidden":1536}
  • Tied embeddings: input and output embeddings are tied via a linear projection. parameters: null
  • KV head count: grouped-query attention with fewer KV heads than attention heads. parameters: {"attention_heads":8,"kv_heads":4}
  • U-Net skip connections: skip connections between symmetric layer pairs. parameters: null
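Since SmearGate and BigramHash are introduced here, a minimal NumPy sketch of how such components could work may help. The gating form (a per-dimension sigmoid blend with the previous token) and the hash function are assumptions for illustration, not the PR's actual code:

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token with the previous token's representation.

    x: (T, d) token representations; gate_logits: (d,) learned per-dim logits.
    Assumed form: out_t = x_t + sigmoid(gate) * x_{t-1}.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))      # sigmoid gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                               # first token has no predecessor
    return x + g * prev

def bigram_hash_features(ids, table, proj, buckets=4096):
    """Map each (prev, cur) token-id bigram to a hashed bucket embedding.

    ids: (T,) int token ids; table: (buckets, 128); proj: (128, d_model).
    The mixing constant 1000003 is an arbitrary illustrative choice.
    """
    prev = np.concatenate([[0], ids[:-1]])
    h = (prev * 1000003 + ids) % buckets        # simple bigram hash
    return table[h] @ proj                      # (T, d_model)

T, d = 8, 512
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
out = smear_gate(x, np.zeros(d))                # zero logits -> gate = 0.5
feats = bigram_hash_features(np.arange(T), rng.standard_normal((4096, 128)),
                             rng.standard_normal((128, d)))
```

The BigramHash output would be added to (or concatenated with) the token embedding stream before the transformer blocks.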
Initialization
  • OrthoInit: orthogonal initialization.
Quantization
  • int5: bits: 5, scope: MLP
  • int6: bits: 6, scope: attention
  • int6: bits: 6, scope: embeddings
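A minimal sketch of low-bit weight quantization, assuming a symmetric per-tensor scheme (the PR's actual scheme, e.g. per-channel scales or a different rounding mode, is not specified here):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                  # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax              # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q5, s5 = quantize_symmetric(w, 5)               # MLP weights: int5
q6, s6 = quantize_symmetric(w, 6)               # attention/embeddings: int6
err5 = np.abs(dequantize(q5, s5) - w).max()     # worst-case error <= scale / 2
err6 = np.abs(dequantize(q6, s6) - w).max()
```

Going from 6 to 5 bits halves the grid resolution, which is why the coarser int5 grid is reserved for the (more error-tolerant) MLP weights.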
Optimizer
  • Muon: weight_decay: 0.04, momentum: 0.99, other_params: {"matrix_lr":0.02,"warmup_momentum":0.92,"warmup_steps":1500}
  • AdamW: weight_decay: 0.01, momentum: null, other_params: {"scope":"embeddings and scalars"}
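The core Muon update is momentum accumulation followed by approximate orthogonalization of the update matrix via a Newton-Schulz iteration. The sketch below uses the quintic coefficients from the public Muon reference implementation; the warmup_momentum/warmup_steps schedule and this PR's exact variant are not shown:

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315           # reference-impl coefficients
    X = G / (np.linalg.norm(G) + eps)           # normalize so iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                              # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One illustrative Muon update for a 2-D weight matrix."""
    buf = momentum * buf + grad                 # momentum accumulation
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz_orth(buf)
    return w, buf

O = newton_schulz_orth(np.random.default_rng(0).standard_normal((16, 32)))
```

AdamW handles the embeddings and scalar parameters, for which a matrix orthogonalization step is not meaningful.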
Weight Averaging
  • SWA: parameters: {"start_frac":0.5,"checkpoint_every":50,"num_checkpoints":29}
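SWA here keeps an equal-weight running average of checkpoints from the second half of training (start_frac 0.5), sampled every 50 steps. A minimal sketch; the exact step bounds that yield the record's 29 checkpoints are not specified, so the loop below is illustrative:

```python
import numpy as np

class SWA:
    """Running equal-weight average of checkpoint weights (illustrative)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, w):
        self.n += 1
        if self.avg is None:
            self.avg = w.astype(np.float64).copy()
        else:
            self.avg += (w - self.avg) / self.n  # incremental mean
        return self.avg

swa = SWA()
for step in range(3000):                         # assumed ~3000-step run
    w = np.full(4, float(step))                  # stand-in "checkpoint" weights
    if step >= 1500 and step % 50 == 0:          # start_frac=0.5, every 50 steps
        swa.update(w)
```

The averaged weights, not the final-step weights, are what get quantized and shipped in the artifact.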
Compression
  • zstd: level: 22
Evaluation
  • sliding window eval: parameters: {"stride":64,"context_length":2048}
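Sliding-window eval scores every token with near-maximal left context by advancing a 2048-token window in steps of 64 and only counting the newly exposed tokens. The windowing convention below is an assumption matching common perplexity recipes, not taken from the PR:

```python
def sliding_window_spans(n_tokens, context_length=2048, stride=64):
    """Yield (window_start, window_end, score_start) spans.

    Each window covers at most `context_length` tokens; only tokens in
    [score_start, window_end) are scored, so every token is counted exactly
    once with up to context_length - stride tokens of left context.
    """
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context_length)
        spans.append((start, end, pos))
        pos = end
    return spans

spans = sliding_window_spans(5000)
```

This is more expensive than chunked evaluation (each token is re-encoded many times) but gives a lower, more faithful bpb.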
Sequence Length
  • sequence_length: train_length: 2048, eval_length: 2048
LR Schedule
  • warmup + warmdown cosine decay: parameters: {"warmup_steps":20,"warmdown_steps":3000}
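The schedule shape implied by these parameters can be sketched as linear warmup followed by cosine decay over the final warmdown_steps. The total step count and the base LR (reusing the Muon matrix_lr of 0.02) are assumed example values:

```python
import math

def lr_schedule(step, base_lr=0.02, warmup_steps=20,
                warmdown_steps=3000, total_steps=3020):
    """Linear warmup, then cosine decay to zero over the last warmdown_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps       # linear ramp
    if step >= total_steps - warmdown_steps:
        t = (step - (total_steps - warmdown_steps)) / warmdown_steps
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))  # cosine warmdown
    return base_lr                                       # flat middle, if any
```

With 20 warmup steps against 3000 warmdown steps, nearly the whole run is spent in the decay phase.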
Regularization
  • weight decay: parameters: {"muon":0.04,"adamw":0.01}
Other
  • other: BigramHash features and SmearGate combined in the PR #162 transformer stack with RoPE, RMSNorm, logit softcap, and GQA. parameters: {"layers":10,"d_model":512,"vocab_size":1024}
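Two of the stack components named above, RMSNorm and logit softcap, have standard forms worth a brief NumPy sketch. The cap value 15.0 is an assumed example; the PR does not state its value:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale features by the reciprocal RMS, then a learned gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def softcap_logits(logits, cap=15.0):
    """Logit softcap: squash logits smoothly into (-cap, cap) via tanh."""
    return cap * np.tanh(logits / cap)

x = np.random.default_rng(0).standard_normal((4, 512))
y = rms_norm(x, np.ones(512))
```

The softcap keeps output logits bounded, which interacts well with low-bit quantization of the final projection.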

Novel Contributions

  • Int5 quantization for MLP weights with Int6 quantization for attention weights under a 16 MB artifact budget.
  • Demonstration that sp1024 with 10 layers at d=512 outperformed larger-vocabulary sp8192 configurations.
  • Discovery that embedding tables can be quantized to Int6 with negligible quality loss.
  • Introduction of SmearGate and BigramHash within the PR #162 transformer stack.
  • Systematic architecture search across tokenizer sizes, widths, and depths with local Apple Silicon ablations and H100 confirmation.