PR #465

Status: open

Record: 10L d=512 Int5-MLP Int6-Attn sp1024 (val_bpb=1.1508)

by LoquiAurisView on GitHub
val_bpb: 1.1508
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,680,288 bytes

Training Techniques

Architecture
  • SmearGate: learned blend with the previous token representation. parameters: null
  • BigramHash: bigram hash feature with 4096 buckets projected to model width. parameters: {"buckets":4096,"dim":128}
  • MLP3x: 3x FFN expansion with ReLU² activation. parameters: {"hidden":1536}
  • Tied embeddings: input and output embeddings are tied via a linear projection. parameters: null
  • KV head count: grouped-query attention with fewer KV heads than attention heads. parameters: {"attention_heads":8,"kv_heads":4}
  • U-Net skip connections: skip connections between symmetric layer pairs. parameters: null
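Since SmearGate and BigramHash are introduced here, a minimal NumPy sketch of how such components could work may help. The gating form (a per-dimension sigmoid blend with the previous token) and the hash function are assumptions for illustration, not the PR's actual code:

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token with the previous token's representation.

    x: (T, d) token representations; gate_logits: (d,) learned per-dim logits.
    Assumed form: out_t = x_t + sigmoid(gate) * x_{t-1}.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))      # sigmoid gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                               # first token has no predecessor
    return x + g * prev

def bigram_hash_features(ids, table, proj, buckets=4096):
    """Map each (prev, cur) token-id bigram to a hashed bucket embedding.

    ids: (T,) int token ids; table: (buckets, 128); proj: (128, d_model).
    The mixing constant 1000003 is an arbitrary illustrative choice.
    """
    prev = np.concatenate([[0], ids[:-1]])
    h = (prev * 1000003 + ids) % buckets        # simple bigram hash
    return table[h] @ proj                      # (T, d_model)

T, d = 8, 512
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
out = smear_gate(x, np.zeros(d))                # zero logits -> gate = 0.5
feats = bigram_hash_features(np.arange(T), rng.standard_normal((4096, 128)),
                             rng.standard_normal((128, d)))
```

The BigramHash output would be added to (or concatenated with) the token embedding stream before the transformer blocks.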
Initialization
  • OrthoInit: orthogonal initialization.
Quantization
  • int5: bits: 5, scope: MLP
  • int6: bits: 6, scope: attention
  • int6: bits: 6, scope: embeddings
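A minimal sketch of low-bit weight quantization, assuming a symmetric per-tensor scheme (the PR's actual scheme, e.g. per-channel scales or a different rounding mode, is not specified here):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                  # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax              # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q5, s5 = quantize_symmetric(w, 5)               # MLP weights: int5
q6, s6 = quantize_symmetric(w, 6)               # attention/embeddings: int6
err5 = np.abs(dequantize(q5, s5) - w).max()     # worst-case error <= scale / 2
err6 = np.abs(dequantize(q6, s6) - w).max()
```

Going from 6 to 5 bits halves the grid resolution, which is why the coarser int5 grid is reserved for the (more error-tolerant) MLP weights.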
Optimizer
  • Muon: weight_decay: 0.04, momentum: 0.99, other_params: {"matrix_lr":0.02,"warmup_momentum":0.92,"warmup_steps":1500}
  • AdamW: weight_decay: 0.01, momentum: null, other_params: {"scope":"embeddings and scalars"}
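The core Muon update is momentum accumulation followed by approximate orthogonalization of the update matrix via a Newton-Schulz iteration. The sketch below uses the quintic coefficients from the public Muon reference implementation; the warmup_momentum/warmup_steps schedule and this PR's exact variant are not shown:

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315           # reference-impl coefficients
    X = G / (np.linalg.norm(G) + eps)           # normalize so iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                              # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One illustrative Muon update for a 2-D weight matrix."""
    buf = momentum * buf + grad                 # momentum accumulation
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz_orth(buf)
    return w, buf

O = newton_schulz_orth(np.random.default_rng(0).standard_normal((16, 32)))
```

AdamW handles the embeddings and scalar parameters, for which a matrix orthogonalization step is not meaningful.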
Weight Averaging
  • SWA: parameters: {"start_frac":0.5,"checkpoint_every":50,"num_checkpoints":29}
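SWA here keeps an equal-weight running average of checkpoints from the second half of training (start_frac 0.5), sampled every 50 steps. A minimal sketch; the exact step bounds that yield the record's 29 checkpoints are not specified, so the loop below is illustrative:

```python
import numpy as np

class SWA:
    """Running equal-weight average of checkpoint weights (illustrative)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, w):
        self.n += 1
        if self.avg is None:
            self.avg = w.astype(np.float64).copy()
        else:
            self.avg += (w - self.avg) / self.n  # incremental mean
        return self.avg

swa = SWA()
for step in range(3000):                         # assumed ~3000-step run
    w = np.full(4, float(step))                  # stand-in "checkpoint" weights
    if step >= 1500 and step % 50 == 0:          # start_frac=0.5, every 50 steps
        swa.update(w)
```

The averaged weights, not the final-step weights, are what get quantized and shipped in the artifact.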
Compression
  • zstd: level: 22
Evaluation
  • sliding window eval: parameters: {"stride":64,"context_length":2048}
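Sliding-window eval scores every token with near-maximal left context by advancing a 2048-token window in steps of 64 and only counting the newly exposed tokens. The windowing convention below is an assumption matching common perplexity recipes, not taken from the PR:

```python
def sliding_window_spans(n_tokens, context_length=2048, stride=64):
    """Yield (window_start, window_end, score_start) spans.

    Each window covers at most `context_length` tokens; only tokens in
    [score_start, window_end) are scored, so every token is counted exactly
    once with up to context_length - stride tokens of left context.
    """
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context_length)
        spans.append((start, end, pos))
        pos = end
    return spans

spans = sliding_window_spans(5000)
```

This is more expensive than chunked evaluation (each token is re-encoded many times) but gives a lower, more faithful bpb.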
Sequence Length
  • sequence_length: train_length: 2048, eval_length: 2048
LR Schedule
  • warmup + warmdown cosine decay: parameters: {"warmup_steps":20,"warmdown_steps":3000}
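The schedule shape implied by these parameters can be sketched as linear warmup followed by cosine decay over the final warmdown_steps. The total step count and the base LR (reusing the Muon matrix_lr of 0.02) are assumed example values:

```python
import math

def lr_schedule(step, base_lr=0.02, warmup_steps=20,
                warmdown_steps=3000, total_steps=3020):
    """Linear warmup, then cosine decay to zero over the last warmdown_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps       # linear ramp
    if step >= total_steps - warmdown_steps:
        t = (step - (total_steps - warmdown_steps)) / warmdown_steps
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))  # cosine warmdown
    return base_lr                                       # flat middle, if any
```

With 20 warmup steps against 3000 warmdown steps, nearly the whole run is spent in the decay phase.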
Regularization
  • weight decay: parameters: {"muon":0.04,"adamw":0.01}
Other
  • other: BigramHash features and SmearGate combined in the PR #162 transformer stack with RoPE, RMSNorm, logit softcap, and GQA. parameters: {"layers":10,"d_model":512,"vocab_size":1024}
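Two of the stack components named above, RMSNorm and logit softcap, have standard forms worth a brief NumPy sketch. The cap value 15.0 is an assumed example; the PR does not state its value:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale features by the reciprocal RMS, then a learned gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def softcap_logits(logits, cap=15.0):
    """Logit softcap: squash logits smoothly into (-cap, cap) via tanh."""
    return cap * np.tanh(logits / cap)

x = np.random.default_rng(0).standard_normal((4, 512))
y = rms_norm(x, np.ones(512))
```

The softcap keeps output logits bounded, which interacts well with low-bit quantization of the final projection.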

Novel Contributions

  • Int5 quantization for MLP weights with Int6 quantization for attention weights under a 16 MB artifact budget.
  • Demonstration that sp1024 with 10 layers at d=512 outperformed larger-vocabulary sp8192 configurations.
  • Discovery that embedding tables can be quantized to Int6 with negligible quality loss.
  • Introduction of SmearGate and BigramHash within the PR #162 transformer stack.
  • Systematic architecture search across tokenizer sizes, widths, and depths with local Apple Silicon ablations and H100 confirmation.