PR #678

open

Attention Warm-Start: Initializing Q/K from Bigram Co-occurrence SVD

by SPTholeView on GitHub
val_bpb
1.3525
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.15MB

Training Techniques

Initialization
SVD-based attention warm-start
Initializes the layer-0 W_Q and W_K matrices from bigram co-occurrence statistics via PMI-like preprocessing, a random projection, and SVD, so that initial attention scores reflect token co-occurrence structure.
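
A minimal sketch of the idea, assuming a PyTorch setup; the function name, the qk_dim split, and the use of positive PMI are illustrative reconstructions, not the submission's actual code:

    import torch

    def svd_qk_warmstart(bigram_counts, model_dim, qk_dim, eps=1e-8, seed=0):
        # PMI-like preprocessing: positive PMI over the vocab x vocab bigram counts.
        joint = bigram_counts.float() + eps
        joint = joint / joint.sum()
        p_prev = joint.sum(dim=1, keepdim=True)   # P(previous token)
        p_next = joint.sum(dim=0, keepdim=True)   # P(next token)
        ppmi = torch.log(joint / (p_prev * p_next)).clamp_min(0.0)

        # Random projection from vocab size down to model_dim on both sides,
        # so the SVD runs on a model_dim x model_dim matrix.
        g = torch.Generator().manual_seed(seed)
        vocab = ppmi.shape[0]
        R = torch.randn(vocab, model_dim, generator=g) / vocab ** 0.5
        M = R.T @ ppmi @ R

        # SVD; split the top qk_dim components between Q and K so that
        # W_Q @ W_K.T approximates M, i.e. early attention logits track co-occurrence.
        U, S, Vh = torch.linalg.svd(M)
        s = S[:qk_dim].sqrt()
        W_Q = U[:, :qk_dim] * s              # (model_dim, qk_dim)
        W_K = Vh[:qk_dim, :].T * s           # (model_dim, qk_dim)
        return W_Q, W_K

The submission additionally distributes components across heads and rescales norms; see the sketch after the Novel Contributions list.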
Architecture
tied embeddings
Token embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
MLP3x
Uses a 3.0x MLP expansion (hidden size 1536 for model_dim 512).
parameters: {"mlp_mult":3,"hidden":1536}
RoPE
Uses full rotary positional embeddings.
parameters: {"rope_dims":64,"rope_base":10000}
skip connections
Uses U-Net style skip connections with encoder/decoder structure.
parameters: {"encoder":5,"decoder":6}
shared last layer
11-layer model with the last layer shared.
parameters: {"layers":11}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.025,"cyclic_momentum":"0.85-0.95","warmup_momentum":"0.92","warmup_steps":20}
AdamW
weight_decay: null
momentum: null
other_params: {"lr":"0.035/0.025","scope":"embeds/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.2,"every":50}
Quantization
mixed int5/int6/int8
bits: null
scope: MLP int5, attention int6, bigram embeddings int6, token embeddings int8
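
The packing reduces to symmetric integer quantization at a different bit width per weight group (int5 MLP, int6 attention and bigram embeddings, int8 token embeddings); a per-output-channel sketch (illustrative):

    import torch

    def quantize_symmetric(w, bits):
        # Per-output-channel symmetric quantization of a (out, in) weight to `bits`.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / qmax
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
        return q, scale   # dequantize with q.float() * scale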
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":64}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3500}
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
AWQ-style activation-aware calibration scales weight columns by activation importance before quantization and folds the compensating inverse scale into the preceding LayerNorm.
parameters: {"calibration_batches":8,"alpha":0.5}
other
Cyclic Muon momentum uses a triangle wave between 0.85 and 0.95 with period 50 steps.
parameters: {"min":0.85,"max":0.95,"period":50}

Novel Contributions

  • Initializes layer-0 attention Q/K matrices from bigram co-occurrence statistics using PMI-like preprocessing and SVD.
  • Uses a random projection plus SVD to map co-occurrence structure into the model dimension for attention warm-starting.
  • Assigns SVD components to different heads to encourage head diversity across frequency bands.
  • Applies scale normalization so the initialized Q/K norms match default orthogonal initialization.
  • Combines AWQ with mixed-precision quantization for the final artifact.
  • Uses cyclic Muon momentum during training.
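
A sketch combining the last two initialization details: strided assignment of SVD components across heads so different heads see different frequency bands, then per-head rescaling so Q/K norms match a default orthogonal init. This is a hypothetical reconstruction, not the submission's code:

    import torch

    def assign_heads_and_rescale(W_Q, W_K, num_heads):
        # W_Q, W_K: (model_dim, total_head_dim), columns ordered by singular value.
        # Strided assignment spreads components across heads instead of giving
        # head 0 all the top components.
        model_dim, total = W_Q.shape
        head_dim = total // num_heads
        order = torch.arange(total).reshape(head_dim, num_heads).T.reshape(-1)
        W_Q, W_K = W_Q[:, order], W_K[:, order]

        # Match the Frobenius norm of a default orthogonal init per head.
        ref = torch.nn.init.orthogonal_(torch.empty(model_dim, head_dim))
        for h in range(num_heads):
            cols = slice(h * head_dim, (h + 1) * head_dim)
            for W in (W_Q, W_K):
                W[:, cols] *= ref.norm() / W[:, cols].norm().clamp_min(1e-12)
        return W_Q, W_K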