PR #665
Add LLMAdvisor submission: 1.14638 BPB (track_10min_16mb)
Status: open · by harborglowvintage-oss
val_bpb: 1.1464
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,736,555 bytes
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: MLP weights int5; attention weights int6; embeddings and last-layer key projections kept in FP16
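The mixed-precision scheme can be sketched as symmetric uniform quantization with a per-tensor FP scale, using 5 bits for MLP weights and 6 for attention weights. A minimal illustration only; the submission's actual packing, grouping, and calibration are not shown in the PR:

```python
def quantize_symmetric(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits.

    One FP scale is stored per tensor; dequantize as code * scale.
    """
    qmax = 2 ** (bits - 1) - 1                # 15 for int5, 31 for int6
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

w = [0.31, -0.07, 0.12, -0.29]
codes, scale = quantize_symmetric(w, bits=5)   # int5, as for MLP weights
w_hat = dequantize(codes, scale)
```

At int5 the reconstruction error per weight is bounded by half the scale, which is the trade the submission makes to fit 10 layers under 16MB.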
Architecture
BigramHash
Hashes consecutive token pairs into a learned embedding table to capture local bigram context.
parameters: {"dimensions":128,"buckets":10240}
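The bucket lookup can be sketched as follows. The multiplicative hash is an assumption for illustration; the PR states only that consecutive pairs are hashed into a 10240-bucket, 128-dimensional table:

```python
def bigram_bucket(prev_id, cur_id, buckets=10240):
    """Hash a consecutive token pair to a row of the bigram embedding table.

    The 1000003 multiplier is a hypothetical choice, not the submission's.
    """
    return (prev_id * 1000003 + cur_id) % buckets

def bigram_rows(token_ids, buckets=10240):
    """Bucket index per position; position 0 is paired with an assumed padding id 0."""
    rows, prev = [], 0
    for t in token_ids:
        rows.append(bigram_bucket(prev, t, buckets))
        prev = t
    return rows
```

At each position the selected 128-dim row would then be combined with the token embedding before the first block, giving the model order-sensitive pair features at negligible compute cost.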
SmearGate
Learned per-dimension gate blending current and previous token embeddings.
parameters: null
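Since the PR lists no parameters for SmearGate, here is one plausible reading: a per-dimension sigmoid gate that mixes the current embedding with the previous token's. The sigmoid parameterization is an assumption:

```python
import math

def smear_gate(curr, prev, gate_logits):
    """Blend current and previous token embeddings with a per-dimension gate.

    out[d] = g[d] * curr[d] + (1 - g[d]) * prev[d], with g = sigmoid(learned logit).
    """
    out = []
    for c, p, z in zip(curr, prev, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))
        out.append(g * c + (1.0 - g) * p)
    return out
```

A logit of 0 gives an even 50/50 blend; large positive logits let a dimension ignore the previous token entirely, so the layer can learn which channels benefit from smearing.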
tied embeddings
Input and output embeddings are tied and stored in FP16.
parameters: null
KV head count
Uses grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
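The head-sharing map implied by 8 query heads over 4 KV heads can be sketched as below, assuming the usual GQA convention of consecutive grouping:

```python
def kv_head_index(q_head, n_heads=8, n_kv_heads=4):
    """Map a query head to the KV head it shares under grouped-query attention.

    With 8 query heads and 4 KV heads, each KV head serves 2 consecutive
    query heads, halving the KV projection parameters and cache.
    """
    assert n_heads % n_kv_heads == 0
    return q_head // (n_heads // n_kv_heads)
```

Halving the KV heads is a natural fit for this track: it shrinks both the artifact (fewer K/V projection weights) and the per-step memory traffic.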
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02}
AdamW
weight_decay: null
momentum: null
other_params: {"lr":0.02,"scope":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"every":30,"start_frac":0.5,"num_averaged_checkpoints":49}
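The averaging itself reduces to an incremental running mean over collected checkpoints; a minimal sketch of the stated config:

```python
def swa_update(avg, ckpt, n):
    """Fold the n-th collected checkpoint into the running SWA mean:
    avg <- avg + (ckpt - avg) / n."""
    return [a + (c - a) / n for a, c in zip(avg, ckpt)]

# With every=30 and start_frac=0.5, checkpoint k enters the average at step
# start + 30*k, so 49 averaged checkpoints span 48 * 30 = 1440 steps of the
# second half of training.
```

The incremental form needs only one extra weight buffer, which matters when the training budget is a 600-second wallclock.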
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
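Sliding-window evaluation with a short stride scores only a few new tokens per forward pass, each with nearly full left context. A sketch, assuming window=2048 from train_length (the PR states only stride=64):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """(context_start, score_start, score_end) spans for sliding-window eval.

    Each pass scores `stride` new tokens, each seen with up to `window`
    tokens of left context.
    """
    spans = []
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        context_start = max(0, score_end - window)
        spans.append((context_start, score_start, score_end))
    return spans
```

Under these assumptions every scored token gets at least 2048 − 64 = 1984 tokens of context, at roughly 32× the forward passes of non-overlapping chunked eval; this typically lowers measured BPB relative to chunked evaluation.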
Initialization
Orthogonal
Orthogonal initialization with muP-scaled outputs.
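The orthogonal part can be sketched with the standard QR-of-Gaussian construction; the muP side (e.g. scaling output-layer init with width) is not detailed in the PR, so only the orthogonal construction is shown:

```python
import numpy as np

def orthogonal_init(out_dim, in_dim, rng):
    """(Semi-)orthogonal init via QR of a Gaussian matrix."""
    if out_dim >= in_dim:
        q, r = np.linalg.qr(rng.standard_normal((out_dim, in_dim)))
        w = q * np.sign(np.diag(r))        # sign fix -> uniform over orthogonal maps
    else:
        q, r = np.linalg.qr(rng.standard_normal((in_dim, out_dim)))
        w = (q * np.sign(np.diag(r))).T
    return w
```

The sign correction on the diagonal of R is needed because raw QR output is not uniformly distributed over orthogonal matrices.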
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
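The stated schedule reads as the standard trapezoid: linear warmup, constant plateau, linear warmdown to zero. A sketch using the listed values (base_lr=0.02 taken from the Muon config; total step count is not stated in the PR):

```python
def lr_at(step, total_steps, base_lr=0.02, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR: linear warmup for 20 steps, flat plateau, then linear
    decay to 0 over the final 3000 iterations."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr
```

With only a few thousand steps fitting in the 600s budget, warmdown_iters=3000 means most of the run is spent decaying, which pairs naturally with collecting SWA checkpoints over the second half.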
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Reduced batch size to increase training steps within the 600s wallclock budget.
parameters: {"batch_size_tokens":622592,"wallclock_seconds":600}
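The reasoning behind the batch-size reduction is simple arithmetic: at roughly fixed tokens/sec throughput, steps completed in the budget scale inversely with batch size. A sketch; the tokens-per-second figure is hypothetical (the PR reports only batch_size_tokens=622592 and wallclock_seconds=600):

```python
def steps_in_budget(wallclock_s, tokens_per_step, tokens_per_sec):
    """Training steps that fit in a wallclock budget at a given throughput.

    tokens_per_sec is an assumed, illustrative throughput.
    """
    return wallclock_s * tokens_per_sec // tokens_per_step
```

Halving the batch roughly doubles the step count (and thus the number of optimizer updates), at the cost of noisier gradients per step.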
Novel Contributions
- Mixed int5 MLP / int6 attention quantization to fit a 10-layer model under the 16MB limit
- BigramHash(10240) token-pair embedding for local context
- SmearGate embedding blending mechanism
- Denser SWA collection ('SWA boost') with every=30 steps and start_frac=0.50
- Reduced batch size to increase the number of training steps within the 600-second budget