PR #451: Add LLMAdvisor submission: 1.14638 BPB (track_10min_16mb)
by harborglowvintage-oss
val_bpb: 1.1464
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,736,555 bytes
Training Techniques
Quantization
mixed int5/int6
scope: MLP weights in int5, attention weights in int6; embeddings and last-layer key projections kept in FP16
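
A minimal sketch of how such mixed-precision weight quantization could work, assuming symmetric per-row scaling (the PR does not specify the scheme; row-wise absmax scaling below is an assumption):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-row quantization of a weight matrix to signed `bits`-wide ints."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q5, s5 = quantize_symmetric(w, bits=5)            # as for MLP weights
q6, s6 = quantize_symmetric(w, bits=6)            # as for attention weights
err5 = np.abs(dequantize(q5, s5) - w).mean()
err6 = np.abs(dequantize(q6, s6) - w).mean()      # int6 halves the step size
```

Giving the attention weights one extra bit roughly halves their reconstruction error relative to the MLP weights, at a proportional cost in artifact bytes.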
Architecture
BigramHash
Hashes consecutive token pairs into a learned embedding table and projects to model dimension to capture local token-pair context.
parameters: {"buckets":10240,"dim":128}
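
A sketch of what BigramHash plausibly computes; buckets and dim come from the card, while the model dimension and the specific hash function are assumptions:

```python
import numpy as np

BUCKETS, DIM, D_MODEL = 10240, 128, 512   # buckets/dim from the PR; D_MODEL assumed

rng = np.random.default_rng(0)
table = rng.normal(0, 0.02, size=(BUCKETS, DIM)).astype(np.float32)  # learned
proj = rng.normal(0, 0.02, size=(DIM, D_MODEL)).astype(np.float32)   # learned

def bigram_hash_features(tokens):
    """Hash each (prev, cur) token pair into a bucket, look up its embedding,
    and project it to model dimension. Position 0 has no predecessor; pairing
    it with a dummy token 0 is an assumption."""
    feats = np.zeros((len(tokens), D_MODEL), dtype=np.float32)
    prev = 0
    for i, t in enumerate(tokens):
        # Simple multiplicative hash of the pair; the real hash is unspecified.
        bucket = (prev * 1000003 + t) % BUCKETS
        feats[i] = table[bucket] @ proj
        prev = t
    return feats

f = bigram_hash_features([5, 17, 17, 5])
```

The resulting features would typically be added to the token embeddings, giving the model explicit local pair context that a small transformer would otherwise have to spend attention capacity recovering.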
SmearGate
Learned per-dimension gate blending the current and previous token embeddings.
Tied embeddings
Input and output embeddings are tied and stored in FP16.
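
Weight tying means one matrix serves both the input lookup and the output projection, so it is stored once (here in FP16, halving its share of the artifact). A sketch with illustrative sizes (vocab and width are not stated in the card):

```python
import numpy as np

VOCAB, D_MODEL = 1000, 64                 # illustrative, not from the PR
rng = np.random.default_rng(0)
emb = rng.normal(0.0, 0.02, size=(VOCAB, D_MODEL)).astype(np.float16)

def embed(token_ids):
    """Input side: row lookup into the shared table (upcast for compute)."""
    return emb[token_ids].astype(np.float32)

def unembed(hidden):
    """Output side: logits from the same table, transposed."""
    return hidden @ emb.astype(np.float32).T
```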
KV head count
Uses grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
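
With 8 query heads and 4 KV heads, each K/V head is shared by 2 query heads, halving the size of the KV projections. A grouped-query attention sketch (head dimension and the lack of masking are illustrative simplifications):

```python
import numpy as np

def gqa(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: `heads` query heads share `kv_heads` K/V heads,
    each K/V head serving heads // kv_heads query heads.
    q: (seq, heads*hd); k, v: (seq, kv_heads*hd)."""
    seq = q.shape[0]
    hd = q.shape[1] // heads
    group = heads // kv_heads
    Q = q.reshape(seq, heads, hd).transpose(1, 0, 2)        # (heads, seq, hd)
    K = k.reshape(seq, kv_heads, hd).transpose(1, 0, 2)     # (kv_heads, seq, hd)
    V = v.reshape(seq, kv_heads, hd).transpose(1, 0, 2)
    K = np.repeat(K, group, axis=0)                         # duplicate shared heads
    V = np.repeat(V, group, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(hd)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True)) # softmax over keys
    w /= w.sum(axis=-1, keepdims=True)
    out = w @ V                                             # (heads, seq, hd)
    return out.transpose(1, 0, 2).reshape(seq, heads * hd)

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 32)), rng.normal(size=(3, 16)), rng.normal(size=(3, 16))
out = gqa(q, k, v)
```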
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02}
AdamW
other_params: {"lr":0.02,"scope":"embeddings/scalars"}
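
Muon applies momentum SGD but orthogonalizes each matrix update with a Newton-Schulz iteration before applying it. A rough sketch using the card's hyperparameters; the quintic coefficients follow the public reference implementation and should be treated as an assumption here:

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315          # reference coefficients (assumed)
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update with the PR's lr/momentum/weight_decay."""
    buf = momentum * buf + grad                # momentum accumulation
    update = newton_schulz_orth(buf)           # orthogonalized update direction
    w = w * (1 - lr * weight_decay) - lr * update
    return w, buf
```

The separate AdamW group for embeddings and scalars matches common Muon practice: the orthogonalization step only makes sense for 2-D weight matrices.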
Weight Averaging
SWA
parameters: {"every":30,"start_frac":0.5,"num_averaged_checkpoints":49}
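
The SWA schedule above (every=30, start_frac=0.5) can be sketched as follows; the total step count is not stated in the card, so the value below is purely illustrative:

```python
import numpy as np

def swa_schedule(total_steps, every=30, start_frac=0.5):
    """Steps whose checkpoints enter the stochastic weight average: every
    `every` steps once `start_frac` of training has elapsed (card values)."""
    start = int(total_steps * start_frac)
    return [s for s in range(start, total_steps + 1) if s % every == 0]

def fold_into_average(avg, ckpt, n_prev):
    """Incrementally fold checkpoint `ckpt` into the running average of
    `n_prev` earlier checkpoints, without keeping them all in memory."""
    return avg + (ckpt - avg) / (n_prev + 1)

steps = swa_schedule(total_steps=3200)   # total_steps is an assumed example
```

The running-average form matters under a 600-second budget: each checkpoint is folded in as it appears, so averaging 49 checkpoints costs one extra weight buffer rather than 49.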
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
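
With stride 64, evaluation advances the context window 64 tokens at a time and scores only the newly revealed tokens, so nearly every token is predicted with close to a full window of history. A sketch of the window bookkeeping, assuming the eval context matches the 2048-token training length (the card leaves eval length null):

```python
def sliding_windows(seq_len: int, context: int = 2048, stride: int = 64):
    """Plan evaluation windows: each covers up to `context` tokens and scores
    only the tokens not already scored by the previous window."""
    windows, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context, seq_len)
        n_scored = end - prev_end          # new tokens scored in this window
        windows.append((begin, end, n_scored))
        prev_end = end
        if end == seq_len:
            break
    return windows
```

The cost is roughly context/stride = 32 forward passes per span of scored tokens, which is why sliding-window eval is usually reserved for final scoring rather than training-time validation.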
Initialization
Orthogonal
Orthogonal initialization with muP-scaled outputs.
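
A sketch of orthogonal initialization via QR of a Gaussian matrix, with a muP-style gain applied to output projections; the exact muP scaling rule used in the PR is not stated, so the 1/sqrt(fan_in)-style gain below is an assumption:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal init: QR-decompose a Gaussian matrix so rows (or columns)
    are orthonormal, then scale by `gain`."""
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))               # fix QR sign ambiguity
    if rows < cols:
        q = q.T
    return (gain * q[:rows, :cols]).astype(np.float32)

d_model, fan_in = 512, 512                 # illustrative widths, not from the card
# muP-style output scaling: shrink the gain as fan-in grows (assumed rule)
w_out = orthogonal_init((d_model, fan_in), gain=1.0 / np.sqrt(fan_in))
```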
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
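
The schedule is trapezoidal: a short linear warmup, a flat phase, then a long linear warmdown. Warmup/warmdown lengths come from the card and the peak LR is taken from the Muon entry; the total step count below is illustrative, since the card does not state it:

```python
def lr_at(step, total_steps=3200, base_lr=0.02,
          warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR: linear warmup for `warmup_steps`, flat, then linear
    warmdown over the final `warmdown_iters` iterations."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr
```

Note that with warmdown_iters=3000, the warmdown dominates the run: most of training happens at a decaying learning rate, which pairs naturally with the SWA averaging that starts at the halfway point.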
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Reduced batch size to increase step throughput within the 600s wallclock budget.
parameters: {"batch_size_tokens":622592}
Novel Contributions
- Mixed int5 MLP / int6 attention quantization with FP16 embeddings to fit a 10-layer model under the 16MB limit.
- BigramHash(10240) feature to inject local token-pair context.
- SmearGate mechanism to blend current and previous token embeddings.
- Denser SWA boost schedule (every=30 steps, start_frac=0.50) with 49 averaged checkpoints.
- Reduced batch size to increase the number of training steps within the 600-second budget.