PR #665
Add LLMAdvisor submission: 1.14638 BPB (track_10min_16mb)
Status: open · by harborglowvintage-oss
val_bpb: 1.1464
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,736,555 bytes
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: MLP weights int5; attention weights int6; embeddings and last-layer key projections kept in FP16
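The mixed-precision scheme can be sketched as symmetric uniform quantization with a per-tensor FP scale, using 5 bits for MLP weights and 6 for attention weights. A minimal illustration only; the submission's actual packing, grouping, and calibration are not shown in the PR:

```python
def quantize_symmetric(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits.

    One FP scale is stored per tensor; dequantize as code * scale.
    """
    qmax = 2 ** (bits - 1) - 1                # 15 for int5, 31 for int6
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

w = [0.31, -0.07, 0.12, -0.29]
codes, scale = quantize_symmetric(w, bits=5)   # int5, as for MLP weights
w_hat = dequantize(codes, scale)
```

At int5 the reconstruction error per weight is bounded by half the scale, which is the trade the submission makes to fit 10 layers under 16MB.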
Architecture
BigramHash
Hashes consecutive token pairs into a learned embedding table to capture local bigram context.
parameters: {"dimensions":128,"buckets":10240}
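The bucket lookup can be sketched as follows. The multiplicative hash is an assumption for illustration; the PR states only that consecutive pairs are hashed into a 10240-bucket, 128-dimensional table:

```python
def bigram_bucket(prev_id, cur_id, buckets=10240):
    """Hash a consecutive token pair to a row of the bigram embedding table.

    The 1000003 multiplier is a hypothetical choice, not the submission's.
    """
    return (prev_id * 1000003 + cur_id) % buckets

def bigram_rows(token_ids, buckets=10240):
    """Bucket index per position; position 0 is paired with an assumed padding id 0."""
    rows, prev = [], 0
    for t in token_ids:
        rows.append(bigram_bucket(prev, t, buckets))
        prev = t
    return rows
```

At each position the selected 128-dim row would then be combined with the token embedding before the first block, giving the model order-sensitive pair features at negligible compute cost.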
SmearGate
Learned per-dimension gate blending current and previous token embeddings.
parameters: null
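Since the PR lists no parameters for SmearGate, here is one plausible reading: a per-dimension sigmoid gate that mixes the current embedding with the previous token's. The sigmoid parameterization is an assumption:

```python
import math

def smear_gate(curr, prev, gate_logits):
    """Blend current and previous token embeddings with a per-dimension gate.

    out[d] = g[d] * curr[d] + (1 - g[d]) * prev[d], with g = sigmoid(learned logit).
    """
    out = []
    for c, p, z in zip(curr, prev, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))
        out.append(g * c + (1.0 - g) * p)
    return out
```

A logit of 0 gives an even 50/50 blend; large positive logits let a dimension ignore the previous token entirely, so the layer can learn which channels benefit from smearing.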
tied embeddings
Input and output embeddings are tied and stored in FP16.
parameters: null
KV head count
Uses grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
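The head-sharing map implied by 8 query heads over 4 KV heads can be sketched as below, assuming the usual GQA convention of consecutive grouping:

```python
def kv_head_index(q_head, n_heads=8, n_kv_heads=4):
    """Map a query head to the KV head it shares under grouped-query attention.

    With 8 query heads and 4 KV heads, each KV head serves 2 consecutive
    query heads, halving the KV projection parameters and cache.
    """
    assert n_heads % n_kv_heads == 0
    return q_head // (n_heads // n_kv_heads)
```

Halving the KV heads is a natural fit for this track: it shrinks both the artifact (fewer K/V projection weights) and the per-step memory traffic.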
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02}
AdamW
weight_decay: null
momentum: null
other_params: {"lr":0.02,"scope":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"every":30,"start_frac":0.5,"num_averaged_checkpoints":49}
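The averaging itself reduces to an incremental running mean over collected checkpoints; a minimal sketch of the stated config:

```python
def swa_update(avg, ckpt, n):
    """Fold the n-th collected checkpoint into the running SWA mean:
    avg <- avg + (ckpt - avg) / n."""
    return [a + (c - a) / n for a, c in zip(avg, ckpt)]

# With every=30 and start_frac=0.5, checkpoint k enters the average at step
# start + 30*k, so 49 averaged checkpoints span 48 * 30 = 1440 steps of the
# second half of training.
```

The incremental form needs only one extra weight buffer, which matters when the training budget is a 600-second wallclock.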
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
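Sliding-window evaluation with a short stride scores only a few new tokens per forward pass, each with nearly full left context. A sketch, assuming window=2048 from train_length (the PR states only stride=64):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """(context_start, score_start, score_end) spans for sliding-window eval.

    Each pass scores `stride` new tokens, each seen with up to `window`
    tokens of left context.
    """
    spans = []
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        context_start = max(0, score_end - window)
        spans.append((context_start, score_start, score_end))
    return spans
```

Under these assumptions every scored token gets at least 2048 − 64 = 1984 tokens of context, at roughly 32× the forward passes of non-overlapping chunked eval; this typically lowers measured BPB relative to chunked evaluation.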
Initialization
Orthogonal
Orthogonal initialization with muP-scaled outputs.
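The orthogonal part can be sketched with the standard QR-of-Gaussian construction; the muP side (e.g. scaling output-layer init with width) is not detailed in the PR, so only the orthogonal construction is shown:

```python
import numpy as np

def orthogonal_init(out_dim, in_dim, rng):
    """(Semi-)orthogonal init via QR of a Gaussian matrix."""
    if out_dim >= in_dim:
        q, r = np.linalg.qr(rng.standard_normal((out_dim, in_dim)))
        w = q * np.sign(np.diag(r))        # sign fix -> uniform over orthogonal maps
    else:
        q, r = np.linalg.qr(rng.standard_normal((in_dim, out_dim)))
        w = (q * np.sign(np.diag(r))).T
    return w
```

The sign correction on the diagonal of R is needed because raw QR output is not uniformly distributed over orthogonal matrices.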
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
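The stated schedule reads as the standard trapezoid: linear warmup, constant plateau, linear warmdown to zero. A sketch using the listed values (base_lr=0.02 taken from the Muon config; total step count is not stated in the PR):

```python
def lr_at(step, total_steps, base_lr=0.02, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR: linear warmup for 20 steps, flat plateau, then linear
    decay to 0 over the final 3000 iterations."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr
```

With only a few thousand steps fitting in the 600s budget, warmdown_iters=3000 means most of the run is spent decaying, which pairs naturally with collecting SWA checkpoints over the second half.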
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Reduced batch size to increase training steps within the 600s wallclock budget.
parameters: {"batch_size_tokens":622592,"wallclock_seconds":600}
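The reasoning behind the batch-size reduction is simple arithmetic: at roughly fixed tokens/sec throughput, steps completed in the budget scale inversely with batch size. A sketch; the tokens-per-second figure is hypothetical (the PR reports only batch_size_tokens=622592 and wallclock_seconds=600):

```python
def steps_in_budget(wallclock_s, tokens_per_step, tokens_per_sec):
    """Training steps that fit in a wallclock budget at a given throughput.

    tokens_per_sec is an assumed, illustrative throughput.
    """
    return wallclock_s * tokens_per_sec // tokens_per_step
```

Halving the batch roughly doubles the step count (and thus the number of optimizer updates), at the cost of noisier gradients per step.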
Novel Contributions
- Mixed int5 MLP / int6 attention quantization to fit a 10-layer model under the 16MB limit
- BigramHash(10240) token-pair embedding for local context
- SmearGate embedding blending mechanism
- Denser SWA collection ('SWA boost') with every=30 steps and start_frac=0.50
- Reduced batch size to increase the number of training steps within the 600-second budget