PR #451: Add LLMAdvisor submission: 1.14638 BPB (track_10min_16mb)
by harborglowvintage-oss
val_bpb: 1.1464
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,736,555 bytes
Training Techniques
Quantization
mixed int5/int6
scope: MLP weights in int5, attention weights in int6; embeddings and last-layer key projections kept in FP16
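
A minimal sketch of how such mixed-precision weight quantization could work, assuming symmetric per-row scaling (the PR does not specify the scheme; row-wise absmax scaling below is an assumption):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-row quantization of a weight matrix to signed `bits`-wide ints."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q5, s5 = quantize_symmetric(w, bits=5)            # as for MLP weights
q6, s6 = quantize_symmetric(w, bits=6)            # as for attention weights
err5 = np.abs(dequantize(q5, s5) - w).mean()
err6 = np.abs(dequantize(q6, s6) - w).mean()      # int6 halves the step size
```

Giving the attention weights one extra bit roughly halves their reconstruction error relative to the MLP weights, at a proportional cost in artifact bytes.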
Architecture
BigramHash
Hashes consecutive token pairs into a learned embedding table and projects to model dimension to capture local token-pair context.
parameters: {"buckets":10240,"dim":128}
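
A sketch of what BigramHash plausibly computes; buckets and dim come from the card, while the model dimension and the specific hash function are assumptions:

```python
import numpy as np

BUCKETS, DIM, D_MODEL = 10240, 128, 512   # buckets/dim from the PR; D_MODEL assumed

rng = np.random.default_rng(0)
table = rng.normal(0, 0.02, size=(BUCKETS, DIM)).astype(np.float32)  # learned
proj = rng.normal(0, 0.02, size=(DIM, D_MODEL)).astype(np.float32)   # learned

def bigram_hash_features(tokens):
    """Hash each (prev, cur) token pair into a bucket, look up its embedding,
    and project it to model dimension. Position 0 has no predecessor; pairing
    it with a dummy token 0 is an assumption."""
    feats = np.zeros((len(tokens), D_MODEL), dtype=np.float32)
    prev = 0
    for i, t in enumerate(tokens):
        # Simple multiplicative hash of the pair; the real hash is unspecified.
        bucket = (prev * 1000003 + t) % BUCKETS
        feats[i] = table[bucket] @ proj
        prev = t
    return feats

f = bigram_hash_features([5, 17, 17, 5])
```

The resulting features would typically be added to the token embeddings, giving the model explicit local pair context that a small transformer would otherwise have to spend attention capacity recovering.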
SmearGate
Learned per-dimension gate blending the current and previous token embeddings.
Tied embeddings
Input and output embeddings are tied and stored in FP16.
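
Weight tying means one matrix serves both the input lookup and the output projection, so it is stored once (here in FP16, halving its share of the artifact). A sketch with illustrative sizes (vocab and width are not stated in the card):

```python
import numpy as np

VOCAB, D_MODEL = 1000, 64                 # illustrative, not from the PR
rng = np.random.default_rng(0)
emb = rng.normal(0.0, 0.02, size=(VOCAB, D_MODEL)).astype(np.float16)

def embed(token_ids):
    """Input side: row lookup into the shared table (upcast for compute)."""
    return emb[token_ids].astype(np.float32)

def unembed(hidden):
    """Output side: logits from the same table, transposed."""
    return hidden @ emb.astype(np.float32).T
```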
KV head count
Uses grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
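
With 8 query heads and 4 KV heads, each K/V head is shared by 2 query heads, halving the size of the KV projections. A grouped-query attention sketch (head dimension and the lack of masking are illustrative simplifications):

```python
import numpy as np

def gqa(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: `heads` query heads share `kv_heads` K/V heads,
    each K/V head serving heads // kv_heads query heads.
    q: (seq, heads*hd); k, v: (seq, kv_heads*hd)."""
    seq = q.shape[0]
    hd = q.shape[1] // heads
    group = heads // kv_heads
    Q = q.reshape(seq, heads, hd).transpose(1, 0, 2)        # (heads, seq, hd)
    K = k.reshape(seq, kv_heads, hd).transpose(1, 0, 2)     # (kv_heads, seq, hd)
    V = v.reshape(seq, kv_heads, hd).transpose(1, 0, 2)
    K = np.repeat(K, group, axis=0)                         # duplicate shared heads
    V = np.repeat(V, group, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(hd)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True)) # softmax over keys
    w /= w.sum(axis=-1, keepdims=True)
    out = w @ V                                             # (heads, seq, hd)
    return out.transpose(1, 0, 2).reshape(seq, heads * hd)

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 32)), rng.normal(size=(3, 16)), rng.normal(size=(3, 16))
out = gqa(q, k, v)
```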
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02}
AdamW
other_params: {"lr":0.02,"scope":"embeddings/scalars"}
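
Muon applies momentum SGD but orthogonalizes each matrix update with a Newton-Schulz iteration before applying it. A rough sketch using the card's hyperparameters; the quintic coefficients follow the public reference implementation and should be treated as an assumption here:

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315          # reference coefficients (assumed)
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update with the PR's lr/momentum/weight_decay."""
    buf = momentum * buf + grad                # momentum accumulation
    update = newton_schulz_orth(buf)           # orthogonalized update direction
    w = w * (1 - lr * weight_decay) - lr * update
    return w, buf
```

The separate AdamW group for embeddings and scalars matches common Muon practice: the orthogonalization step only makes sense for 2-D weight matrices.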
Weight Averaging
SWA
parameters: {"every":30,"start_frac":0.5,"num_averaged_checkpoints":49}
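
The SWA schedule above (every=30, start_frac=0.5) can be sketched as follows; the total step count is not stated in the card, so the value below is purely illustrative:

```python
import numpy as np

def swa_schedule(total_steps, every=30, start_frac=0.5):
    """Steps whose checkpoints enter the stochastic weight average: every
    `every` steps once `start_frac` of training has elapsed (card values)."""
    start = int(total_steps * start_frac)
    return [s for s in range(start, total_steps + 1) if s % every == 0]

def fold_into_average(avg, ckpt, n_prev):
    """Incrementally fold checkpoint `ckpt` into the running average of
    `n_prev` earlier checkpoints, without keeping them all in memory."""
    return avg + (ckpt - avg) / (n_prev + 1)

steps = swa_schedule(total_steps=3200)   # total_steps is an assumed example
```

The running-average form matters under a 600-second budget: each checkpoint is folded in as it appears, so averaging 49 checkpoints costs one extra weight buffer rather than 49.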
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
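
With stride 64, evaluation advances the context window 64 tokens at a time and scores only the newly revealed tokens, so nearly every token is predicted with close to a full window of history. A sketch of the window bookkeeping, assuming the eval context matches the 2048-token training length (the card leaves eval length null):

```python
def sliding_windows(seq_len: int, context: int = 2048, stride: int = 64):
    """Plan evaluation windows: each covers up to `context` tokens and scores
    only the tokens not already scored by the previous window."""
    windows, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context, seq_len)
        n_scored = end - prev_end          # new tokens scored in this window
        windows.append((begin, end, n_scored))
        prev_end = end
        if end == seq_len:
            break
    return windows
```

The cost is roughly context/stride = 32 forward passes per span of scored tokens, which is why sliding-window eval is usually reserved for final scoring rather than training-time validation.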
Initialization
Orthogonal
Orthogonal initialization with muP-scaled outputs.
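
A sketch of orthogonal initialization via QR of a Gaussian matrix, with a muP-style gain applied to output projections; the exact muP scaling rule used in the PR is not stated, so the 1/sqrt(fan_in)-style gain below is an assumption:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal init: QR-decompose a Gaussian matrix so rows (or columns)
    are orthonormal, then scale by `gain`."""
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))               # fix QR sign ambiguity
    if rows < cols:
        q = q.T
    return (gain * q[:rows, :cols]).astype(np.float32)

d_model, fan_in = 512, 512                 # illustrative widths, not from the card
# muP-style output scaling: shrink the gain as fan-in grows (assumed rule)
w_out = orthogonal_init((d_model, fan_in), gain=1.0 / np.sqrt(fan_in))
```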
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
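
The schedule is trapezoidal: a short linear warmup, a flat phase, then a long linear warmdown. Warmup/warmdown lengths come from the card and the peak LR is taken from the Muon entry; the total step count below is illustrative, since the card does not state it:

```python
def lr_at(step, total_steps=3200, base_lr=0.02,
          warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR: linear warmup for `warmup_steps`, flat, then linear
    warmdown over the final `warmdown_iters` iterations."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr
```

Note that with warmdown_iters=3000, the warmdown dominates the run: most of training happens at a decaying learning rate, which pairs naturally with the SWA averaging that starts at the halfway point.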
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Reduced batch size to increase step throughput within the 600s wallclock budget.
parameters: {"batch_size_tokens":622592}
Novel Contributions
- Mixed int5 MLP / int6 attention quantization with FP16 embeddings to fit a 10-layer model under the 16MB limit.
- BigramHash(10240) feature to inject local token-pair context.
- SmearGate mechanism to blend current and previous token embeddings.
- Denser SWA boost schedule (every=30 steps, start_frac=0.50) with 49 averaged checkpoints.
- Reduced batch size to increase the number of training steps within the 600-second budget.