| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 0.9076 | Transformer | Muon | 15.32 MB |

## Training Techniques

### Architecture

- **MLP3x**: 10-layer Transformer with 3x LeakyReLU MLP blocks. Parameters: `{"layers": 10, "d_model": 512}`.
- **GQA**: grouped-query attention with 8 query heads sharing 4 KV heads. Parameters: `{"heads": 8, "kv_heads": 4}`.

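A minimal sketch of how the 8 query heads map onto the 4 shared KV heads in grouped-query attention (the shapes come from the parameters above; the contiguous-group mapping convention is an assumption):

```python
N_HEADS = 8      # query heads
N_KV_HEADS = 4   # shared key/value heads
GROUP_SIZE = N_HEADS // N_KV_HEADS  # 2 query heads per KV head

def kv_head_for(q_head: int) -> int:
    """Index of the KV head that a given query head attends with."""
    return q_head // GROUP_SIZE

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, and so on.
mapping = {q: kv_head_for(q) for q in range(N_HEADS)}
print(mapping)  # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```

Relative to full multi-head attention, this halves the KV cache here, since only 4 KV heads are stored.
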
- **BigramHash**: hash-based bigram component; token bigrams are hashed into 4096 buckets, each associated with a 128-dimensional vector. Parameters: `{"buckets": 4096, "dim": 128}`.

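One plausible reading of the BigramHash component, sketched under stated assumptions: the (previous, current) token pair is hashed into one of the 4096 buckets, which would index a 128-dim table. The hash function (crc32 of the packed pair) is a stand-in, not the submission's:

```python
import zlib

BUCKETS = 4096  # from the listed parameters
DIM = 128       # vector width per bucket (listed as "dim")

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Hash the (prev, cur) token pair into one of BUCKETS buckets.
    # crc32 is an illustrative choice; the actual hash is not stated.
    key = prev_tok.to_bytes(4, "little") + cur_tok.to_bytes(4, "little")
    return zlib.crc32(key) % BUCKETS

# The model would look up a DIM-wide vector at this index.
idx = bigram_bucket(17, 42)
assert 0 <= idx < BUCKETS
```
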
- **SmearGate**: gating mechanism included in the architecture.
- **Value Residual**: residual pathway applied to the attention values.
- **Gated Attention**: attention mechanism with gating.
- **XSA**: used in the last 4 layers. Parameters: `{"layers": 4}`.
- **Partial RoPE**: rotary positional embeddings applied to 16 of the 64 head dimensions. Parameters: `{"dimensions": "16/64"}`.

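A sketch of partial RoPE on a single 64-dim head vector, rotating only the first 16 dimensions and passing the rest through unchanged. The pair layout (adjacent pairs) and base 10000 are common conventions, assumed here rather than taken from the submission:

```python
import math

HEAD_DIM = 64
ROT_DIM = 16  # only 16 of the 64 dimensions receive rotary embeddings

def partial_rope(x, pos, base=10000.0):
    """Rotate the first ROT_DIM dims of one head vector by
    position-dependent angles; the remaining dims pass through untouched."""
    out = list(x)
    for i in range(0, ROT_DIM, 2):
        theta = pos / (base ** (i / ROT_DIM))
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

v = [1.0] * HEAD_DIM
rotated = partial_rope(v, pos=3)
assert rotated[ROT_DIM:] == v[ROT_DIM:]  # tail dims are unchanged
assert rotated[:ROT_DIM] != v[:ROT_DIM]  # rotated dims moved
```
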
- **LN Scale**: LayerNorm scaling included in the architecture.
- **U-Net skip connections**: U-Net-inspired skip connections added to the Transformer.
- **Tied embeddings**: input and output embeddings share weights.
- **Logit softcap**: softcapping applied to the output logits, with cap value 30. Parameters: `{"value": 30}`.

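A logit softcap with value 30 is commonly implemented as a scaled tanh, smoothly bounding logits to (-30, 30); the tanh form is the usual definition and is assumed here:

```python
import math

CAP = 30.0  # "value": 30 from the parameters above

def softcap(logit: float, cap: float = CAP) -> float:
    # Near-identity for |logit| << cap, saturating smoothly at +/-cap.
    return cap * math.tanh(logit / cap)

print(round(softcap(5.0), 3))     # small logits pass through almost unchanged
print(round(softcap(1000.0), 3))  # huge logits are capped near 30.0
```
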
### Quantization

- Mixed int5/int6 (scope: MLP and attention).

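The submission states only the bit widths and scope, not which tensors get int5 versus int6 or the exact scheme. A generic symmetric round-to-nearest sketch that works for either width:

```python
def quantize(values, bits):
    # Symmetric per-tensor quantization to signed `bits`-bit integers.
    # The scheme itself is an assumption, not the submission's method.
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.0, 0.25, 0.9]
q5, s5 = quantize(w, bits=5)            # int5: integers in [-15, 15] here
approx = dequantize(q5, s5)
# Round-trip error is at most half a quantization step.
assert all(abs(a - b) <= s5 / 2 + 1e-12 for a, b in zip(w, approx))
```
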
### Optimizer

- **Muon**: lr 0.03, momentum 0.92, weight decay 0.04.

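A skeleton of the update as configured above, showing where lr, momentum, and decoupled weight decay enter. Muon proper also orthogonalizes the momentum buffer (via a Newton-Schulz iteration) before applying it; this scalar sketch omits that matrix-level step:

```python
LR = 0.03           # MATRIX_LR selected by the screening runs
MOMENTUM = 0.92
WEIGHT_DECAY = 0.04

def muon_like_step(w, grad, buf):
    # Momentum accumulation plus decoupled weight decay; the
    # orthogonalization step that defines Muon is omitted here.
    buf = MOMENTUM * buf + grad
    w = w * (1.0 - LR * WEIGHT_DECAY) - LR * buf
    return w, buf

w, buf = 1.0, 0.0
w, buf = muon_like_step(w, grad=0.1, buf=buf)
```
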
### Weight Averaging

- **EMA** with decay 0.997. Parameters: `{"decay": 0.997}`.

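EMA weight averaging with decay 0.997 is a one-line update; the averaged copy trails the live weights and is typically the set used for evaluation:

```python
DECAY = 0.997  # from the parameters above

def ema_update(avg, params, decay=DECAY):
    # avg <- decay * avg + (1 - decay) * params, elementwise.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

avg = [0.0]
for _ in range(1000):
    avg = ema_update(avg, [1.0])
# After 1000 steps toward 1.0, avg has closed 1 - 0.997**1000 of the gap.
```
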
### Compression

- **zstd** at level 22.

### Evaluation

- **Multi-order n-gram backoff** with a score-first, backward-looking cache and entropy-adaptive alpha mixing. Parameters: `{"orders": "2-7", "score_first": true, "backward_looking": true, "entropy_adaptive_alpha": true}`.

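The parameters suggest interpolating model predictions with n-gram statistics. A heavily simplified sketch of longest-match backoff over orders 2-7 plus an illustrative entropy-based alpha; the submission's score-first, backward-looking cache and its exact mixing rule are not specified, so every formula below is an assumption:

```python
import math
from collections import defaultdict

ORDERS = range(2, 8)  # n-gram orders 2 through 7

class NgramBackoff:
    def __init__(self):
        # counts[context_tuple][next_token] -> occurrences
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, tokens):
        for n in ORDERS:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[ctx][tokens[i + n - 1]] += 1

    def prob(self, context, token):
        # Back off from the longest available context to shorter ones.
        for k in range(6, 0, -1):
            ctx = tuple(context[-k:])
            if ctx in self.counts:
                dist = self.counts[ctx]
                return dist[token] / sum(dist.values())
        return None  # no n-gram evidence for any context suffix

def adaptive_alpha(model_probs, base_alpha=0.25):
    # Illustrative rule only: weight the n-gram side more when the
    # model's predictive entropy (uncertainty) is high.
    h = -sum(p * math.log(p) for p in model_probs if p > 0.0)
    return min(1.0, base_alpha * h)

ng = NgramBackoff()
ng.observe([1, 2, 3, 1, 2, 3, 1, 2])
```
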
### Other

- Systematic hyperparameter screening (74 experiments, 10-12 screening steps) identified MATRIX_LR=0.03 as the best setting.

### LR Schedule

- **Warmdown**. Parameters: `{"warmdown_steps": 3500}`.

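A warmdown schedule holds the learning rate flat and then decays it over the final 3500 steps. The linear decay shape and the total step count below are assumptions (only warmdown_steps is given):

```python
PEAK_LR = 0.03        # MATRIX_LR from the optimizer settings
WARMDOWN_STEPS = 3500

def lr_at(step, total_steps):
    # Constant LR until the warmdown window, then linear decay to zero.
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return PEAK_LR
    return PEAK_LR * (total_steps - step) / WARMDOWN_STEPS

# Hypothetical 10000-step run: flat, then warmed down over the last 3500.
assert lr_at(0, 10000) == PEAK_LR
assert lr_at(10000, 10000) == 0.0
```
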
### Regularization

- **Weight decay** of 0.04.

## Novel Contributions

- Improved on the previous PR #802 result by raising MATRIX_LR from 0.02 to 0.03.
- Systematic hyperparameter screening identified MATRIX_LR=0.03 as the strongest training-hyperparameter improvement.
- Uses multi-order n-gram backoff evaluation with a score-first, backward-looking cache.
- Entropy-adaptive alpha mixing for the n-gram backoff evaluation.
- Combines a 10-layer Transformer with BigramHash, SmearGate, value residuals, gated attention, and mixed int5/int6 quantization.