PR #164
Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524)
by jfprincz
val_bpb
1.1524
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.4 MB
Training Techniques
Initialization
OrthoInit
Orthogonal initialization for large matrices with muP-style scaling of output projections by 1/sqrt(2 * layers) to improve early convergence.
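A minimal sketch of this initialization, assuming a PyTorch model whose attention/MLP weights are 2-D nn.Linear matrices and whose residual output projections are named c_proj; the module naming and the n_layer argument are assumptions, not taken from this PR.

```python
import math
import torch
import torch.nn as nn

def ortho_init(model: nn.Module, n_layer: int) -> None:
    """Orthogonal init for large matrices, muP-style down-scaling of output projections."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # Orthogonal initialization for every 2-D weight matrix.
            nn.init.orthogonal_(module.weight)
            # Scale residual-stream output projections by 1/sqrt(2 * layers) so the
            # residual variance stays roughly constant with depth.
            if "c_proj" in name:  # assumed naming convention for output projections
                with torch.no_grad():
                    module.weight.mul_(1.0 / math.sqrt(2 * n_layer))
            if module.bias is not None:
                nn.init.zeros_(module.bias)
```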
Architecture
MLP3x
Expanded MLP hidden size to 1536, increasing model capacity.
parameters: {"hidden_size":1536}
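For context, a minimal sketch of the widened block; the 512 model dimension (which would make 1536 a 3x hidden size) and the GELU activation are assumptions.

```python
import torch.nn as nn

class MLP3x(nn.Module):
    """Feed-forward block with a 3x-wide hidden layer."""
    def __init__(self, d_model: int = 512, hidden_size: int = 1536):
        super().__init__()
        self.c_fc = nn.Linear(d_model, hidden_size)    # expand to the 1536-wide hidden state
        self.act = nn.GELU()                           # activation choice is an assumption
        self.c_proj = nn.Linear(hidden_size, d_model)  # project back to the residual stream

    def forward(self, x):
        return self.c_proj(self.act(self.c_fc(x)))
```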
SmearGate
Learned sigmoid gate blending each token embedding with the previous token embedding before the first transformer layer.
parameters: null
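A minimal sketch, assuming a per-channel gate and a convex blend of the current and previous embeddings; neither detail is stated in the PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    """Blend each token embedding with the previous token's embedding via a learned sigmoid gate."""
    def __init__(self, d_model: int):
        super().__init__()
        # One gate logit per channel; the sigmoid keeps the blend weight in (0, 1).
        self.gate_logit = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Shift right by one position; position 0 gets zeros.
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
        g = torch.sigmoid(self.gate_logit)
        # Convex blend of current and previous embeddings (blend form is an assumption).
        return (1.0 - g) * x + g * prev
```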
BigramHash
Hash-based bigram embedding table injecting token-pair features.
parameters: {"buckets":4096,"input_dim":128,"output_dim":512}
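A minimal sketch using the listed parameters; the hash function and the way the feature is injected (a learned projection added to the token embeddings) are assumptions.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash each (prev_token, token) pair into a small table and emit a token-pair feature."""
    def __init__(self, buckets: int = 4096, input_dim: int = 128, output_dim: int = 512):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, input_dim)  # per-bucket bigram feature
        self.proj = nn.Linear(input_dim, output_dim)   # project up to the model dimension

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64 ids.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no previous token at position 0
        # Simple multiplicative hash of the pair into [0, buckets); the hash is an assumption.
        idx = (prev * 1000003 + tokens) % self.buckets
        return self.proj(self.table(idx))  # (batch, seq, output_dim), added to the embeddings
```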
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention weights in int6; embedding and bigram tables in int8; control parameters in fp32
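A minimal sketch of symmetric round-to-nearest quantization with a single fp32 scale per tensor; the PR's actual scaling/grouping scheme is not specified.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric round-to-nearest quantization; returns int values (stored in int8) plus an fp32 scale."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6, 127 for int8
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale.float()

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Per the scope above: MLP/attention matrices -> 6-bit, embedding and bigram tables -> 8-bit,
# control parameters stay fp32 and are stored unquantized.
```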
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"warmup_start":0.92,"warmup_steps":1500,"warmdown_iters":3000,"grad_clip":0.3}
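A hedged sketch of how parameters could be split into groups for these per-group learning rates; the grouping heuristic (by tensor rank and name) is an assumption, and the resulting groups would then be handed to Muon and whatever optimizer covers the non-matrix parameters.

```python
import torch.nn as nn

def build_param_groups(model: nn.Module, matrix_lr=0.02, scalar_lr=0.02, tied_embed_lr=0.03):
    """Split parameters into matrix / scalar / tied-embedding groups (heuristic is an assumption)."""
    matrix, scalar, embed = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "embed" in name or "lm_head" in name:  # tied embedding / output head
            embed.append(p)
        elif p.ndim >= 2:                         # weight matrices, the usual Muon targets
            matrix.append(p)
        else:                                     # gains, biases, gates, other scalars
            scalar.append(p)
    return [
        {"params": matrix, "lr": matrix_lr},
        {"params": scalar, "lr": scalar_lr},
        {"params": embed, "lr": tied_embed_lr},
    ]
```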
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":256}
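A minimal sketch of sliding-window evaluation with stride 256: each 2048-token window is scored, but only positions not already covered by a previous window contribute, so tokens are predicted with near-full left context. The model call signature and the bytes-per-token conversion are assumptions.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=2048, stride=256, bytes_per_token=1.0):
    """Sliding-window bits-per-byte over a 1-D token stream."""
    nll, n_pred, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.size(0), stride):
        end = min(begin + window, tokens.size(0))
        new = end - prev_end                      # tokens first covered by this window
        ids = tokens[begin:end].unsqueeze(0)      # (1, <=window)
        logits = model(ids)                       # assumed: returns (1, T, vocab) next-token logits
        loss = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
        keep = loss[-new:]                        # score only the newly covered positions
        nll += keep.sum().item()
        n_pred += keep.numel()
        prev_end = end
        if end == tokens.size(0):
            break
    # Mean nats per token -> bits per byte (bytes_per_token is the eval set's byte/token ratio).
    return (nll / n_pred) / math.log(2) / bytes_per_token
```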
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
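A minimal sketch of the schedule multiplier, assuming linear warmup over 1500 steps and a linear warmdown to zero over the final 3000 steps; the exact ramp shapes are not stated here.

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 1500, warmdown_steps: int = 3000) -> float:
    """Linear warmup, flat plateau, then linear warmdown to zero at the end of training."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0
```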
Compression
zstd
level: 22
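A minimal sketch of packing the artifact with zstd at level 22 via the zstandard Python bindings; the checkpoint layout and file name are assumptions.

```python
import io
import torch
import zstandard as zstd

def save_artifact(state_dict, path="artifact.pt.zst"):
    """Serialize the (quantized) state dict and compress it with zstd level 22."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    cctx = zstd.ZstdCompressor(level=22)
    with open(path, "wb") as f:
        f.write(cctx.compress(buf.getvalue()))

def load_artifact(path="artifact.pt.zst"):
    dctx = zstd.ZstdDecompressor()
    with open(path, "rb") as f:
        return torch.load(io.BytesIO(dctx.decompress(f.read())))
```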
Novel Contributions
- Orthogonal + muP-scaled initialization for faster early convergence
- 3x wider MLP to increase capacity within the artifact budget
- Mixed int6/int8 quantization to reduce artifact size
- SmearGate token embedding blending with previous-token context
- BigramHash embedding for token-pair feature injection
- Tuned Muon optimizer settings with warmup and warmdown
- Training and evaluation at 2048-token sequence length with NTK-aware RoPE (see the sketch after this list)
- FlashAttention 3 integration for faster training steps
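A hedged sketch of the NTK-aware RoPE frequency scaling referenced above; the rotary base of 10000 and the context-scale factor are assumptions, as the PR only states that NTK-aware RoPE was used at the 2048-token length.

```python
import torch

def ntk_rope_inv_freq(head_dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """NTK-aware RoPE: rescale the rotary base so low frequencies stretch with the context scale."""
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / (ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
```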