PR #208

closed

Staging: Int6 MLP3x 11L + SmearGate + BigramHash4096x128 + MuonWD038 + SWA50 + DocSliding (single-run val_bpb=1.1568)

by ajkpersonal
val_bpb: 1.1568
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,704,854 bytes

Training Techniques

Quantization
int6
bits: 6
scope: artifact/model weights
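The int6 setting can be sketched as symmetric per-tensor quantization onto the signed 6-bit grid [-31, 31]. The helper names and the per-tensor scaling scheme are assumptions; the PR records only the bit width and scope.

```python
import numpy as np

def quantize_int6(w):
    """Map float weights onto the signed 6-bit grid [-31, 31] (symmetric, per-tensor)."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # 6 bits of payload
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from the int6 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.03, 0.31], dtype=np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)
```

With a symmetric grid the round-trip error is bounded by half a quantization step per weight.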
Architecture
MLP3x
Expanded MLP width by 3x in an 11-layer dense-lexical KV4 model.
parameters: {"layers":11}
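A minimal sketch of the MLP3x block, assuming a plain two-matrix position-wise MLP with hidden width 3x the model width; the activation choice (ReLU here) is an assumption, since the PR records only the 3x expansion and the 11-layer depth.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """Position-wise MLP with hidden width 3x the model width.

    x: (seq, d); w_in: (d, 3*d); w_out: (3*d, d).
    ReLU is an assumption; the PR does not state the activation.
    """
    h = np.maximum(x @ w_in, 0.0)   # (seq, 3*d) hidden activations
    return h @ w_out                # project back to (seq, d)

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d))
y = mlp3x(x, rng.standard_normal((d, 3 * d)), rng.standard_normal((3 * d, d)))
```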
SmearGate
Added SmearGate to the model.
parameters: null
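The exact SmearGate formulation is not recorded here (parameters: null). One common "smear" variant blends each token's activations with its predecessor through a learned sigmoid gate; the additive form below is an assumption, not the PR's definition.

```python
import numpy as np

def smear_gate(x, gate_logit):
    """Blend each position with the previous one via a learned sigmoid gate.

    x: (seq, dim) activations; gate_logit: (dim,) learned parameter.
    Position 0 has no predecessor and passes through unchanged.
    The additive mixing form is an assumption.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # per-channel mix weight in (0, 1)
    out = x.copy()
    out[1:] = x[1:] + g * x[:-1]           # smear the previous token forward
    return out

x = np.arange(12, dtype=np.float32).reshape(4, 3)
y = smear_gate(x, np.zeros(3))             # gate = sigmoid(0) = 0.5
```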
BigramHash
Added bigram hash features to the model.
parameters: {"dimensions":"4096x128"}
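The BigramHash(4096x128) feature can be sketched as hashing each (previous, current) token pair into one of 4096 buckets and looking up a 128-dim embedding for that bucket. The hash function and the padding token at position 0 are assumptions; only the 4096x128 table shape comes from the PR.

```python
import numpy as np

N_BUCKETS, BIGRAM_DIM = 4096, 128  # matches the "4096x128" table shape

def bigram_hash_features(tokens, table, mult=1000003):
    """Hash each (prev, cur) token pair into one of N_BUCKETS embedding rows."""
    toks = np.asarray(tokens)
    prev = np.concatenate(([0], toks[:-1]))   # pad position 0 (assumption)
    idx = (prev * mult + toks) % N_BUCKETS    # cheap multiplicative hash
    return table[idx]                         # (seq, BIGRAM_DIM) features

rng = np.random.default_rng(0)
table = rng.standard_normal((N_BUCKETS, BIGRAM_DIM)).astype(np.float32)
feats = bigram_hash_features([3, 7, 3, 7], table)
```

The returned features would typically be added to (or concatenated with) the token embeddings, giving the model a cheap lexical n-gram signal.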
Optimizer
Muon
weight_decay: 0.038
momentum: null
other_params: {"adam_weight_decay":0.01}
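Muon orthogonalizes the momentum-accumulated gradient of each 2-D weight matrix with a few Newton-Schulz iterations. The quintic coefficients below follow the published reference implementation; the learning rate and the decoupled form of the 0.038 weight decay are assumptions, and Adam (with its own weight decay of 0.01) would handle the non-matrix parameters.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95, wd=0.038):
    """One Muon update with decoupled weight decay 0.038 (the value in this run)."""
    buf = momentum * buf + grad
    update = newton_schulz(grad + momentum * buf)  # Nesterov-style lookahead
    w = (1 - lr * wd) * w - lr * update
    return w, buf

rng = np.random.default_rng(0)
O = newton_schulz(rng.standard_normal((4, 6)))
```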
Weight Averaging
SWA
parameters: {"every":50,"start_frac":0.5}
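The SWA schedule above (every=50, start_frac=0.5) averages weight snapshots taken every 50 steps once training passes the halfway point. The inner SGD step below is a stand-in for the run's real optimizer; only the snapshot bookkeeping reflects the recorded parameters.

```python
import numpy as np

def train_with_swa(w, grad_fn, total_steps, every=50, start_frac=0.5, lr=0.1):
    """Average snapshots taken every `every` steps from `start_frac` of training on."""
    swa_sum, swa_n = np.zeros_like(w), 0
    for step in range(1, total_steps + 1):
        w = w - lr * grad_fn(w, step)          # stand-in for the real update
        if step >= start_frac * total_steps and step % every == 0:
            swa_sum += w                       # accumulate a snapshot
            swa_n += 1
    return swa_sum / max(swa_n, 1), swa_n      # averaged weights used at eval

# Toy run: gradient pulls w toward 0; 200 steps -> snapshots at 100, 150, 200.
w_avg, n_snapshots = train_with_swa(np.ones(3), lambda w, t: w, 200)
```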
Evaluation
sliding window eval
parameters: {"context_length":2048,"stride":256}
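Sliding-window evaluation with context_length=2048 and stride=256 rescores the document in overlapping windows, so every token after the first window is predicted with at least 2048 - 256 = 1792 tokens of left context. Only the span bookkeeping is sketched here; the model forward pass is omitted.

```python
def sliding_windows(n_tokens, context=2048, stride=256):
    """Plan (window_start, window_end, score_from) spans for sliding-window eval.

    Each window covers up to `context` tokens; only positions from
    `score_from` onward are newly scored, so every token is counted once.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, prev_end))   # score tokens [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(3000)
```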
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Compression
zstd
level: null
Regularization
weight decay
parameters: {"muon_weight_decay":0.038,"adam_weight_decay":0.01}

Novel Contributions

  • 11-layer dense-lexical KV4 model with MLP3x
  • SmearGate architecture addition
  • BigramHash(4096x128) feature augmentation
  • Muon optimizer with weight decay 0.038 plus Adam weight decay 0.01
  • SWA every 50 steps starting at 50% of training
  • Legal re-export path using int6_zstd_core with doc_sliding 2048/256 to fit the artifact cap