PR #339 (open)

Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364)

by sheeki03
val_bpb: 1.1364
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 16.17 MB

Training Techniques

Quantization
mixed int6
bits: 6
scope: model weights
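As a rough illustration of what a 6-bit weight scheme looks like (symmetric per-tensor scaling is an assumption here; the PR's "mixed" scoping may quantize different tensors differently):

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor quantization to 6 bits: integers in [-31, 31]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.5, -0.25, 0.1], dtype=np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)  # round-trip error is at most 0.5 * scale
```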
Architecture
Backout
Learned residual subtraction from a mid-network hidden state; subtracts lambda * h_mid from the final representation.
parameters: {"layer":5,"lambda_init":0.2}
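The Backout connection described above can be sketched in a few lines (function name and tensor shapes are illustrative, not the PR's actual code; in training, lam would be a learned parameter initialized to lambda_init):

```python
import numpy as np

def backout(h_final: np.ndarray, h_mid: np.ndarray, lam: float = 0.2) -> np.ndarray:
    """Subtract a scaled copy of the mid-network (layer-5) hidden state
    from the final representation: h_final - lam * h_mid."""
    return h_final - lam * h_mid

h_final = np.ones((4, 8))  # (seq, d_model), toy sizes
h_mid = np.ones((4, 8))    # hidden state captured after layer 5
out = backout(h_final, h_mid)  # 1.0 - 0.2 * 1.0 = 0.8 everywhere
```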
SmearGate
Custom gating component used in the model architecture.
parameters: null
BigramHash
Bigram hashing component for vocabulary/features.
parameters: {"vocab_size":4096}
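One plausible reading of the BigramHash component, hashing adjacent token pairs into a fixed number of buckets (the mixing constant and hash form are assumptions, not the PR's implementation):

```python
def bigram_hash(tokens, vocab_size=4096):
    # Map each adjacent token pair (a, b) to one of vocab_size buckets.
    # The multiply-xor mix is a common cheap pair hash; any stable mix works.
    return [((a * 1_000_003) ^ b) % vocab_size for a, b in zip(tokens, tokens[1:])]
```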
MLP3x
Expanded MLP width to 3x.
parameters: null
KV head count
Uses 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
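With 8 query heads over 4 KV heads (grouped-query attention), each KV head is shared by 2 query heads; a toy shape check (sizes other than the head counts are illustrative):

```python
import numpy as np

heads, kv_heads = 8, 4
group = heads // kv_heads  # query heads per shared KV head (here 2)

k = np.zeros((kv_heads, 16, 64))  # (kv_heads, seq_len, head_dim), toy sizes
# Broadcast each KV head to its group of query heads before attention.
k_expanded = np.repeat(k, group, axis=0)  # (heads, seq_len, head_dim)
```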
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3000,"adamw_weight_decay":0.04}
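The momentum warmup parameters suggest a schedule along these lines (linear interpolation from 0.92 to the final 0.99 over 1500 steps is an assumption):

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Ramp Muon momentum linearly from `start` to `end`, then hold."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```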
Weight Averaging
SWA
parameters: {"checkpoints_averaged":6}
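Averaging the 6 saved checkpoints can be sketched as follows (the dict-of-arrays state format is illustrative; which checkpoints are selected is not specified here):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts: name -> np.ndarray."""
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}

# Toy usage: six checkpoints whose single weight is 0.0, 1.0, ..., 5.0.
ckpts = [{"w": np.full(3, float(i))} for i in range(6)]
avg = average_checkpoints(ckpts)  # "w" averages to 2.5
```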
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
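Sliding-window evaluation with a short stride typically scores only the newest stride tokens of each window, so every position keeps long left context. A sketch of the span layout (the window size equaling the 2048 train length is an assumption):

```python
def sliding_eval_spans(n_tokens: int, window: int = 2048, stride: int = 64):
    """Return (start, end, n_scored) spans: each span scores the final
    `n_scored` tokens using up to `window` tokens of left context."""
    spans = []
    scored = 0
    while scored < n_tokens:
        n_new = min(stride, n_tokens - scored)
        start = max(0, scored + n_new - window)
        spans.append((start, scored + n_new, n_new))
        scored += n_new
    return spans

spans = sliding_eval_spans(200, window=128, stride=64)
```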
Initialization
OrthoInit
Orthogonal initialization used for the model.
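A standard orthogonal initialization via QR decomposition (numpy sketch; the PR's actual init may differ in gain or shape handling):

```python
import numpy as np

def orthogonal_init(shape, gain: float = 1.0, rng=None):
    """Draw a Gaussian matrix and orthogonalize its columns via QR."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    # Multiply columns by the sign of R's diagonal so the result is
    # uniformly distributed over orthogonal matrices.
    q = q * np.sign(np.diag(r))
    return gain * q

W = orthogonal_init((8, 4))  # columns are orthonormal
```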
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
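Together these parameters imply a trapezoidal schedule: linear warmup for 1500 steps, a constant plateau, then a linear warmdown over the last 3000 iterations. A sketch (the total step count is illustrative):

```python
def lr_scale(step: int, total_steps: int,
             warmup_steps: int = 1500, warmdown_iters: int = 3000) -> float:
    """Multiplier on the base learning rate: warmup -> plateau -> warmdown."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```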
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}

Novel Contributions

  • Backout Connection: learned residual subtraction from a mid-network hidden state
  • Improved validation bpb relative to the PR #198 baseline on the same hardware/run setup
  • SWA with int6 mixed quantization and zstd compression
  • Potential future artifact-size reduction via INT5_MLP=1