PR #364
open
Record: Batch-Optimized 524K + Warmdown 4000 (val_bpb 1.1497)
by shikhar1729
val_bpb
1.1497
Architecture
10L MLP3x
Optimizer
Muon
Artifact Size
15.93MB
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: all
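The exact quantization scheme isn't spelled out here; as a hedged illustration, a symmetric per-tensor round-trip at a given bit width (the PR mixes int5 and int6 across tensors) might look like:

```python
# Hypothetical sketch of a symmetric integer quantization round-trip.
# The per-tensor max-abs scaling is an assumption, not the PR's code.
def quantize_roundtrip(xs, bits):
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = (max(abs(x) for x in xs) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return [qi * scale for qi in q]            # dequantized values

vals = [0.5, -1.0, 0.25]
recovered = quantize_roundtrip(vals, 5)        # int5 grid
```

The maximum round-trip error per value is half a quantization step, i.e. `scale / 2`.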
Architecture
MLP3x
10-layer model with 3x MLP blocks as part of the base architecture.
parameters: {"layers":10}
SmearGate
Custom gating mechanism used in the model architecture.
parameters: null
BigramHash
Bigram hashing component with 10240 buckets.
parameters: {"buckets":10240}
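The bucket count is given as 10240; the hash function itself isn't, so this is a minimal sketch of the usual idea: hash each consecutive token pair into a fixed-size embedding table (the mixing constant below is an arbitrary assumption).

```python
# Hypothetical bigram-hash lookup: each (previous, current) token pair is
# hashed into one of 10240 buckets. The odd-prime multiplier is illustrative.
NUM_BUCKETS = 10240

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    h = (prev_tok * 1000003) ^ cur_tok   # mix the pair into one integer
    return h % NUM_BUCKETS               # reduce to the table size

seq = [17, 4, 99, 4, 99]
buckets = [bigram_bucket(a, b) for a, b in zip(seq, seq[1:])]
```

Repeated bigrams map to the same bucket, so the table can memorize frequent pairs cheaply.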
Weight Averaging
SWA
parameters: null
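Assuming "SWA" here means stochastic weight averaging, the core of it is just a running mean over checkpoint weights, evaluated with the averaged copy. A toy sketch with two-parameter "checkpoints":

```python
# Minimal SWA sketch: incremental mean over checkpoint weight vectors.
def swa_update(avg, new, n_averaged):
    # avg <- avg + (new - avg) / (n_averaged + 1)
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]

checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)
# avg is now the element-wise mean of all three checkpoints
```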
Initialization
OrthoInit
Orthogonal initialization used for model weights.
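Frameworks typically implement this via QR or SVD (e.g. `torch.nn.init.orthogonal_`); as a self-contained illustration, Gram-Schmidt over a random Gaussian matrix produces the same kind of row-orthonormal weight matrix:

```python
# Illustrative orthogonal init: orthonormalize the rows of a Gaussian
# matrix with Gram-Schmidt (requires n_rows <= n_cols).
import random

def ortho_init(n_rows, n_cols, seed=0):
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    basis = []
    for r in rows:
        for b in basis:                      # remove components along earlier rows
            dot = sum(x * y for x, y in zip(r, b))
            r = [x - dot * y for x, y in zip(r, b)]
        norm = sum(x * x for x in r) ** 0.5  # then normalize to unit length
        basis.append([x / norm for x in r])
    return basis

W = ortho_init(3, 8)   # rows of W are orthonormal: W @ W.T = I
```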
Compression
zstd
level: 22
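Level 22 is above zstd's default maximum of 19, so the CLI needs the `--ultra` flag. A typical invocation (the artifact filename is a placeholder):

```shell
# Maximum-ratio zstd compression; levels above 19 require --ultra.
zstd --ultra -22 model_artifact.bin -o model_artifact.bin.zst
```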
Evaluation
sliding window eval
parameters: {"stride":64}
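Only the stride (64) is given; the window length below is an assumed placeholder. The usual scheme scores each token exactly once while giving it long left context: advance the window by `stride` tokens per forward pass and count only the newly covered positions toward the loss.

```python
# Sketch of sliding-window evaluation spans. window=1024 is an assumption;
# stride=64 comes from the parameters above.
def sliding_windows(n_tokens, window=1024, stride=64):
    # Each triple is (ctx_start, ctx_end, score_start): the model reads
    # tokens[ctx_start:ctx_end] but only tokens[score_start:ctx_end]
    # contribute to the reported bpb.
    spans = [(0, min(window, n_tokens), 0)]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, end))
        end = new_end
    return spans
```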
Test-Time Training
full TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":15,"freeze_blocks":0,"batch_seqs":16}
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
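A warmdown schedule in this style holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps. A minimal sketch (the total iteration count is a placeholder):

```python
# Warmdown LR schedule: flat, then linear decay to 0 over the last
# warmdown_iters steps. total_iters is an assumed placeholder.
def warmdown_lr(step, base_lr, total_iters, warmdown_iters=4000):
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```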
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
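Muon, per its public description, is momentum SGD whose 2-D weight updates are approximately orthogonalized with a Newton-Schulz iteration before being applied. The sketch below uses the published iteration coefficients; the learning rate and momentum are illustrative placeholders (the PR lists momentum as null), while `wd=0.04` matches the entry, applied as decoupled weight decay.

```python
# Hedged sketch of a Muon-style update with tiny pure-Python matrix helpers.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz(G, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    # Normalize by the Frobenius norm, then iterate
    # X <- a*X + (b*A + c*A@A) @ X with A = X @ X^T,
    # which pushes the singular values of X toward 1.
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        Xt = [list(col) for col in zip(*X)]
        A = matmul(X, Xt)
        A2 = matmul(A, A)
        B = [[b * A[i][j] + c * A2[i][j] for j in range(len(A))] for i in range(len(A))]
        BX = matmul(B, X)
        X = [[a * x + y for x, y in zip(xr, yr)] for xr, yr in zip(X, BX)]
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, wd=0.04):
    # lr/momentum are assumed placeholders; wd matches the PR metadata.
    buf = [[momentum * m + g for m, g in zip(mr, gr)] for mr, gr in zip(buf, grad)]
    update = newton_schulz(buf)
    W = [[w * (1 - lr * wd) - lr * u for w, u in zip(wr, ur)]
         for wr, ur in zip(W, update)]
    return W, buf
```

The Newton-Schulz step only approximately orthogonalizes the update (singular values land near 1, not exactly at 1), which is reported to be sufficient in practice.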
Regularization
weight decay
parameters: {"weight_decay":0.04}
Novel Contributions
- Reduced training batch tokens to 524288 to obtain more optimizer steps per wall-clock minute.
- Retuned warmdown to 4000 iterations to match the higher step count from the smaller batch.
- Applied full-weight test-time training on the validation distribution after quantization roundtrip.
- Used sliding window evaluation with stride 64.
- Built on the prior #1 entry architecture with no code changes, only hyperparameter changes.