PR #364
open
Record: Batch-Optimized 524K + Warmdown 4000 (val_bpb 1.1497)
by shikhar1729
val_bpb
1.1497
Architecture
10L MLP3x
Optimizer
Muon
Artifact Size
15.93MB
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: all
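The exact quantization scheme isn't spelled out here; as a hedged illustration, a symmetric per-tensor round-trip at a given bit width (the PR mixes int5 and int6 across tensors) might look like:

```python
# Hypothetical sketch of a symmetric integer quantization round-trip.
# The per-tensor max-abs scaling is an assumption, not the PR's code.
def quantize_roundtrip(xs, bits):
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = (max(abs(x) for x in xs) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return [qi * scale for qi in q]            # dequantized values

vals = [0.5, -1.0, 0.25]
recovered = quantize_roundtrip(vals, 5)        # int5 grid
```

The maximum round-trip error per value is half a quantization step, i.e. `scale / 2`.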
Architecture
MLP3x
10-layer model with 3x MLP blocks as part of the base architecture.
parameters: {"layers":10}
SmearGate
Custom gating mechanism used in the model architecture.
parameters: null
BigramHash
Bigram hashing component with 10240 buckets.
parameters: {"buckets":10240}
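The bucket count is given as 10240; the hash function itself isn't, so this is a minimal sketch of the usual idea: hash each consecutive token pair into a fixed-size embedding table (the mixing constant below is an arbitrary assumption).

```python
# Hypothetical bigram-hash lookup: each (previous, current) token pair is
# hashed into one of 10240 buckets. The odd-prime multiplier is illustrative.
NUM_BUCKETS = 10240

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    h = (prev_tok * 1000003) ^ cur_tok   # mix the pair into one integer
    return h % NUM_BUCKETS               # reduce to the table size

seq = [17, 4, 99, 4, 99]
buckets = [bigram_bucket(a, b) for a, b in zip(seq, seq[1:])]
```

Repeated bigrams map to the same bucket, so the table can memorize frequent pairs cheaply.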
Weight Averaging
SWA
parameters: null
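Assuming "SWA" here means stochastic weight averaging, the core of it is just a running mean over checkpoint weights, evaluated with the averaged copy. A toy sketch with two-parameter "checkpoints":

```python
# Minimal SWA sketch: incremental mean over checkpoint weight vectors.
def swa_update(avg, new, n_averaged):
    # avg <- avg + (new - avg) / (n_averaged + 1)
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]

checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)
# avg is now the element-wise mean of all three checkpoints
```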
Initialization
OrthoInit
Orthogonal initialization used for model weights.
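Frameworks typically implement this via QR or SVD (e.g. `torch.nn.init.orthogonal_`); as a self-contained illustration, Gram-Schmidt over a random Gaussian matrix produces the same kind of row-orthonormal weight matrix:

```python
# Illustrative orthogonal init: orthonormalize the rows of a Gaussian
# matrix with Gram-Schmidt (requires n_rows <= n_cols).
import random

def ortho_init(n_rows, n_cols, seed=0):
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    basis = []
    for r in rows:
        for b in basis:                      # remove components along earlier rows
            dot = sum(x * y for x, y in zip(r, b))
            r = [x - dot * y for x, y in zip(r, b)]
        norm = sum(x * x for x in r) ** 0.5  # then normalize to unit length
        basis.append([x / norm for x in r])
    return basis

W = ortho_init(3, 8)   # rows of W are orthonormal: W @ W.T = I
```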
Compression
zstd
level: 22
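Level 22 is above zstd's default maximum of 19, so the CLI needs the `--ultra` flag. A typical invocation (the artifact filename is a placeholder):

```shell
# Maximum-ratio zstd compression; levels above 19 require --ultra.
zstd --ultra -22 model_artifact.bin -o model_artifact.bin.zst
```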
Evaluation
sliding window eval
parameters: {"stride":64}
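Only the stride (64) is given; the window length below is an assumed placeholder. The usual scheme scores each token exactly once while giving it long left context: advance the window by `stride` tokens per forward pass and count only the newly covered positions toward the loss.

```python
# Sketch of sliding-window evaluation spans. window=1024 is an assumption;
# stride=64 comes from the parameters above.
def sliding_windows(n_tokens, window=1024, stride=64):
    # Each triple is (ctx_start, ctx_end, score_start): the model reads
    # tokens[ctx_start:ctx_end] but only tokens[score_start:ctx_end]
    # contribute to the reported bpb.
    spans = [(0, min(window, n_tokens), 0)]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, end))
        end = new_end
    return spans
```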
Test-Time Training
full TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":15,"freeze_blocks":0,"batch_seqs":16}
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
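A warmdown schedule in this style holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps. A minimal sketch (the total iteration count is a placeholder):

```python
# Warmdown LR schedule: flat, then linear decay to 0 over the last
# warmdown_iters steps. total_iters is an assumed placeholder.
def warmdown_lr(step, base_lr, total_iters, warmdown_iters=4000):
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```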
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
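Muon, per its public description, is momentum SGD whose 2-D weight updates are approximately orthogonalized with a Newton-Schulz iteration before being applied. The sketch below uses the published iteration coefficients; the learning rate and momentum are illustrative placeholders (the PR lists momentum as null), while `wd=0.04` matches the entry, applied as decoupled weight decay.

```python
# Hedged sketch of a Muon-style update with tiny pure-Python matrix helpers.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz(G, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    # Normalize by the Frobenius norm, then iterate
    # X <- a*X + (b*A + c*A@A) @ X with A = X @ X^T,
    # which pushes the singular values of X toward 1.
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        Xt = [list(col) for col in zip(*X)]
        A = matmul(X, Xt)
        A2 = matmul(A, A)
        B = [[b * A[i][j] + c * A2[i][j] for j in range(len(A))] for i in range(len(A))]
        BX = matmul(B, X)
        X = [[a * x + y for x, y in zip(xr, yr)] for xr, yr in zip(X, BX)]
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, wd=0.04):
    # lr/momentum are assumed placeholders; wd matches the PR metadata.
    buf = [[momentum * m + g for m, g in zip(mr, gr)] for mr, gr in zip(buf, grad)]
    update = newton_schulz(buf)
    W = [[w * (1 - lr * wd) - lr * u for w, u in zip(wr, ur)]
         for wr, ur in zip(W, update)]
    return W, buf
```

The Newton-Schulz step only approximately orthogonalizes the update (singular values land near 1, not exactly at 1), which is reported to be sufficient in practice.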
Regularization
weight decay
parameters: {"weight_decay":0.04}
Novel Contributions
- Reduced training batch tokens to 524288 to obtain more optimizer steps per wall-clock minute.
- Retuned warmdown to 4000 iterations to match the higher step count from the smaller batch.
- Applied full-weight test-time training on the validation distribution after quantization roundtrip.
- Used sliding window evaluation with stride 64.
- Built on the prior #1 entry architecture with no code changes, only hyperparameter changes.