PR #178 (closed)
Add Nuclear Stack submission: 1.16668 BPB (seed 2884431328)
by timowhite88
val_bpb: 1.1667
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.8 MB
Training Techniques
Architecture
MLP3x
Uses 3x MLP expansion with ReLU² activation.
parameters: {"hidden":1536}
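A minimal numpy sketch of the 3x-expansion MLP with squared-ReLU. The listed hidden width of 1536 implies a model width of 512 under 3x expansion; that model width is an assumption.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with 3x width expansion and squared-ReLU activation."""
    h = x @ w_in                 # expand: d_model -> 3 * d_model
    h = np.maximum(h, 0.0) ** 2  # ReLU^2
    return h @ w_out             # project back to d_model

rng = np.random.default_rng(0)
d = 512                          # assumption: hidden 1536 = 3 * 512
x = rng.standard_normal((4, d))
w_in = rng.standard_normal((d, 3 * d)) * 0.02
w_out = rng.standard_normal((3 * d, d)) * 0.02
y = mlp3x(x, w_in, w_out)        # shape (4, 512)
```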
SmearGate
Learned gating that blends each token with the previous token.
parameters: none
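One way SmearGate's blending could look, as a hedged sketch: the PR only describes a learned gate mixing each token with its predecessor, so the scalar sigmoid gate parameterization here is an assumption.

```python
import numpy as np

def smear_gate(x, w_gate):
    """Blend each token's embedding with the previous token's embedding
    via a learned sigmoid gate (gate parameterization is an assumption)."""
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # (seq, 1) gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                           # first token: nothing to smear in
    return (1.0 - g) * x + g * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
w_gate = rng.standard_normal((64, 1)) * 0.1
y = smear_gate(x, w_gate)
```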
BigramHash
2048-bucket hash table for token-pair context.
parameters: {"buckets":2048}
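A sketch of how a 2048-bucket token-pair hash could feed a learned table. The mixing constant and the pad token for position 0 are illustrative assumptions; the PR only specifies the bucket count.

```python
import numpy as np

def bigram_bucket(prev_tok, tok, buckets=2048):
    """Hash a (previous token, current token) pair into one of `buckets`
    buckets. The multiply-xor mixing is illustrative, not from the PR."""
    return ((prev_tok * 1000003) ^ tok) % buckets

# Learned table: one embedding row per bucket (zeros here as a stand-in).
table = np.zeros((2048, 64))
tokens = [5, 17, 99, 17]
pairs = zip([0] + tokens[:-1], tokens)       # pad position 0 with token 0
feats = np.stack([table[bigram_bucket(p, t)] for p, t in pairs])
```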
GQA
Grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
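The 8-head / 4-KV-head split means each KV head is shared by a group of 2 query heads. A minimal causal GQA sketch (per-head loop for clarity, not speed):

```python
import numpy as np

def gqa(q, k, v):
    """Causal grouped-query attention.
    q: (seq, n_heads, hd); k, v: (seq, n_kv_heads, hd)."""
    seq, n_heads, hd = q.shape
    n_kv = k.shape[1]
    group = n_heads // n_kv               # 8 heads / 4 KV heads -> groups of 2
    mask = np.triu(np.full((seq, seq), -1e9), k=1)
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                   # KV head shared by this query head
        s = q[:, h] @ k[:, kv].T / np.sqrt(hd) + mask
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8, 16))
k = rng.standard_normal((6, 4, 16))
v = rng.standard_normal((6, 4, 16))
y = gqa(q, k, v)                          # shape (6, 8, 16)
```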
Optimizer
Muon
weight_decay: 0.02
momentum: warmup 0.92 → 0.99
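The PR lists only the warmup endpoints 0.92 → 0.99; a linear schedule over training is an assumption in this sketch.

```python
def muon_momentum(step, total_steps, start=0.92, end=0.99):
    """Momentum warmup 0.92 -> 0.99. Linear interpolation is assumed;
    only the endpoints come from the submission."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return start + frac * (end - start)
```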
Weight Averaging
SWA
parameters: {"checkpoints_averaged":"7-8"}
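SWA here is a uniform element-wise average of the late checkpoints (7-8 of them per the listed parameters). A toy sketch:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts (SWA over late checkpoints)."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

# Toy example: 8 checkpoints whose single weight ramps 0..7 -> average 3.5.
ckpts = [{"w": np.full(3, float(i))} for i in range(8)]
avg = average_checkpoints(ckpts)
```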
Quantization
int6
bits: 6
scope: all
Compression
zstd
level: 22
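A sketch of the quantization half of the int6 + zstd pipeline. Symmetric per-tensor scaling into the signed range [-31, 31] is an assumption; the PR only states 6-bit quantization over all weights, with zstd (level 22) applied to the resulting bytes.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization into [-31, 31] (exact
    scheme is an assumption; the PR only states 6-bit, all weights)."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale                      # q's bytes are then zstd-compressed

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)            # round-trip error <= scale / 2
```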
Evaluation
sliding window eval
parameters: {"stride":32}
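With stride 32, each evaluation step scores 32 new tokens using up to a full window of left context, so every token's loss is counted exactly once. A sketch of the span bookkeeping (window size is illustrative here):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=32):
    """Score each token exactly once: every step scores `stride` new
    tokens with up to `window` tokens of left context."""
    spans = []
    pos = 0
    while pos < n_tokens:
        hi = min(pos + stride, n_tokens)
        lo_ctx = max(0, hi - window)     # context window start
        spans.append((lo_ctx, pos, hi))  # (context start, score start, score end)
        pos = hi
    return spans

spans = sliding_eval_spans(100, window=64, stride=32)
scored = sum(hi - lo for _, lo, hi in spans)   # 100: no double counting
```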
Test-Time Training
full TTT
parameters: {"epochs":2,"learning_rate":0.002,"frozen_blocks":4}
Initialization
Orthogonal init
Orthogonal initialization with muP scaling.
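A sketch of orthogonal initialization with a muP-style width-dependent scale. Scaling by 1/sqrt(fan_in) is one common muP choice for hidden layers; the submission's exact multiplier is an assumption.

```python
import numpy as np

def orthogonal_mup_init(fan_out, fan_in, rng):
    """Orthogonal matrix via QR, scaled by 1/sqrt(fan_in) in the muP
    style (the exact muP multiplier is an assumption)."""
    a = rng.standard_normal((fan_out, fan_in))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # fix signs for a unique Q
    return q / np.sqrt(fan_in)

rng = np.random.default_rng(0)
w = orthogonal_mup_init(8, 4, rng)       # columns orthogonal, scaled down
```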
Sequence Length
train_length: 2048
eval_length: not specified
Regularization
weight decay
parameters: {"value":0.02}
Novel Contributions
- Combines architectural improvements with test-time training in a single submission
- Introduces SmearGate token blending
- Introduces BigramHash token-pair context hashing
- Uses 3x MLP expansion with ReLU² activation
- Applies SWA over multiple checkpoints
- Uses int6 mixed quantization with zstd compression
- Performs honest sliding-window evaluation that avoids double-counting tokens
- Applies full-model test-time training on validation data