| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.2089 | Transformer | Muon | 15,190,812 bytes |
Training Techniques
Quantization: STE QAT
- bits: 6
- scope: large matrices / model weights
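The STE QAT entry above can be sketched as follows. This is a hedged illustration, not the submission's code: it assumes symmetric per-tensor scaling onto the signed int6 range [-32, 31], with the straight-through estimator treating the rounding step as the identity in the backward pass.

```python
def fake_int6_quantize(w):
    """Fake int6 quantization with a straight-through estimator (STE).

    Assumed scheme (not taken from the submission): a symmetric
    per-tensor scale maps weights onto the signed int6 range [-32, 31].
    The forward pass returns dequantized weights; the returned backward
    function is the STE, passing upstream gradients through unchanged
    as if round() were the identity.
    """
    scale = max(max(abs(x) for x in w), 1e-8) / 31.0
    codes = [min(max(round(x / scale), -32), 31) for x in w]  # int6 codes
    w_q = [c * scale for c in codes]  # dequantized forward values

    def backward(grad_out):
        # STE: treat d(w_q)/dw as 1, so the gradient flows untouched and
        # the optimizer keeps updating the fp32 master weights (which the
        # run restores after each backward pass).
        return grad_out

    return w_q, backward
```

In the run described here, this fake quantization reportedly activates at step 200, with fp32 weights restored after backward so optimizer state stays full precision.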
Architecture: MLP3x
- 11-layer Transformer with a 512-dim hidden size and a 1024-dim MLP hidden size; the original target of 1536 for the MLP hidden size was reduced to fit the budget.
- parameters: {"layers":11,"dimensions":512,"mlp_hidden":1024}
Optimizer: Muon
- weight_decay: null
- momentum: 0.99
- other_params: {"matrix_lr":0.02,"scalar_lr":0.025,"warmdown":3000,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
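The `muon_momentum_warmup_*` values above imply a momentum ramp from 0.92 to the final 0.99 over 1,500 steps. A minimal sketch, assuming linear interpolation (the schedule shape is not stated in the report):

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup consistent with the muon_momentum_warmup_* values
    above. The linear interpolation is an assumption about the schedule
    shape, not taken from the submission."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)
```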
Evaluation: sliding window eval
- parameters: {"context_length":4096,"chunk_size":512,"stride":64}
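One plausible reading of the sliding-window parameters above can be sketched as a window planner. This is an assumption about the scheme, not the submission's code: each window advances by `stride` tokens, scores only the newly revealed tokens, and conditions them on up to `context_length` preceding tokens; `chunk_size` is taken here to be the number of windows batched per forward pass.

```python
def sliding_eval_plan(n_tokens, context_length=4096, stride=64):
    """Plan token scoring for a sliding-window evaluation.

    Assumed scheme: overlapping windows of up to `context_length`
    tokens advance by `stride`; each window scores only its `stride`
    newly revealed tokens, so every token past the first window sees
    at least context_length - stride tokens of context.
    """
    plan = []
    start = 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        ctx_start = max(0, end - context_length)
        # (context start, score start, score end) for one forward pass
        plan.append((ctx_start, start, end))
        start = end
    return plan
```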
Sequence Length
- train_length: 4096
- eval_length: 4096
LR Schedule: warmdown
- parameters: {"warmdown_steps":3000}
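The warmdown schedule above can be sketched as a constant learning rate followed by a ramp to zero over the final 3,000 steps. The linear shape is an assumption, and `base_lr` here reuses the `matrix_lr` from the Muon settings:

```python
def lr_with_warmdown(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then a ramp to zero over the last warmdown_steps
    steps. Linear shape is an assumption; base_lr mirrors matrix_lr."""
    steps_left = max(0, total_steps - step)
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```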
Compression: zlib
- level: null
Other
- Flat tensor storage for the packed int6 bytes (int6_mixed_per_row_v2), which improves compression by avoiding pickle metadata interleaved with the weight data.
- parameters: {"format":"int6_mixed_per_row_v2"}
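The flat-storage idea above can be illustrated with a simple packer. This is a sketch, not the actual `int6_mixed_per_row_v2` layout (which is not specified here): four signed int6 codes are packed into three bytes, tensors are concatenated into one contiguous stream, and zlib compresses the whole blob, so no pickle or container metadata is interleaved with the weight bytes.

```python
import zlib

def pack_int6(codes):
    """Pack signed int6 codes in [-32, 31], four codes per three bytes."""
    out = bytearray()
    u = [(c + 32) & 0x3F for c in codes]  # map to unsigned 0..63
    for i in range(0, len(u), 4):
        group = u[i:i + 4] + [0] * (4 - len(u[i:i + 4]))  # zero-pad tail
        a, b, c, d = group
        out.append((a << 2) | (b >> 4))
        out.append(((b & 0xF) << 4) | (c >> 2))
        out.append(((c & 0x3) << 6) | d)
    return bytes(out)

def unpack_int6(data, n):
    """Recover the first n signed int6 codes from a packed byte stream."""
    codes = []
    for i in range(0, len(data), 3):
        a, b, c = data[i], data[i + 1], data[i + 2]
        codes += [a >> 2, ((a & 0x3) << 4) | (b >> 4),
                  ((b & 0xF) << 2) | (c >> 6), c & 0x3F]
    return [u - 32 for u in codes[:n]]

def compress_flat(tensors):
    """Concatenate packed tensors into one flat blob and zlib it.
    Default compression level is used here; the run's level is listed
    as null above."""
    blob = b"".join(pack_int6(t) for t in tensors)
    return zlib.compress(blob)
```

Keeping the stream flat lets zlib find long matches across rows instead of being interrupted by serializer framing, which is the stated motivation for this format.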
Novel Contributions
- Flat tensor storage for packed int6 weights to improve zlib compression
- STE fake-int6 QAT activated at step 200 with fp32 weight restore after backward
- Sliding window evaluation with ctx=4096, chunk=512, stride=64
- Tuned Muon optimizer settings for the 8×H100, 10-minute budget
- Observation that more training steps can worsen compression due to near-orthogonal, high-entropy weights