PR #783

open

Non-record: PR703 + shard-order curriculum + GPTQ cache-backout (1.1171)

by petergpt
val_bpb
1.1171
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,909,560 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: full model / banked-attn and MLP surface
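The 6-bit setting above can be illustrated with a minimal symmetric round-to-nearest quantizer. This is a sketch of the bit width only, not GPTQ itself: GPTQ additionally applies Hessian-weighted, column-by-column error compensation, which is omitted here.

```python
def quantize_int6(weights, n_bits=6):
    """Symmetric round-to-nearest quantization to signed n-bit ints.

    Illustrative only: GPTQ proper also compensates quantization error
    column by column using second-order (Hessian) information.
    """
    qmax = 2 ** (n_bits - 1) - 1              # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dequant = [qi * scale for qi in q]        # reconstruction for inference
    return q, scale, dequant
```

At 6 bits the representable range is [-32, 31], which is why the scale is anchored to the largest absolute weight.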
Architecture
weight tying
PR703-style branch with tied embeddings and an 11-layer trunk; includes a cache-backout path and a banked-attn/MLP surface.
parameters: {"layers":11,"bigram_vocab_size":1536,"cache_layer":7}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"muon_quant_momentum":1,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
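The `muon_momentum_warmup_*` values above suggest momentum is ramped from 0.92 to the final 0.99 over the first 1500 steps. A linear ramp is a common choice; the exact interpolation shape is an assumption here, only the endpoints and step count come from the listed parameters.

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup for Muon.

    Endpoint values and step count are taken from other_params above;
    the linear interpolation shape is an assumption.
    """
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```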
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
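The two averaging schemes above can be sketched as follows: EMA keeps an exponential moving average of the weights with decay 0.997, while SWA keeps a plain running average of snapshots taken every 50 steps. How the two averages are combined at eval time is not specified in this card.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step with the decay listed above."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

class SWA:
    """Running average of weight snapshots taken every `every` steps."""

    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step, weights):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```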
Evaluation
sliding window eval
parameters: {"stride":64}
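Sliding-window eval with stride 64 means windows overlap heavily and each window scores only the tokens past the overlap, so every token is evaluated exactly once with up to a full window of left context. A minimal sketch of the window enumeration (assuming the train length of 1024 as the window size, since `eval_length` is null):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Enumerate (begin, end, n_scored) spans for sliding-window eval.

    Each window scores only its last n_scored tokens, so every token
    is scored exactly once. Window size is assumed equal to the
    train_length above; stride=64 comes from the listed parameters.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride costs more forward passes but gives later tokens more context, which typically lowers measured bpb.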
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"iterations":9000}
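With 9000 total iterations and 3500 warmdown iterations, the schedule above holds the LR constant for the first 5500 steps and then decays it to zero. The linear decay shape is the usual warmdown convention and is assumed here; only the step counts come from the listed parameters.

```python
def lr_scale(step, iterations=9000, warmdown_iters=3500):
    """LR multiplier: constant, then linear warmdown to zero.

    iterations and warmdown_iters are from this submission's parameters;
    the linear shape is an assumed convention.
    """
    warmdown_start = iterations - warmdown_iters   # step 5500
    if step < warmdown_start:
        return 1.0
    return max(0.0, (iterations - step) / warmdown_iters)
```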
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Other
other
Score-ranked shard curriculum that reorders training shards using a lightweight scorer so harder shards are seen earlier.
parameters: null
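The shard curriculum described above reduces to a sort: score each shard once with a cheap scorer, then train on shards in descending score order so harder shards come first. The scorer itself is not specified in this card; the `score` callable below is a placeholder for it.

```python
def curriculum_order(shards, score):
    """Reorder shards so higher-scoring ('harder') shards are seen first.

    `score` stands in for the lightweight scorer mentioned above, whose
    definition this card does not spell out.
    """
    return sorted(shards, key=score, reverse=True)
```

Because the reordering happens once before training, it adds a single scoring pass over the data but no per-step overhead.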
Compression
lzma
level: null
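The "int6 + lzma" packing in the contributions list can be sketched with Python's stdlib `lzma`: pack the signed 6-bit values into a contiguous bitstream (6 bits per weight instead of 8), then LZMA-compress the result. The PR's actual container format is not specified here; this shows only the two stages.

```python
import lzma


def pack_int6(values):
    """Pack signed 6-bit ints into a byte stream, then LZMA-compress.

    Minimal sketch of 'int6 + lzma packing'; the submission's real
    container layout is not documented in this card.
    """
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        bits = (bits << 6) | (v & 0x3F)    # 6-bit two's complement
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                              # flush trailing partial byte
        out.append((bits << (8 - nbits)) & 0xFF)
    return lzma.compress(bytes(out), preset=9)
```

Bit-packing alone saves 25% over one byte per weight; LZMA then exploits any remaining redundancy in the quantized values, which is what makes the 16MB artifact cap reachable.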

Novel Contributions

  • Score-ranked shard-order curriculum
  • Tighter final int6 + lzma packing
  • GPTQ cache-backout branch carryover from PR703
  • Single-seed non-record submission under the 16MB cap