PR #783

open

Non-record: PR703 + shard-order curriculum + GPTQ cache-backout (1.1171)

by petergpt
val_bpb
1.1171
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,909,560 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: full model / banked-attn and MLP surface
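The 6-bit setting above can be illustrated with a minimal symmetric round-to-nearest quantizer. This is a sketch of the bit width only, not GPTQ itself: GPTQ additionally applies Hessian-weighted, column-by-column error compensation, which is omitted here.

```python
def quantize_int6(weights, n_bits=6):
    """Symmetric round-to-nearest quantization to signed n-bit ints.

    Illustrative only: GPTQ proper also compensates quantization error
    column by column using second-order (Hessian) information.
    """
    qmax = 2 ** (n_bits - 1) - 1              # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dequant = [qi * scale for qi in q]        # reconstruction for inference
    return q, scale, dequant
```

At 6 bits the representable range is [-32, 31], which is why the scale is anchored to the largest absolute weight.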
Architecture
weight tying
PR703-style branch with tied embeddings and an 11-layer trunk; includes a cache-backout path and a banked-attn/MLP surface.
parameters: {"layers":11,"bigram_vocab_size":1536,"cache_layer":7}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"muon_quant_momentum":1,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
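The `muon_momentum_warmup_*` values above suggest momentum is ramped from 0.92 to the final 0.99 over the first 1500 steps. A linear ramp is a common choice; the exact interpolation shape is an assumption here, only the endpoints and step count come from the listed parameters.

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup for Muon.

    Endpoint values and step count are taken from other_params above;
    the linear interpolation shape is an assumption.
    """
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```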
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
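The two averaging schemes above can be sketched as follows: EMA keeps an exponential moving average of the weights with decay 0.997, while SWA keeps a plain running average of snapshots taken every 50 steps. How the two averages are combined at eval time is not specified in this card.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step with the decay listed above."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

class SWA:
    """Running average of weight snapshots taken every `every` steps."""

    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step, weights):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```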
Evaluation
sliding window eval
parameters: {"stride":64}
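Sliding-window eval with stride 64 means windows overlap heavily and each window scores only the tokens past the overlap, so every token is evaluated exactly once with up to a full window of left context. A minimal sketch of the window enumeration (assuming the train length of 1024 as the window size, since `eval_length` is null):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Enumerate (begin, end, n_scored) spans for sliding-window eval.

    Each window scores only its last n_scored tokens, so every token
    is scored exactly once. Window size is assumed equal to the
    train_length above; stride=64 comes from the listed parameters.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride costs more forward passes but gives later tokens more context, which typically lowers measured bpb.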
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"iterations":9000}
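With 9000 total iterations and 3500 warmdown iterations, the schedule above holds the LR constant for the first 5500 steps and then decays it to zero. The linear decay shape is the usual warmdown convention and is assumed here; only the step counts come from the listed parameters.

```python
def lr_scale(step, iterations=9000, warmdown_iters=3500):
    """LR multiplier: constant, then linear warmdown to zero.

    iterations and warmdown_iters are from this submission's parameters;
    the linear shape is an assumed convention.
    """
    warmdown_start = iterations - warmdown_iters   # step 5500
    if step < warmdown_start:
        return 1.0
    return max(0.0, (iterations - step) / warmdown_iters)
```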
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Other
other
Score-ranked shard curriculum that reorders training shards using a lightweight scorer so harder shards are seen earlier.
parameters: null
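The shard curriculum described above reduces to a sort: score each shard once with a cheap scorer, then train on shards in descending score order so harder shards come first. The scorer itself is not specified in this card; the `score` callable below is a placeholder for it.

```python
def curriculum_order(shards, score):
    """Reorder shards so higher-scoring ('harder') shards are seen first.

    `score` stands in for the lightweight scorer mentioned above, whose
    definition this card does not spell out.
    """
    return sorted(shards, key=score, reverse=True)
```

Because the reordering happens once before training, it adds a single scoring pass over the data but no per-step overhead.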
Compression
lzma
level: null
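The "int6 + lzma" packing in the contributions list can be sketched with Python's stdlib `lzma`: pack the signed 6-bit values into a contiguous bitstream (6 bits per weight instead of 8), then LZMA-compress the result. The PR's actual container format is not specified here; this shows only the two stages.

```python
import lzma


def pack_int6(values):
    """Pack signed 6-bit ints into a byte stream, then LZMA-compress.

    Minimal sketch of 'int6 + lzma packing'; the submission's real
    container layout is not documented in this card.
    """
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        bits = (bits << 6) | (v & 0x3F)    # 6-bit two's complement
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                              # flush trailing partial byte
        out.append((bits << (8 - nbits)) & 0xFF)
    return lzma.compress(bytes(out), preset=9)
```

Bit-packing alone saves 25% over one byte per weight; LZMA then exploits any remaining redundancy in the quantized values, which is what makes the 16MB artifact cap reachable.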

Novel Contributions

  • Score-ranked shard-order curriculum
  • Tighter final int6 + lzma packing
  • GPTQ cache-backout branch carryover from PR703
  • Single-seed non-record submission under the 16MB cap