PR #1754

open

Add non-record submission: SP8192 baseline + LZMA code-wrap

by upascal
val_bpb
1.0881
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,988,151 bytes

Training Techniques

Architecture
depth recurrence
3-layer recurrence over layers 3–5, enabled at the 50% mark of training
parameters: {"loop_start":3,"loop_end":5,"num_loops":2,"enabled_at":"50% training"}
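A minimal sketch of how the listed depth recurrence could be wired, assuming `loop_start`/`loop_end` are inclusive layer indices (so layers 3, 4, 5 form the 3-layer loop) and that enabling the recurrence simply re-runs those blocks `num_loops` times:

```python
def forward_with_recurrence(x, layers, loop_start=3, loop_end=5,
                            num_loops=2, recurrence_enabled=True):
    """Run layers 0..loop_start-1 once, loop layers loop_start..loop_end
    (inclusive) num_loops times, then run the remaining layers once.
    Index semantics are an assumption, not the submission's exact code."""
    for layer in layers[:loop_start]:
        x = layer(x)
    repeats = num_loops if recurrence_enabled else 1
    for _ in range(repeats):
        for layer in layers[loop_start:loop_end + 1]:
            x = layer(x)
    for layer in layers[loop_end + 1:]:
        x = layer(x)
    return x
```

Before the 50% mark, calling this with `recurrence_enabled=False` reduces to the plain 8-layer forward pass.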
GQA
Grouped-query attention with fewer KV heads than query heads
parameters: {"heads":8,"kv_heads":4}
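A NumPy sketch of grouped-query attention at the listed head counts (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads); the projection shapes and causal mask are illustrative, not the submission's exact code:

```python
import numpy as np

def gqa(x, wq, wk, wv, heads=8, kv_heads=4):
    seq, d = x.shape
    hd = d // heads                       # per-head dimension
    group = heads // kv_heads             # query heads per KV head
    q = (x @ wq).reshape(seq, heads, hd)
    k = (x @ wk).reshape(seq, kv_heads, hd)
    v = (x @ wv).reshape(seq, kv_heads, hd)
    # Each KV head is shared by `group` consecutive query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    causal = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(causal, -np.inf, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v).reshape(seq, d)
```

The KV projections are half the width of the query projection here, which is where GQA's KV-cache saving comes from.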
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"negative_slope":0.5}
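One plausible reading of "LeakyReLU squared" with `negative_slope: 0.5`, sketched below; the sign-preserving square (`y * |y|`) is an assumption, chosen because a plain `y**2` would discard the sign that the 0.5 negative slope keeps:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU, then a sign-preserving square (y * |y|).
    # The squaring convention is an assumption about the submission.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * np.abs(y)
```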
Quantization
GPTQ
bits: 6
scope: attn and mlp weights
GPTQ
bits: 8
scope: embeddings
mixed int6/int8
bits: null
scope: weights and embeddings
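GPTQ proper performs Hessian-aware error compensation column by column; the sketch below shows only the symmetric int6/int8 grids the listed bit-widths imply, using a simple round-to-nearest stand-in (the max-abs scale rule is a placeholder, not the submission's GPTQ code):

```python
import numpy as np

def fake_quant(w, bits, scale):
    # Symmetric grid: int6 -> levels in [-31, 31], int8 -> [-127, 127]
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def quantize_model(weights, embeddings):
    # int6 for attn/mlp weights, int8 for embeddings, per the listing above
    wq = fake_quant(weights, 6, np.abs(weights).max() / 31)
    eq = fake_quant(embeddings, 8, np.abs(embeddings).max() / 127)
    return wq, eq
```

At these settings the worst-case rounding error is half a grid step, which is why the clipping rule below matters: outliers stretch the scale and coarsen the grid for everything else.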
Other
other
SDClip per-row clipping using a std-based scale rule before quantization
parameters: {"int6_k":12.85,"int8_embedding_k":20}
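A sketch of the described SDClip rule, assuming "std-based scale" means each row is clipped to ±k times its own standard deviation (k = 12.85 for the int6 weight rows, 20 for the int8 embedding rows):

```python
import numpy as np

def sdclip(w, k):
    # Clip each row to +/- k * row_std before handing it to the quantizer.
    # The exact scale rule is an assumption based on the description above.
    s = w.std(axis=1, keepdims=True)
    return np.clip(w, -k * s, k * s)
```

With k values this large, the clip only trims extreme per-row outliers, tightening the quantization grid without disturbing typical weights.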
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"backend_steps":5,"momentum_warmup_steps":1500,"adam_on":["scalars","embeds"]}
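Muon's core step orthogonalizes the momentum buffer before applying it; `backend_steps: 5` is consistent with the usual five Newton-Schulz iterations. A sketch using the commonly published quintic coefficients (an assumption about this submission's exact backend):

```python
import numpy as np

def newton_schulz5(g, steps=5, eps=1e-7):
    # Approximately push the singular values of g toward 1 with a
    # quintic Newton-Schulz iteration (coefficients from the public
    # Muon reference implementation; assumed, not confirmed here).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if tall else x
```

Per the listed params, scalar and embedding parameters bypass this path and are handled by Adam instead.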
Weight Averaging
EMA
parameters: {"decay":0.9965}
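The EMA update with the listed decay is one line per parameter; how the averaged copy is consumed downstream (e.g., for the final artifact) is not specified here:

```python
def ema_update(ema, params, decay=0.9965):
    # Exponential moving average of parameters, updated once per step
    return {k: decay * ema[k] + (1.0 - decay) * v for k, v in params.items()}
```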
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.3,"wallclock_cap_s":600}
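Assuming "warmdown" means the usual linear ramp-up / flat / linear ramp-down shape, with the decay covering the final 30% of steps, the schedule can be sketched as below (the 600 s wall-clock cap would truncate `total_steps` and is not modeled here):

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_frac=0.3):
    # Returns the factor applied to the base learning rate at `step`
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:                     # linear warmup
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:    # linear warmdown to 0
        return (total_steps - step) / warmdown_steps
    return 1.0                                  # flat in between
```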
Regularization
weight decay
parameters: {"muon":0.095,"adam":0.02}
Evaluation
sliding window eval
parameters: {"seed":1337}
Compression
lzma
level: null
brotli
level: 11

Novel Contributions

  • Single-seed reproduction of the SP8192 + int6 GPTQ + SDClip stack
  • LZMA code-wrap applied to the training source
  • Validated non-record baseline submission under the 16MB cap
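The LZMA code-wrap from the contributions list amounts to compressing the training source and counting the compressed bytes against the 16MB artifact cap; the preset and packaging below are assumptions, not the submission's exact pipeline:

```python
import lzma

def code_wrap(source: bytes) -> bytes:
    # Compress the training source; the compressed blob is what
    # contributes to the artifact size under the cap.
    return lzma.compress(source, preset=9 | lzma.PRESET_EXTREME)

def unwrap(blob: bytes) -> bytes:
    # Recover the original source verbatim at evaluation time
    return lzma.decompress(blob)
```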