PR #1754

open

Add non-record submission: SP8192 baseline + LZMA code-wrap

by upascal
val_bpb
1.0881
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,988,151 bytes

Training Techniques

Architecture
depth recurrence
3-layer recurrence over layers 3–5, enabled at the 50% mark of training
parameters: {"loop_start":3,"loop_end":5,"num_loops":2,"enabled_at":"50% training"}
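A minimal sketch of how the listed depth recurrence could be wired, assuming `loop_start`/`loop_end` are inclusive layer indices (so layers 3, 4, 5 form the 3-layer loop) and that enabling the recurrence simply re-runs those blocks `num_loops` times:

```python
def forward_with_recurrence(x, layers, loop_start=3, loop_end=5,
                            num_loops=2, recurrence_enabled=True):
    """Run layers 0..loop_start-1 once, loop layers loop_start..loop_end
    (inclusive) num_loops times, then run the remaining layers once.
    Index semantics are an assumption, not the submission's exact code."""
    for layer in layers[:loop_start]:
        x = layer(x)
    repeats = num_loops if recurrence_enabled else 1
    for _ in range(repeats):
        for layer in layers[loop_start:loop_end + 1]:
            x = layer(x)
    for layer in layers[loop_end + 1:]:
        x = layer(x)
    return x
```

Before the 50% mark, calling this with `recurrence_enabled=False` reduces to the plain 8-layer forward pass.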
GQA
Grouped-query attention with fewer KV heads than query heads
parameters: {"heads":8,"kv_heads":4}
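A NumPy sketch of grouped-query attention at the listed head counts (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads); the projection shapes and causal mask are illustrative, not the submission's exact code:

```python
import numpy as np

def gqa(x, wq, wk, wv, heads=8, kv_heads=4):
    seq, d = x.shape
    hd = d // heads                       # per-head dimension
    group = heads // kv_heads             # query heads per KV head
    q = (x @ wq).reshape(seq, heads, hd)
    k = (x @ wk).reshape(seq, kv_heads, hd)
    v = (x @ wv).reshape(seq, kv_heads, hd)
    # Each KV head is shared by `group` consecutive query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    causal = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(causal, -np.inf, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v).reshape(seq, d)
```

The KV projections are half the width of the query projection here, which is where GQA's KV-cache saving comes from.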
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"negative_slope":0.5}
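One plausible reading of "LeakyReLU squared" with `negative_slope: 0.5`, sketched below; the sign-preserving square (`y * |y|`) is an assumption, chosen because a plain `y**2` would discard the sign that the 0.5 negative slope keeps:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU, then a sign-preserving square (y * |y|).
    # The squaring convention is an assumption about the submission.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * np.abs(y)
```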
Quantization
GPTQ
bits: 6
scope: attn and mlp weights
GPTQ
bits: 8
scope: embeddings
mixed int6/int8
bits: null
scope: weights and embeddings
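GPTQ proper performs Hessian-aware error compensation column by column; the sketch below shows only the symmetric int6/int8 grids the listed bit-widths imply, using a simple round-to-nearest stand-in (the max-abs scale rule is a placeholder, not the submission's GPTQ code):

```python
import numpy as np

def fake_quant(w, bits, scale):
    # Symmetric grid: int6 -> levels in [-31, 31], int8 -> [-127, 127]
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def quantize_model(weights, embeddings):
    # int6 for attn/mlp weights, int8 for embeddings, per the listing above
    wq = fake_quant(weights, 6, np.abs(weights).max() / 31)
    eq = fake_quant(embeddings, 8, np.abs(embeddings).max() / 127)
    return wq, eq
```

At these settings the worst-case rounding error is half a grid step, which is why the clipping rule below matters: outliers stretch the scale and coarsen the grid for everything else.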
Other
other
SDClip per-row clipping using a std-based scale rule before quantization
parameters: {"int6_k":12.85,"int8_embedding_k":20}
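A sketch of the described SDClip rule, assuming "std-based scale" means each row is clipped to ±k times its own standard deviation (k = 12.85 for the int6 weight rows, 20 for the int8 embedding rows):

```python
import numpy as np

def sdclip(w, k):
    # Clip each row to +/- k * row_std before handing it to the quantizer.
    # The exact scale rule is an assumption based on the description above.
    s = w.std(axis=1, keepdims=True)
    return np.clip(w, -k * s, k * s)
```

With k values this large, the clip only trims extreme per-row outliers, tightening the quantization grid without disturbing typical weights.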
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"backend_steps":5,"momentum_warmup_steps":1500,"adam_on":["scalars","embeds"]}
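Muon's core step orthogonalizes the momentum buffer before applying it; `backend_steps: 5` is consistent with the usual five Newton-Schulz iterations. A sketch using the commonly published quintic coefficients (an assumption about this submission's exact backend):

```python
import numpy as np

def newton_schulz5(g, steps=5, eps=1e-7):
    # Approximately push the singular values of g toward 1 with a
    # quintic Newton-Schulz iteration (coefficients from the public
    # Muon reference implementation; assumed, not confirmed here).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if tall else x
```

Per the listed params, scalar and embedding parameters bypass this path and are handled by Adam instead.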
Weight Averaging
EMA
parameters: {"decay":0.9965}
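The EMA update with the listed decay is one line per parameter; how the averaged copy is consumed downstream (e.g., for the final artifact) is not specified here:

```python
def ema_update(ema, params, decay=0.9965):
    # Exponential moving average of parameters, updated once per step
    return {k: decay * ema[k] + (1.0 - decay) * v for k, v in params.items()}
```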
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.3,"wallclock_cap_s":600}
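Assuming "warmdown" means the usual linear ramp-up / flat / linear ramp-down shape, with the decay covering the final 30% of steps, the schedule can be sketched as below (the 600 s wall-clock cap would truncate `total_steps` and is not modeled here):

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_frac=0.3):
    # Returns the factor applied to the base learning rate at `step`
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:                     # linear warmup
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:    # linear warmdown to 0
        return (total_steps - step) / warmdown_steps
    return 1.0                                  # flat in between
```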
Regularization
weight decay
parameters: {"muon":0.095,"adam":0.02}
Evaluation
sliding window eval
parameters: {"seed":1337}
Compression
lzma
level: null
brotli
level: 11

Novel Contributions

  • Single-seed reproduction of the SP8192 + int6 GPTQ + SDClip stack
  • LZMA code-wrap applied to the training source
  • Validated non-record baseline submission under the 16MB cap
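The LZMA code-wrap from the contributions list amounts to compressing the training source and counting the compressed bytes against the 16MB artifact cap; the preset and packaging below are assumptions, not the submission's exact pipeline:

```python
import lzma

def code_wrap(source: bytes) -> bytes:
    # Compress the training source; the compressed blob is what
    # contributes to the artifact size under the cap.
    return lzma.compress(source, preset=9 | lzma.PRESET_EXTREME)

def unwrap(blob: bytes) -> bytes:
    # Recover the original source verbatim at evaluation time
    return lzma.decompress(blob)
```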