PR #2062

open

add non-record submission; 2026-04-30_SP8192_GPTQ-Embeddings_SDClip_Loop45x2_PLE_20min

by BumaldaOverTheWater94
val_bpb: 1.2195
Architecture: Transformer
Optimizer: Muon
Artifact Size: 20,886,863 bytes

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: matrix weights)
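For orientation, a minimal sketch of what 6-bit quantization of the matrix weights stores. This uses plain per-row round-to-nearest rather than GPTQ's column-by-column Hessian-weighted error compensation, so it only illustrates the storage format; all shapes are hypothetical.

```python
import torch

def quantize_rows_6bit(w: torch.Tensor):
    # Symmetric per-row 6-bit quantization (round-to-nearest).
    # Real GPTQ additionally folds each rounding error back into the
    # not-yet-quantized columns using second-order (Hessian) statistics.
    qmax = 2 ** 5 - 1                                       # 31 for signed 6-bit
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-12)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale                                         # 6-bit codes (held in int8) + fp scales

w = torch.randn(8192, 768)                                  # hypothetical matrix shape
q, scale = quantize_rows_6bit(w)
w_hat = q.float() * scale                                   # dequantize for inference
```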
Architecture
  • weight tying: tied input and output embeddings (parameters: null)
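Weight tying itself is one line in PyTorch; the dimensions below are hypothetical:

```python
import torch.nn as nn

d_model, vocab = 768, 50304                  # hypothetical sizes
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)
lm_head.weight = embed.weight                # one tensor serves both input and output
```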
  • depth recurrence: looped/recurrent block execution with looping enabled (parameters: {"loops": 2, "loop_start": 4, "loop_end": 5})
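A minimal sketch of one reading of these parameters: blocks 4 and 5 (0-indexed) are executed twice per forward pass with shared weights. Whether `loops` counts total or extra passes is an assumption.

```python
import torch.nn as nn

class LoopedStack(nn.Module):
    def __init__(self, blocks, loops=2, loop_start=4, loop_end=5):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.loops, self.start, self.end = loops, loop_start, loop_end

    def forward(self, x):
        for blk in self.blocks[: self.start]:
            x = blk(x)
        for _ in range(self.loops):                  # weight-shared extra depth
            for blk in self.blocks[self.start : self.end + 1]:
                x = blk(x)
        for blk in self.blocks[self.end + 1 :]:
            x = blk(x)
        return x
```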
  • KV head count: grouped-query-style attention with fewer KV heads than attention heads (parameters: {"heads": 8, "kv_heads": 4})
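With 8 query heads over 4 KV heads, every pair of query heads shares one KV head. A minimal sketch, with the tensor layout assumed:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    # q: (B, 8, T, head_dim); k, v: (B, 4, T, head_dim)
    group = q.size(1) // k.size(1)               # 2 query heads per KV head
    k = k.repeat_interleave(group, dim=1)        # broadcast KV to all query heads
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```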
  • PLE: per-layer embeddings with gated injection into the model (parameters: {"per_layer_embed_dim": 64, "per_layer_embed_init_std": 0.02})
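The submission's exact wiring is not shown here; the sketch below follows the description (token-side table, model-side per-layer projection, gated injection), with the gate form and shapes as assumptions:

```python
import torch
import torch.nn as nn

class PLE(nn.Module):
    """Per-layer embeddings: a token-side table plus per-layer model-side
    projections, injected through a learned gate. Gate form and shapes are
    assumptions based on the description, not the submission's code."""
    def __init__(self, vocab, n_layers, d_model, ple_dim=64, init_std=0.02):
        super().__init__()
        self.table = nn.Embedding(vocab, n_layers * ple_dim)        # token side
        nn.init.normal_(self.table.weight, std=init_std)
        self.proj = nn.ModuleList(
            nn.Linear(ple_dim, d_model, bias=False) for _ in range(n_layers))
        self.gate = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.n_layers, self.ple_dim = n_layers, ple_dim

    def inject(self, x, tokens, layer):
        # Called after a block's attention and MLP updates.
        e = self.table(tokens).view(*tokens.shape, self.n_layers, self.ple_dim)
        e = e[..., layer, :]                                         # (B, T, ple_dim)
        return x + torch.sigmoid(self.gate[layer](x)) * self.proj[layer](e)
```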
Weight Averaging
  • EMA (parameters: {"decay": 0.997})
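EMA weight averaging with decay 0.997 keeps a shadow copy of the parameters, updated as follows (a standard formulation, not the submission's code):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.997):
    for e, p in zip(ema_params, params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)   # e = decay*e + (1-decay)*p
```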
Optimizer
  • Muon (weight_decay: 0.085, momentum: null, other_params: {"row_normalize": 1})
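Muon applies a momentum update to each 2D weight matrix and approximately orthogonalizes it with a Newton-Schulz iteration before stepping. A minimal sketch following the public reference implementation; momentum is recorded as null above, so 0.95 below is just the common default, and the `row_normalize` variant is not reproduced:

```python
import torch

def newton_schulz5(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that drives G toward an orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    if X.size(0) > X.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if G.size(0) > G.size(1) else X

@torch.no_grad()
def muon_step(p, buf, lr, momentum=0.95, weight_decay=0.085):
    buf.mul_(momentum).add_(p.grad)              # momentum accumulation
    p.mul_(1.0 - lr * weight_decay)              # decoupled weight decay
    p.add_(newton_schulz5(buf), alpha=-lr)       # shape-based LR scaling omitted
```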
Regularization
  • weight decay (parameters: {"muon_wd": 0.085, "embed_wd": 0.085})
  • logit softcap (parameters: {"matrix_clip_sigmas": 12.85, "embed_clip_sigmas": 20})
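Despite the "logit softcap" label, the parameter names read as sigma-based weight clipping thresholds (clamp each tensor to k standard deviations around its mean, which tightens the quantization range). A sketch of that reading, which is an interpretation rather than the submission's confirmed semantics:

```python
import torch

def clip_sigmas(w: torch.Tensor, k: float) -> torch.Tensor:
    # Clamp values to within k standard deviations of the mean.
    mu, sigma = w.mean(), w.std()
    return w.clamp(mu - k * sigma, mu + k * sigma)

# matrix weights clipped at 12.85 sigmas, embeddings at 20 sigmas:
# w.copy_(clip_sigmas(w, 12.85))
```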
Compression
  • brotli (level: null)
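The artifact is brotli-compressed. Since the level is recorded as null, the quality setting below is just the library's maximum, and the filename is hypothetical:

```python
import brotli

with open("artifact.bin", "rb") as f:
    raw = f.read()
packed = brotli.compress(raw, quality=11)      # level not recorded in the submission
print(f"{len(raw)} -> {len(packed)} bytes")
```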
Evaluation
  • stride-based eval (parameters: {"stride": 64})
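Stride-based eval scores the held-out stream in overlapping windows: the window advances 64 tokens per step and only the newly covered tokens are scored, so each prediction keeps near-full left context. A sketch with the model interface assumed (divide by byte count rather than token count to get bits per byte):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def strided_eval(model, ids, ctx_len=2048, stride=64):
    total_nll, total_tok = 0.0, 0
    for end in range(ctx_len, ids.size(0), stride):
        window = ids[end - ctx_len : end + 1]           # ctx_len + 1 tokens
        logits = model(window[:-1].unsqueeze(0))[0]     # (ctx_len, vocab) assumed
        nll = F.cross_entropy(logits[-stride:], window[-stride:], reduction="sum")
        total_nll += nll.item()
        total_tok += stride
    return total_nll / math.log(2) / total_tok          # bits per token
```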
Sequence Length
  • train_length: 2048, eval_length: 2048

Novel Contributions

  • Adds per-layer embeddings (PLE) to the SP8192 GPTQ embeddings + SDClip + Loop45x2 stack
  • Uses a learned token-side per-layer embedding table and a model-side per-layer projection
  • Applies per-block gated PLE injection after the attention and MLP updates
  • Exports rowwise int8 weights for the per-layer embedding table (see the sketch after this list)
  • Documents a non-record run that exceeds both the 10-minute training limit and the 16MB artifact cap
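A minimal sketch of the rowwise int8 export mentioned above: each row of the PLE table gets its own scale, so the table ships as int8 codes plus one float per row. The details are assumptions, not the submission's exact export code.

```python
import torch

def export_rowwise_int8(table: torch.Tensor):
    # One scale per row keeps outlier rows from inflating every row's error.
    scale = (table.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-12)
    q = torch.clamp(torch.round(table / scale), -127, 127).to(torch.int8)
    return q, scale.squeeze(1)        # int8 codes + per-row fp scales
```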