PR #2062

open

add non-record submission; 2026-04-30_SP8192_GPTQ-Embeddings_SDClip_Loop45x2_PLE_20min

by BumaldaOverTheWater94
val_bpb: 1.2195
Architecture: Transformer
Optimizer: Muon
Artifact Size: 20,886,863 bytes

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: matrix weights)
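For orientation, a minimal sketch of what 6-bit quantization of the matrix weights stores. This uses plain per-row round-to-nearest rather than GPTQ's column-by-column Hessian-weighted error compensation, so it only illustrates the storage format; all shapes are hypothetical.

```python
import torch

def quantize_rows_6bit(w: torch.Tensor):
    # Symmetric per-row 6-bit quantization (round-to-nearest).
    # Real GPTQ additionally folds each rounding error back into the
    # not-yet-quantized columns using second-order (Hessian) statistics.
    qmax = 2 ** 5 - 1                                       # 31 for signed 6-bit
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-12)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale                                         # 6-bit codes (held in int8) + fp scales

w = torch.randn(8192, 768)                                  # hypothetical matrix shape
q, scale = quantize_rows_6bit(w)
w_hat = q.float() * scale                                   # dequantize for inference
```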
Architecture
  • weight tying: tied input and output embeddings (parameters: null)
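Weight tying itself is one line in PyTorch; the dimensions below are hypothetical:

```python
import torch.nn as nn

d_model, vocab = 768, 50304                  # hypothetical sizes
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)
lm_head.weight = embed.weight                # one tensor serves both input and output
```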
  • depth recurrence: looped/recurrent block execution with looping enabled (parameters: {"loops": 2, "loop_start": 4, "loop_end": 5})
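A minimal sketch of one reading of these parameters: blocks 4 and 5 (0-indexed) are executed twice per forward pass with shared weights. Whether `loops` counts total or extra passes is an assumption.

```python
import torch.nn as nn

class LoopedStack(nn.Module):
    def __init__(self, blocks, loops=2, loop_start=4, loop_end=5):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.loops, self.start, self.end = loops, loop_start, loop_end

    def forward(self, x):
        for blk in self.blocks[: self.start]:
            x = blk(x)
        for _ in range(self.loops):                  # weight-shared extra depth
            for blk in self.blocks[self.start : self.end + 1]:
                x = blk(x)
        for blk in self.blocks[self.end + 1 :]:
            x = blk(x)
        return x
```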
  • KV head count: grouped-query-style attention with fewer KV heads than attention heads (parameters: {"heads": 8, "kv_heads": 4})
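With 8 query heads over 4 KV heads, every pair of query heads shares one KV head. A minimal sketch, with the tensor layout assumed:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    # q: (B, 8, T, head_dim); k, v: (B, 4, T, head_dim)
    group = q.size(1) // k.size(1)               # 2 query heads per KV head
    k = k.repeat_interleave(group, dim=1)        # broadcast KV to all query heads
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```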
  • PLE: per-layer embeddings with gated injection into the model (parameters: {"per_layer_embed_dim": 64, "per_layer_embed_init_std": 0.02})
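The submission's exact wiring is not shown here; the sketch below follows the description (token-side table, model-side per-layer projection, gated injection), with the gate form and shapes as assumptions:

```python
import torch
import torch.nn as nn

class PLE(nn.Module):
    """Per-layer embeddings: a token-side table plus per-layer model-side
    projections, injected through a learned gate. Gate form and shapes are
    assumptions based on the description, not the submission's code."""
    def __init__(self, vocab, n_layers, d_model, ple_dim=64, init_std=0.02):
        super().__init__()
        self.table = nn.Embedding(vocab, n_layers * ple_dim)        # token side
        nn.init.normal_(self.table.weight, std=init_std)
        self.proj = nn.ModuleList(
            nn.Linear(ple_dim, d_model, bias=False) for _ in range(n_layers))
        self.gate = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.n_layers, self.ple_dim = n_layers, ple_dim

    def inject(self, x, tokens, layer):
        # Called after a block's attention and MLP updates.
        e = self.table(tokens).view(*tokens.shape, self.n_layers, self.ple_dim)
        e = e[..., layer, :]                                         # (B, T, ple_dim)
        return x + torch.sigmoid(self.gate[layer](x)) * self.proj[layer](e)
```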
Weight Averaging
  • EMA (parameters: {"decay": 0.997})
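EMA weight averaging with decay 0.997 keeps a shadow copy of the parameters, updated as follows (a standard formulation, not the submission's code):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.997):
    for e, p in zip(ema_params, params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)   # e = decay*e + (1-decay)*p
```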
Optimizer
  • Muon (weight_decay: 0.085, momentum: null, other_params: {"row_normalize": 1})
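Muon applies a momentum update to each 2D weight matrix and approximately orthogonalizes it with a Newton-Schulz iteration before stepping. A minimal sketch following the public reference implementation; momentum is recorded as null above, so 0.95 below is just the common default, and the `row_normalize` variant is not reproduced:

```python
import torch

def newton_schulz5(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that drives G toward an orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    if X.size(0) > X.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if G.size(0) > G.size(1) else X

@torch.no_grad()
def muon_step(p, buf, lr, momentum=0.95, weight_decay=0.085):
    buf.mul_(momentum).add_(p.grad)              # momentum accumulation
    p.mul_(1.0 - lr * weight_decay)              # decoupled weight decay
    p.add_(newton_schulz5(buf), alpha=-lr)       # shape-based LR scaling omitted
```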
Regularization
  • weight decay (parameters: {"muon_wd": 0.085, "embed_wd": 0.085})
  • logit softcap (parameters: {"matrix_clip_sigmas": 12.85, "embed_clip_sigmas": 20})
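Despite the "logit softcap" label, the parameter names read as sigma-based weight clipping thresholds (clamp each tensor to k standard deviations around its mean, which tightens the quantization range). A sketch of that reading, which is an interpretation rather than the submission's confirmed semantics:

```python
import torch

def clip_sigmas(w: torch.Tensor, k: float) -> torch.Tensor:
    # Clamp values to within k standard deviations of the mean.
    mu, sigma = w.mean(), w.std()
    return w.clamp(mu - k * sigma, mu + k * sigma)

# matrix weights clipped at 12.85 sigmas, embeddings at 20 sigmas:
# w.copy_(clip_sigmas(w, 12.85))
```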
Compression
  • brotli (level: null)
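The artifact is brotli-compressed. Since the level is recorded as null, the quality setting below is just the library's maximum, and the filename is hypothetical:

```python
import brotli

with open("artifact.bin", "rb") as f:
    raw = f.read()
packed = brotli.compress(raw, quality=11)      # level not recorded in the submission
print(f"{len(raw)} -> {len(packed)} bytes")
```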
Evaluation
  • stride-based eval (parameters: {"stride": 64})
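Stride-based eval scores the held-out stream in overlapping windows: the window advances 64 tokens per step and only the newly covered tokens are scored, so each prediction keeps near-full left context. A sketch with the model interface assumed (divide by byte count rather than token count to get bits per byte):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def strided_eval(model, ids, ctx_len=2048, stride=64):
    total_nll, total_tok = 0.0, 0
    for end in range(ctx_len, ids.size(0), stride):
        window = ids[end - ctx_len : end + 1]           # ctx_len + 1 tokens
        logits = model(window[:-1].unsqueeze(0))[0]     # (ctx_len, vocab) assumed
        nll = F.cross_entropy(logits[-stride:], window[-stride:], reduction="sum")
        total_nll += nll.item()
        total_tok += stride
    return total_nll / math.log(2) / total_tok          # bits per token
```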
Sequence Length
  • train_length: 2048, eval_length: 2048

Novel Contributions

  • Adds per-layer embeddings (PLE) to the SP8192 GPTQ embeddings + SDClip + Loop45x2 stack
  • Uses a learned token-side per-layer embedding table and a model-side per-layer projection
  • Applies per-block gated PLE injection after the attention and MLP updates
  • Exports rowwise int8 weights for the per-layer embedding table (see the sketch after this list)
  • Documents a non-record run that exceeds both the 10-minute training limit and the 16MB artifact cap
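A minimal sketch of the rowwise int8 export mentioned above: each row of the PLE table gets its own scale, so the table ships as int8 codes plus one float per row. The details are assumptions, not the submission's exact export code.

```python
import torch

def export_rowwise_int8(table: torch.Tensor):
    # One scale per row keeps outlier rows from inflating every row's error.
    scale = (table.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-12)
    q = torch.clamp(torch.round(table / scale), -127, 127).to(torch.int8)
    return q, scale.squeeze(1)        # int8 codes + per-row fp scales
```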