PR #1720
Notable: SP8192 + 3-Layer Recurrence + Parallel Residuals - 5-Seed Quantization Reference and SDClip Ablations
by kiyoaki
val_bpb
1.0818
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,989,432 bytes
Training Techniques
Architecture
depth recurrence
3-layer recurrence loops layers 3-5, yielding 17 virtual layers from 11 physical layers.
parameters: {"layers":3,"virtual_layers":17,"physical_layers":11}
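The virtual-layer count follows from running the 3-layer block three times: 11 + (3 - 1) x 3 = 17. A minimal sketch of how such a schedule could be laid out (the 0-indexed loop position and the helper name are illustrative assumptions, not the PR's actual code):

```python
def virtual_layer_schedule(n_physical=11, loop_start=3, loop_len=3, n_loops=3):
    """Return the sequence of physical layer indices executed per forward pass.

    Hypothetical sketch: layers [loop_start, loop_start + loop_len) form the
    recurrent block and are applied n_loops times, so 11 physical layers
    expand to 11 + (n_loops - 1) * loop_len = 17 virtual layers.
    """
    loop_block = list(range(loop_start, loop_start + loop_len))
    return (
        list(range(0, loop_start))                         # layers before the loop
        + loop_block * n_loops                             # recurrent block, repeated
        + list(range(loop_start + loop_len, n_physical))   # layers after the loop
    )
```

The forward pass would then iterate over this schedule, reusing the same weights each time a physical index repeats.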
parallel residuals
GPT-J-style parallel residuals in which attention and MLP read the same pre-residual input, applied from layer 7 onward.
parameters: {"start_layer":7}
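The difference from a standard pre-norm block can be sketched in a few lines (normalization omitted for brevity; the function names are illustrative):

```python
def sequential_block(x, attn, mlp):
    # Standard block: the MLP sees the attention-updated residual stream.
    x = x + attn(x)
    x = x + mlp(x)
    return x

def parallel_block(x, attn, mlp):
    # GPT-J-style parallel residual: attention and MLP both read the same
    # pre-residual input, and their outputs are summed in one residual add.
    return x + attn(x) + mlp(x)
```

In the parallel form the two sub-blocks can be computed concurrently, at the cost of the MLP no longer conditioning on the attention output within the same layer; per this record, the switch happens only from layer 7 on.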
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
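A NumPy sketch of rotating only the first 16 of 64 head dimensions (the half-split pairing and the base frequency are common conventions, assumed here rather than taken from the PR):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of the head dimension,
    leaving the remaining dims untouched (the "16/64" split in this record).

    x:   (seq, head_dim) array;  pos: (seq,) integer positions.
    """
    d = rot_dims // 2
    inv_freq = base ** (-np.arange(d) / d)        # (d,) per-pair frequencies
    angles = pos[:, None] * inv_freq[None, :]     # (seq, d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :d], x[..., d:rot_dims]       # paired halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

Dimensions beyond the rotated subset pass through unchanged, which is the point of the partial variant: positional signal on a few dims, content-only channels on the rest.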
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
weight tying
Tied input and output embeddings.
parameters: null
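Weight tying reuses the input-embedding matrix as the output projection, which matters under a tight artifact-size cap since it halves the embedding parameter count. A minimal NumPy sketch (class and method names are illustrative):

```python
import numpy as np

class TiedLM:
    """Minimal weight-tying sketch: the output head reuses the input
    embedding matrix (transposed), so it is stored only once."""

    def __init__(self, vocab, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(vocab, dim))

    def encode(self, ids):
        return self.embed[ids]           # (n, dim) token embeddings

    def logits(self, h):
        return h @ self.embed.T          # tied head: same matrix, transposed
```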
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.026,"warmdown_frac":0.75}
Weight Averaging
EMA
parameters: {"decay":0.9965}
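The EMA keeps a shadow copy of the weights updated each step with decay 0.9965; the shadow copy is what gets evaluated and shipped. A per-step update sketch:

```python
def ema_update(ema_params, params, decay=0.9965):
    """One EMA step over flat parameter lists: the shadow weights move a
    fraction (1 - decay) toward the current training weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```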
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
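A sketch of the schedule, assuming warmdown_frac = 0.75 means the final 75% of training linearly decays the LR to zero after a constant phase (the exact interpretation is an assumption):

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.75):
    """Constant LR, then linear decay to zero over the final
    warmdown_frac of training."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return base_lr * remaining
```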
Quantization
GPTQ
bits: 6
scope: MLP and attention matrices
mixed int6/int8
bits: null
scope: int6 for attention and MLP, int8 for embeddings
GPTQ
bits: 8
scope: token embeddings
Evaluation
sliding window eval
parameters: {"no_ttt":true}
Compression
Brotli
level: 11
Novel Contributions
- 5-seed near-frontier reference for the SP8192 / 3-layer recurrence / parallel residual stack
- Empirical observation that post-quantization val_bpb is lower than pre-quantization across all 5 seeds
- Documented SDClip ablation sweep showing quality/size tradeoffs and a negative-result clipping variant
- Reference implementation under the 16 MB decimal artifact cap with all five seeds
- Canonical SP8192 UTF-8 byte-accounting validation with no TTT