val_bpb: 1.0849
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,962,961 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: MLP, attention, embeddings
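GPTQ solves a per-layer least-squares rounding problem and needs calibration data, so it is too long to inline; as a minimal stand-in, the sketch below shows outlier-aware 6-bit round-to-nearest quantization with a per-category clip sigma, in the spirit of the recipe's "per-layer GPTQ clip sigmas". Function names and the sigma values are illustrative, not taken from the card.

```python
import torch

def quantize_6bit(w: torch.Tensor, clip_sigma: float = 4.0) -> torch.Tensor:
    """Symmetric per-output-channel 6-bit fake-quantization with sigma clipping.

    Round-to-nearest stand-in for GPTQ: clips outliers at clip_sigma standard
    deviations per channel, then maps onto the signed 6-bit grid [-31, 31].
    """
    qmax = 2 ** (6 - 1) - 1                                   # 31 levels per side
    std = w.std(dim=1, keepdim=True)
    clipped = w.clamp(-clip_sigma * std, clip_sigma * std)
    scale = clipped.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(clipped / scale).clamp(-qmax, qmax)
    return q * scale                                          # dequantized weights

# Hypothetical per-category clip sigmas (values illustrative, not from the recipe)
CLIP_SIGMA = {"mlp": 4.5, "attention": 4.0, "embeddings": 3.5}
```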
Architecture
weight tying
Tied input and output embeddings.
parameters: null
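A minimal sketch of the standard tying pattern (class name illustrative): the output projection reuses the embedding matrix, so the artifact stores one vocab-sized tensor instead of two.

```python
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Input embedding and output projection share a single weight matrix."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # tying: one parameter tensor
```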
Partial RoPE
Uses rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":"16/64"}
depth recurrence
Reuses layers in a recurrent encoder/decoder pattern.
parameters: {"layers":"3-5"}
ReLU²
Uses squared ReLU activation in the MLP.
parameters: null
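A minimal squared-ReLU feed-forward block (hidden width and bias choices illustrative):

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """Feed-forward block with squared-ReLU activation: relu(x) ** 2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)
    def forward(self, x):
        return self.down(torch.relu(self.up(x)) ** 2)
```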
XSA
Applies XSA in the last layers.
parameters: {"layers":11}
Gated Attention
Adds per-head gating of the attention output; the gate parameter is zero-initialized, so the gate value starts at 1 and the module begins as the identity.
parameters: {"gate":"2σ(attn_gate)"}
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
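The card gives no parameters for this entry; one common reading is a learnable per-layer scalar applied to each block's normalization output, sketched below as an assumption rather than the recipe's actual mechanism.

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm followed by a per-layer learnable scalar (one possible reading)."""
    def __init__(self, d_model: int, init_scale: float = 1.0):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.scale = nn.Parameter(torch.tensor(init_scale))
    def forward(self, x):
        return self.scale * self.norm(x)
```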
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start_momentum":0.92}
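The momentum values imply a warmup from 0.92 at the start of training to the final 0.99; the linear shape and warmup length below are assumptions, since the card only gives the two endpoints.

```python
def muon_momentum(step: int, warmup_steps: int,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Warms Muon's momentum from its start value (0.92) to its final value (0.99)."""
    t = min(step / max(warmup_steps, 1), 1.0)
    return start + t * (end - start)
```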
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings + scalars"}
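The AdamW scope ("embeddings + scalars") implies a two-way parameter split, with Muon taking the hidden weight matrices. The split rule below is an assumption consistent with that scope, not the recipe's exact code.

```python
def split_param_groups(model):
    """Muon gets 2-D hidden weight matrices; AdamW gets embeddings and scalars."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            muon_params.append(p)        # hidden weight matrices -> Muon
        else:
            adamw_params.append(p)       # embeddings + scalar params -> AdamW
    return muon_params, adamw_params
```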
Test-Time Training
score-first TTT
parameters: {"freeze_blocks":0,"learning_rate":0.01,"epochs":5,"momentum":0.9}
LR Schedule
cosine decay
parameters: null
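The standard cosine-decay schedule, for reference (base and floor learning rates are not given by the card):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr over total_steps."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```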
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Eval wall-clock budget guard: truncates test-time adaptation when the projected runtime would exceed the 600 s limit, while scoring continues to completion.
parameters: {"max_eval_seconds":600,"warmup_chunks":5}
Novel Contributions
- Per-layer GPTQ clip sigmas tuned separately for MLP, attention, and embeddings
- Unfrozen score-first TTT with all non-embedding blocks adapting
- Evaluation wall-clock budget guard that truncates adaptation while preserving scoring legality
- Recipe built on SP8192 base architecture with per-category outlier-aware quantization