PR #1978 (open)
Non-record: GolfParty — composable scaffolding for every Requests-for-PRs item
by EthanYangTW
val_bpb: 1.0778
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.00 MB
Training Techniques
Architecture
depth recurrence
Universal Transformer-style extra recurrent cycles added on top of the base model.
parameters: {"layers":null,"cycles":"KS_UT_DEPTH"}
LeakyReLU
LeakyReLU-squared MLP activation used in fused megakernel path.
parameters: {"squared":true}
SmearGate
Causal content-gated residual connection, zero-initialized so it is transparent (an identity pass-through) at the start of training.
parameters: null
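One way to realize a causal, content-gated, zero-init-transparent smear, assuming the gate blends each position with its predecessor's residual; the PR's exact gating form may differ:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Sketch: mix in the previous token, weighted by a content-dependent gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1, bias=False)      # content gate
        self.scale = nn.Parameter(torch.zeros(1))      # zero-init -> transparent

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        # Causal shift: position t sees position t-1; position 0 sees zeros.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate(x))                # (B, T, 1)
        return x + self.scale * g * prev               # exact identity at init
```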
Gated Attention
Per-head sigmoid gate on attention output.
parameters: {"width":12}
XSA
Cross/self-attention variant used across all layers.
parameters: {"layers":11}
GQA
Grouped query attention configuration.
parameters: {"query_heads":8,"kv_heads":4}
weight tying
Hash-embedding / tied-embedding-style parameter sharing, implied by the submission's architecture notes.
parameters: null
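Tied embeddings in their simplest form, with illustrative sizes (the hash-embedding variant would share parameters through a hashing scheme instead):

```python
import torch.nn as nn

vocab, dim = 50304, 768  # illustrative sizes, not from the PR
embed = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab, bias=False)
lm_head.weight = embed.weight  # one (vocab, dim) tensor shared by both modules
```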
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
mixed int6/int8
bits: null
scope: model
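A hedged sketch of the int8 embedding path above, using symmetric per-tensor quantization; the 6-bit GPTQ weight path additionally relies on calibration data and error compensation, which this sketch omits:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization (stand-in for the embedding path)."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 768)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # reconstruction used at load time
```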
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
AdamW
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
Evaluation
long context eval
parameters: {"context_length":3072}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.01,"epochs":4}
Sequence Length
sequence_length
train_length: null
eval_length: 3072
LR Schedule
warmdown
parameters: {"warmdown_percent":66.7}
Regularization
logit softcap
parameters: {"value":30}
Initialization
OrthoInit
Frozen orthonormal A matrix for random linear adapters (sketched under Other below).
Other
other
Text diffusion via training-time embedding-noise auxiliary objective.
parameters: {"env_var":"KS_DIFFUSION_FRAC"}
other
Random linear adapters with frozen orthonormal A and learnable B.
parameters: {"env_var":"TTT_RLA_ENABLED"}
Novel Contributions
- Composable scaffolding for multiple Requests-for-PRs items on top of PR #1953
- Universal Transformer-style depth recurrence toggle
- Megakernel exposure for fused LeakyReLU² MLP and softcapped CE Triton kernels
- Long-context evaluation at 3072 tokens
- Random linear adapters with frozen orthonormal A and learnable B
- Text diffusion auxiliary training via embedding noise
- Wired-but-blocked hooks for E2E TTT and JEPA
- Stubbed scaffolding for SSM and H-net tokenization
- Default configuration remains byte-identical to PR #1953