PR #1978 (open)
Non-record: GolfParty — composable scaffolding for every Requests-for-PRs item
by EthanYangTW
val_bpb: 1.0778
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.00 MB
Training Techniques
Architecture
depth recurrence
Universal Transformer-style extra recurrent cycles added on top of the base model.
parameters: {"layers":null,"cycles":"KS_UT_DEPTH"}
LeakyReLU
LeakyReLU-squared MLP activation used in fused megakernel path.
parameters: {"squared":true}
SmearGate
Causal content-gated residual connection, zero-initialized so it is transparent (an identity pass-through) at the start of training.
parameters: null
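One way to realize a causal, content-gated, zero-init-transparent smear, assuming the gate blends each position with its predecessor's residual; the PR's exact gating form may differ:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Sketch: mix in the previous token, weighted by a content-dependent gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1, bias=False)      # content gate
        self.scale = nn.Parameter(torch.zeros(1))      # zero-init -> transparent

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        # Causal shift: position t sees position t-1; position 0 sees zeros.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate(x))                # (B, T, 1)
        return x + self.scale * g * prev               # exact identity at init
```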
Gated Attention
Per-head sigmoid gate on attention output.
parameters: {"width":12}
XSA
Cross/self-attention variant used across all layers.
parameters: {"layers":11}
GQA
Grouped query attention configuration.
parameters: {"query_heads":8,"kv_heads":4}
weight tying
Hash-embedding / tied-embedding-style parameter sharing, implied by the submission's architecture notes.
parameters: null
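Tied embeddings in their simplest form, with illustrative sizes (the hash-embedding variant would share parameters through a hashing scheme instead):

```python
import torch.nn as nn

vocab, dim = 50304, 768  # illustrative sizes, not from the PR
embed = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab, bias=False)
lm_head.weight = embed.weight  # one (vocab, dim) tensor shared by both modules
```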
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
mixed int6/int8
bits: null
scope: model
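A hedged sketch of the int8 embedding path above, using symmetric per-tensor quantization; the 6-bit GPTQ weight path additionally relies on calibration data and error compensation, which this sketch omits:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization (stand-in for the embedding path)."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 768)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # reconstruction used at load time
```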
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
AdamW
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
Evaluation
long context eval
parameters: {"context_length":3072}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.01,"epochs":4}
Sequence Length
sequence_length
train_length: null
eval_length: 3072
LR Schedule
warmdown
parameters: {"warmdown_percent":66.7}
Regularization
logit softcap
parameters: {"value":30}
Initialization
OrthoInit
Frozen orthonormal A matrix for random linear adapters (sketched under Other below).
Other
other
Text diffusion via training-time embedding-noise auxiliary objective.
parameters: {"env_var":"KS_DIFFUSION_FRAC"}
other
Random linear adapters with frozen orthonormal A and learnable B.
parameters: {"env_var":"TTT_RLA_ENABLED"}
Novel Contributions
- Composable scaffolding for multiple Requests-for-PRs items on top of PR #1953
- Universal Transformer-style depth recurrence toggle
- Megakernel exposure for fused LeakyReLU² MLP and softcapped CE Triton kernels
- Long-context evaluation at 3072 tokens
- Random linear adapters with frozen orthonormal A and learnable B
- Text diffusion auxiliary training via embedding noise
- Wired-but-blocked hooks for E2E TTT and JEPA
- Stubbed scaffolding for SSM and H-net tokenization
- Default configuration remains byte-identical to PR #1953