PR #2031

open

Record support: canonical top-stack reproduction - val_bpb 1.05985

by deborahnelson8788726
val_bpb
1.0599
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,898,155 bytes

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
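A minimal NumPy sketch of the grouped-query attention pattern above: 8 query heads share 4 KV heads, so each KV head serves 2 query heads. Dimensions, projection shapes, and the causal mask are illustrative, not taken from the submission.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Causal grouped-query attention: each query head reads from a shared KV head."""
    T, d = x.shape
    hd = d // n_heads                      # per-head dimension
    group = n_heads // n_kv_heads          # query heads per KV head (here 2)
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                    # map query head -> shared KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)
```

With 4 KV heads the KV cache is half the size it would be under full multi-head attention at 8 heads.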
Partial RoPE
Uses partial rotary positional embeddings with YaRN.
parameters: null
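The "partial" in partial RoPE means only a fraction of each head's dimensions are rotated; the rest pass through unrotated. A sketch under assumptions: the rotary fraction (0.5) and base (10000) are not stated in the PR, and the YaRN frequency interpolation is omitted for brevity.

```python
import numpy as np

def partial_rope(x, rotary_frac=0.5, base=10000.0):
    """Rotate the first rotary_frac of the head dim with RoPE; leave the rest as-is.
    rotary_frac and base are illustrative assumptions, not the PR's values."""
    T, hd = x.shape
    r = int(hd * rotary_frac) // 2 * 2     # number of rotated dims, rounded even
    half = r // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)            # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:r]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, r:]], axis=1)
```

Because rotation is norm-preserving, the rotated slice keeps its per-position magnitude; only relative phase between positions changes.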
XSA
XSA applied across all layers.
parameters: {"layers":11}
U-Net skip connections
U-Net style skip connections in the transformer stack.
parameters: null
depth recurrence
Depth recurrence is used in the architecture.
parameters: null
LeakyReLU
LeakyReLU-square MLP activation.
parameters: null
SmearGate
BOS-fixed SmearGate used in the attention stack.
parameters: null
Gated Attention
Sparse attention head-output gating is enabled.
parameters: {"window":12}
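Head-output gating scales each attention head's output by a learned, input-dependent gate before the output projection. The sketch below shows the generic dense-sigmoid form only; the "sparse" variant and the window=12 parameter from this entry are not reproduced, and the gate parameterization is an assumption.

```python
import numpy as np

def gated_head_output(head_out, gate_w, x):
    """Scale each head's output by a sigmoid gate computed from the layer input x.
    head_out: (T, H, hd) per-head outputs; gate_w: (d, H); x: (T, d)."""
    g = 1.0 / (1.0 + np.exp(-(x @ gate_w)))   # (T, H) gate in (0, 1)
    return head_out * g[:, :, None]
```

Since the gate lies in (0, 1), gating can only attenuate a head's contribution, which lets the model softly switch heads off per position.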
Optimizer
Muon
weight_decay: 0.5
momentum: 0.9
other_params: {"backend_steps":5}
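Muon replaces the raw momentum update for 2-D weight matrices with an approximately orthogonalized one, computed by a few Newton-Schulz iterations (the backend_steps:5 above). A sketch assuming the quintic coefficients from the public Muon reference implementation; the learning rate here is an assumption, since the PR lists only momentum and weight_decay.

```python
import numpy as np

def newton_schulz(g, steps=5):
    """Approximately orthogonalize a matrix with the quintic Newton-Schulz iteration.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # normalize so singular values <= 1
    transposed = g.shape[0] > g.shape[1]
    if transposed:
        x = x.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transposed else x

def muon_step(w, grad, buf, lr=0.02, momentum=0.9, weight_decay=0.5):
    """One Muon update: momentum accumulation, orthogonalized step, decoupled
    weight decay. lr=0.02 is illustrative, not the submission's value."""
    buf = momentum * buf + grad
    update = newton_schulz(buf, steps=5)
    w = w * (1.0 - lr * weight_decay) - lr * update
    return w, buf
```

The weight_decay of 0.5 is applied in decoupled (AdamW-style) form here, multiplying the weights directly rather than being folded into the gradient.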
Quantization
GPTQ
bits: 6
scope: matrices
int7
bits: 7
scope: embeddings
int8
bits: 8
scope: attention gate
int4
bits: 4
scope: LQER correction
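The mixed-precision scheme above assigns different bit widths per tensor group (6-bit matrices, 7-bit embeddings, 8-bit attention gate, 4-bit LQER correction). A generic symmetric round-to-nearest quantizer sketch follows; `fake_quantize` is a hypothetical helper, and GPTQ itself additionally minimizes layer-wise reconstruction error, which this sketch does not do.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor quantize-dequantize at the given bit width.
    Returns (dequantized floats, integer codes)."""
    qmax = 2 ** (bits - 1) - 1             # e.g. 31 for 6-bit, 127 for 8-bit
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)
```

Round-to-nearest bounds the per-element error by half a quantization step (scale / 2), which is why lower bit widths need either finer-grained scales or error-correcting schemes like GPTQ and LQER.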
Compression
lrzip + brotli
level: null
Test-Time Training
LoRA TTT
parameters: {"rank":80,"phases":3,"prefix_docs":2500}
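LoRA test-time training freezes the base weights and updates only low-rank adapters (rank 80 per the parameters above) on held-out prefix documents. A structural sketch of the adapter itself; the TTT loop, phases, and prefix-document handling are omitted, and all shapes are illustrative.

```python
import numpy as np

class LoRALayer:
    """Frozen weight W plus trainable low-rank adapters A, B.
    With B zero-initialized, the layer starts exactly equal to the base layer."""
    def __init__(self, w, rank=80, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w.shape
        self.w = w                                   # frozen base weight
        self.a = rng.normal(scale=0.01, size=(rank, d_in))
        self.b = np.zeros((d_out, rank))             # zero init: delta starts at 0
        self.alpha = alpha

    def forward(self, x):
        # base projection plus scaled low-rank correction (x A^T) B^T
        return x @ self.w.T + self.alpha * (x @ self.a.T) @ self.b.T
```

Only `a` and `b` would receive gradients during the TTT phases, so the adaptable parameter count is rank * (d_in + d_out) per adapted matrix rather than d_in * d_out.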
Regularization
LN scale
parameters: null
logit softcap
parameters: null
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.85}
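The schedule above is the trapezoidal warmup/warmdown shape: a 20-step linear warmup, a flat plateau, then a linear decay to zero. Interpreting warmdown_frac=0.85 as the fraction of total steps spent in the decay phase is an assumption; the PR does not spell out the convention.

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmup_steps=20, warmdown_frac=0.85):
    """Trapezoidal LR schedule: linear warmup, constant plateau, linear warmdown.
    warmdown_frac is read as the fraction of training spent decaying to zero."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps        # linear warmup
    if step >= warmdown_start:                            # linear decay to 0
        return base_lr * (total_steps - step) / (total_steps - warmdown_start)
    return base_lr                                        # plateau
```

With total_steps=1000 this gives a 20-step ramp, a plateau until step 150, and a 850-step linear decay.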

Novel Contributions

  • Canonical reproduction/support submission for an existing public stack rather than a new technique claim
  • Use of canonical pretokenized CaseOps shards from romeerp/parameter-golf-caseops-v1 instead of locally re-tokenized raw documents
  • Reproduction of the source stack's training, tokenizer, CaseOps pipeline, compression path, and hyperparameter stack
  • Documents a single-seed canonical rerun achieving 1.05985469 val_bpb