PR #2031

open

Record support: canonical top-stack reproduction - val_bpb 1.05985

by deborahnelson8788726
val_bpb
1.0599
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,898,155 bytes

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
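A minimal NumPy sketch of the grouped-query attention pattern above: 8 query heads share 4 KV heads, so each KV head serves 2 query heads. Dimensions, projection shapes, and the causal mask are illustrative, not taken from the submission.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Causal grouped-query attention: each query head reads from a shared KV head."""
    T, d = x.shape
    hd = d // n_heads                      # per-head dimension
    group = n_heads // n_kv_heads          # query heads per KV head (here 2)
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                    # map query head -> shared KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)
```

With 4 KV heads the KV cache is half the size it would be under full multi-head attention at 8 heads.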
Partial RoPE
Uses partial rotary positional embeddings with YaRN.
parameters: null
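The "partial" in partial RoPE means only a fraction of each head's dimensions are rotated; the rest pass through unrotated. A sketch under assumptions: the rotary fraction (0.5) and base (10000) are not stated in the PR, and the YaRN frequency interpolation is omitted for brevity.

```python
import numpy as np

def partial_rope(x, rotary_frac=0.5, base=10000.0):
    """Rotate the first rotary_frac of the head dim with RoPE; leave the rest as-is.
    rotary_frac and base are illustrative assumptions, not the PR's values."""
    T, hd = x.shape
    r = int(hd * rotary_frac) // 2 * 2     # number of rotated dims, rounded even
    half = r // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)            # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:r]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, r:]], axis=1)
```

Because rotation is norm-preserving, the rotated slice keeps its per-position magnitude; only relative phase between positions changes.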
XSA
XSA applied across all layers.
parameters: {"layers":11}
U-Net skip connections
U-Net style skip connections in the transformer stack.
parameters: null
depth recurrence
Depth recurrence is used in the architecture.
parameters: null
LeakyReLU
LeakyReLU-square MLP activation.
parameters: null
SmearGate
BOS-fixed SmearGate used in the attention stack.
parameters: null
Gated Attention
Sparse attention head-output gating is enabled.
parameters: {"window":12}
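Head-output gating scales each attention head's output by a learned, input-dependent gate before the output projection. The sketch below shows the generic dense-sigmoid form only; the "sparse" variant and the window=12 parameter from this entry are not reproduced, and the gate parameterization is an assumption.

```python
import numpy as np

def gated_head_output(head_out, gate_w, x):
    """Scale each head's output by a sigmoid gate computed from the layer input x.
    head_out: (T, H, hd) per-head outputs; gate_w: (d, H); x: (T, d)."""
    g = 1.0 / (1.0 + np.exp(-(x @ gate_w)))   # (T, H) gate in (0, 1)
    return head_out * g[:, :, None]
```

Since the gate lies in (0, 1), gating can only attenuate a head's contribution, which lets the model softly switch heads off per position.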
Optimizer
Muon
weight_decay: 0.5
momentum: 0.9
other_params: {"backend_steps":5}
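Muon replaces the raw momentum update for 2-D weight matrices with an approximately orthogonalized one, computed by a few Newton-Schulz iterations (the backend_steps:5 above). A sketch assuming the quintic coefficients from the public Muon reference implementation; the learning rate here is an assumption, since the PR lists only momentum and weight_decay.

```python
import numpy as np

def newton_schulz(g, steps=5):
    """Approximately orthogonalize a matrix with the quintic Newton-Schulz iteration.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # normalize so singular values <= 1
    transposed = g.shape[0] > g.shape[1]
    if transposed:
        x = x.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transposed else x

def muon_step(w, grad, buf, lr=0.02, momentum=0.9, weight_decay=0.5):
    """One Muon update: momentum accumulation, orthogonalized step, decoupled
    weight decay. lr=0.02 is illustrative, not the submission's value."""
    buf = momentum * buf + grad
    update = newton_schulz(buf, steps=5)
    w = w * (1.0 - lr * weight_decay) - lr * update
    return w, buf
```

The weight_decay of 0.5 is applied in decoupled (AdamW-style) form here, multiplying the weights directly rather than being folded into the gradient.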
Quantization
GPTQ
bits: 6
scope: matrices
int7
bits: 7
scope: embeddings
int8
bits: 8
scope: attention gate
int4
bits: 4
scope: LQER correction
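The mixed-precision scheme above assigns different bit widths per tensor group (6-bit matrices, 7-bit embeddings, 8-bit attention gate, 4-bit LQER correction). A generic symmetric round-to-nearest quantizer sketch follows; `fake_quantize` is a hypothetical helper, and GPTQ itself additionally minimizes layer-wise reconstruction error, which this sketch does not do.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor quantize-dequantize at the given bit width.
    Returns (dequantized floats, integer codes)."""
    qmax = 2 ** (bits - 1) - 1             # e.g. 31 for 6-bit, 127 for 8-bit
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)
```

Round-to-nearest bounds the per-element error by half a quantization step (scale / 2), which is why lower bit widths need either finer-grained scales or error-correcting schemes like GPTQ and LQER.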
Compression
lrzip + brotli
level: null
Test-Time Training
LoRA TTT
parameters: {"rank":80,"phases":3,"prefix_docs":2500}
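LoRA test-time training freezes the base weights and updates only low-rank adapters (rank 80 per the parameters above) on held-out prefix documents. A structural sketch of the adapter itself; the TTT loop, phases, and prefix-document handling are omitted, and all shapes are illustrative.

```python
import numpy as np

class LoRALayer:
    """Frozen weight W plus trainable low-rank adapters A, B.
    With B zero-initialized, the layer starts exactly equal to the base layer."""
    def __init__(self, w, rank=80, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w.shape
        self.w = w                                   # frozen base weight
        self.a = rng.normal(scale=0.01, size=(rank, d_in))
        self.b = np.zeros((d_out, rank))             # zero init: delta starts at 0
        self.alpha = alpha

    def forward(self, x):
        # base projection plus scaled low-rank correction (x A^T) B^T
        return x @ self.w.T + self.alpha * (x @ self.a.T) @ self.b.T
```

Only `a` and `b` would receive gradients during the TTT phases, so the adaptable parameter count is rank * (d_in + d_out) per adapted matrix rather than d_in * d_out.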
Regularization
LN scale
parameters: null
logit softcap
parameters: null
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.85}
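The schedule above is the trapezoidal warmup/warmdown shape: a 20-step linear warmup, a flat plateau, then a linear decay to zero. Interpreting warmdown_frac=0.85 as the fraction of total steps spent in the decay phase is an assumption; the PR does not spell out the convention.

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmup_steps=20, warmdown_frac=0.85):
    """Trapezoidal LR schedule: linear warmup, constant plateau, linear warmdown.
    warmdown_frac is read as the fraction of training spent decaying to zero."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps        # linear warmup
    if step >= warmdown_start:                            # linear decay to 0
        return base_lr * (total_steps - step) / (total_steps - warmdown_start)
    return base_lr                                        # plateau
```

With total_steps=1000 this gives a 20-step ramp, a plateau until step 150, and a 850-step linear decay.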

Novel Contributions

  • Canonical reproduction/support submission for an existing public stack rather than a new technique claim
  • Use of canonical pretokenized CaseOps shards from romeerp/parameter-golf-caseops-v1 instead of locally re-tokenized raw documents
  • Reproduction of the source stack's training, tokenizer, CaseOps pipeline, compression path, and hyperparameter stack
  • Documents a single-seed canonical rerun achieving 1.05985469 val_bpb