PR #824

open

GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)

val_bpb

1.0896

Architecture

HedgeMixer

Optimizer

—

Artifact Size

14.9MB

Training Techniques

Architecture

GatedAttn

Per-head learned FP32 scalar gate multiplied against attention output to learn head-specific contribution magnitudes.

parameters: null
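
As a rough illustration of the gate described above, the framework-free sketch below scales each head's attention output by one scalar per head. The function name, the nested-list layout, and the use of a raw (unsquashed) multiplier are assumptions made for illustration; the PR summary does not say whether the gate passes through a sigmoid or how it is initialized.

```python
from typing import List

Vector = List[float]


def apply_head_gates(head_outputs: List[List[Vector]], gates: List[float]) -> List[List[Vector]]:
    """Scale each head's attention output by that head's scalar gate.

    head_outputs[h][t] is the output vector of head h at position t.
    gates[h] is the learned scalar for head h (assumed to be a raw multiplier).
    """
    if len(head_outputs) != len(gates):
        raise ValueError("need exactly one gate per head")
    return [
        [[g * x for x in vec] for vec in head]
        for head, g in zip(head_outputs, gates)
    ]


# Two heads, two positions, head_dim = 2 (toy values).
heads = [
    [[1.0, 2.0], [3.0, 4.0]],  # head 0
    [[1.0, 1.0], [2.0, 2.0]],  # head 1
]
gated = apply_head_gates(heads, gates=[0.5, 2.0])
```

A single scalar per head adds only as many parameters as there are heads, which is why keeping these gates in FP32 costs almost nothing in artifact size.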

ValueResidual

Per-block learned FP32 scalar injects a fraction of the initial token embedding x0 directly into the residual stream.

parameters: null
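
The injection described above might look like the following minimal sketch. It assumes the scaled x0 is added at the start of each block, before the block's own residual update; the exact injection point, the function names, and the toy blocks are hypothetical and not taken from the PR.

```python
from typing import Callable, List

Vector = List[float]


def run_blocks(
    x0: Vector,
    blocks: List[Callable[[Vector], Vector]],
    lambdas: List[float],
) -> Vector:
    """Run a residual stack, re-injecting lambdas[i] * x0 before block i."""
    if len(blocks) != len(lambdas):
        raise ValueError("need exactly one scalar per block")
    x = list(x0)
    for block, lam in zip(blocks, lambdas):
        # Inject a learned fraction of the initial embedding (assumed placement).
        x = [xi + lam * x0i for xi, x0i in zip(x, x0)]
        # Ordinary residual update from the block itself.
        update = block(x)
        x = [xi + ui for xi, ui in zip(x, update)]
    return x


def zero_block(v: Vector) -> Vector:
    """Toy block that contributes nothing, so only the injection is visible."""
    return [0.0] * len(v)


out = run_blocks([1.0, 2.0], [zero_block, zero_block], lambdas=[0.5, 0.25])
```

With the zero blocks, `out` is `[1.75, 3.5]`: the stream accumulates 0.5 and then 0.25 of x0 on top of the original embedding.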

XSA6

Uses the XSA6 architectural variant from the referenced baseline submission.

parameters: null

BigramHash4K

Includes BigramHash4K as part of the model stack/baseline architecture.

parameters: {"size":4096}
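
The summary gives only the table size (4096), not the hash itself. As a sketch of how a hashed bigram feature is commonly indexed, the snippet below maps each (previous token, current token) pair to one of 4096 buckets. The multiplier constant, the function names, and the use of token 0 as the start-of-sequence predecessor are assumptions, not details from the PR.

```python
from typing import List

BUCKETS = 4096  # "size": 4096 from the parameters above


def bigram_bucket(prev_token: int, token: int, buckets: int = BUCKETS) -> int:
    """Map a token pair to a bucket index (hypothetical multiplicative hash)."""
    return (prev_token * 1_000_003 + token) % buckets


def bigram_bucket_ids(tokens: List[int], buckets: int = BUCKETS, bos: int = 0) -> List[int]:
    """Return one bucket index per position, pairing each token with its predecessor."""
    ids = []
    prev = bos  # assumed predecessor for the first token
    for tok in tokens:
        ids.append(bigram_bucket(prev, tok, buckets))
        prev = tok
    return ids


ids = bigram_bucket_ids([17, 923, 50000, 17])
```

Each index would then select a row of a 4096-row embedding table whose output is added to the token embedding; collisions between unrelated bigrams are accepted as the cost of a small table.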

Test-Time Training

legal TTT

parameters: null

Evaluation

stride-based eval

parameters: {"stride":64}
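
One common reading of stride-based evaluation is a sliding window: the model sees a full context each time, but only the newest `stride` tokens of each window are scored, so every token is scored exactly once with as much left context as possible. The sketch below enumerates such windows. The context length is not stated in the summary, so `context_len` is a hypothetical parameter, and the scoring rule itself is an assumption.

```python
from typing import Iterator, Tuple


def strided_windows(n_tokens: int, context_len: int, stride: int = 64) -> Iterator[Tuple[int, int, int]]:
    """Yield (start, end, score_from) triples.

    The model is fed tokens[start:end]; only tokens[score_from:end] count
    toward the loss. The first window scores everything it contains.
    """
    if stride <= 0 or stride > context_len:
        raise ValueError("stride must be in 1..context_len")
    scored_upto = 0
    while scored_upto < n_tokens:
        step = context_len if scored_upto == 0 else stride
        end = min(scored_upto + step, n_tokens)
        start = max(0, end - context_len)
        yield start, end, scored_upto
        scored_upto = end


windows = list(strided_windows(n_tokens=300, context_len=128, stride=64))
total_scored = sum(end - score_from for _, end, score_from in windows)
```

A smaller stride gives each scored token more context at the price of more forward passes, roughly `n_tokens / stride` of them.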

Compression

zstd

level: 22

Added gated attention with per-head learned FP32 scalar gates.
Added value residual with per-block learned FP32 scalar injection from the initial embedding.
Kept control tensors in FP32 to bypass GPTQ quantization.
Applied legal test-time training (TTT) under the Case 3 interpretation.
Improved val_bpb over the baseline HedgeMixer stack from 1.1078 to a 3-seed mean of 1.08964536.
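
To illustrate the note about keeping control tensors in FP32, the sketch below routes tensors at export time: anything whose name matches a control keyword is copied untouched, and everything else goes through a quantizer. The keyword list, the tensor names, and the 8-bit width are hypothetical, and `fake_quantize` is a plain round-to-nearest stand-in, not GPTQ.

```python
from typing import Dict, List

# Hypothetical name fragments identifying gate / residual-scale tensors.
CONTROL_KEYWORDS = ("gate", "resid_scale")


def fake_quantize(values: List[float], bits: int = 8) -> List[float]:
    """Stand-in for GPTQ: symmetric round-to-nearest, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def export_weights(state: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Quantize ordinary weights; pass control tensors through at full precision."""
    out = {}
    for name, values in state.items():
        if any(key in name for key in CONTROL_KEYWORDS):
            out[name] = list(values)  # bypass quantization entirely
        else:
            out[name] = fake_quantize(values)
    return out


state = {
    "blocks.0.attn.gate": [0.123456789, 1.0],
    "blocks.0.mlp.weight": [0.1, -0.25, 0.7],
}
exported = export_weights(state)
```

Because the control tensors are a handful of scalars, exempting them protects values the model is sensitive to while adding a negligible number of bytes to the artifact.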