PR #824
open
GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)
by sahiee-dev
val_bpb
1.0896
Architecture
HedgeMixer
Optimizer
—
Artifact Size
14.9MB
Training Techniques
Architecture
GatedAttn
Per-head learned FP32 scalar gate multiplied against attention output to learn head-specific contribution magnitudes.
parameters: null
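The gating described above can be illustrated with a minimal sketch in plain Python. It assumes one scalar per head that is multiplied directly into that head's output; the function name `gate_heads`, the list-based shapes, and the example values are hypothetical and not taken from the submission. Whether the actual gate passes through an activation (for example a sigmoid) is not stated in the source.

```python
# Illustrative sketch only: per-head scalar gating of attention output.
# Names and shapes here are hypothetical, not the submission's code.

def gate_heads(head_out, gates):
    """Scale each head's output vector by that head's scalar gate.

    head_out: per-head output vectors, shape [num_heads][head_dim]
    gates:    one scalar per head, shape [num_heads]
    """
    if len(head_out) != len(gates):
        raise ValueError("need exactly one gate per head")
    return [[g * v for v in head] for head, g in zip(head_out, gates)]


head_out = [[1.0, 2.0], [3.0, 4.0]]  # two heads, head_dim = 2
gates = [1.0, 0.5]                   # a gate of 1.0 leaves a head unchanged
gated = gate_heads(head_out, gates)
```

In a real model the gates would be learnable parameters stored in FP32, so each head can learn how strongly it contributes to the block output.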
ValueResidual
Per-block learned FP32 scalar injects a fraction of the initial token embedding x0 directly into the residual stream.
parameters: null
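A minimal sketch of the injection described above, assuming the simplest additive form `x + lam * x0` with one scalar `lam` per block. The additive form, the function name `inject_x0`, and the example values are assumptions for illustration; the source does not specify the exact mixing rule.

```python
# Illustrative sketch only: inject a learned fraction of the initial
# embedding x0 into the residual stream. The additive form is an assumption.

def inject_x0(x, x0, lam):
    """Return x + lam * x0, element-wise, for one token position.

    x:   current residual-stream vector for this block
    x0:  initial token embedding for the same position
    lam: this block's scalar (learned, kept in FP32 in the submission)
    """
    if len(x) != len(x0):
        raise ValueError("x and x0 must have the same dimension")
    return [xi + lam * x0i for xi, x0i in zip(x, x0)]


x0 = [1.0, -2.0]   # initial embedding
x = [0.5, 0.5]     # residual stream entering the block
out = inject_x0(x, x0, 0.25)
```

With `lam = 0.0` the block sees the residual stream unchanged, so the scalar lets each block decide how much of the raw embedding to reintroduce.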
XSA6
Uses the XSA6 architectural variant from the referenced baseline submission.
parameters: null
BigramHash4K
Includes BigramHash4K as part of the model stack/baseline architecture.
parameters: {"size":4096}
Test-Time Training
legal TTT
parameters: null
Evaluation
stride-based eval
parameters: {"stride":64}
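One common way to implement stride-based evaluation is a sliding window that advances by `stride` tokens and scores only the tokens not already scored by an earlier window. The sketch below assumes that scheme; the context length of 256 and the helper name `strided_windows` are hypothetical, and only the stride of 64 comes from the listing above.

```python
# Illustrative sketch only: window boundaries for sliding-window evaluation.
# ctx_len=256 is a made-up value; only stride=64 appears in the source.

def strided_windows(n_tokens, ctx_len, stride):
    """Return (begin, end, score_from) triples covering n_tokens.

    Each window feeds tokens [begin, end) to the model, but only tokens
    [score_from, end) count toward the loss, so every token is scored once.
    """
    if not 0 < stride <= ctx_len:
        raise ValueError("stride must be in (0, ctx_len]")
    windows = []
    scored_upto = 0
    begin = 0
    while scored_upto < n_tokens:
        end = min(begin + ctx_len, n_tokens)
        windows.append((begin, end, scored_upto))
        scored_upto = end
        begin += stride
    return windows


windows = strided_windows(n_tokens=300, ctx_len=256, stride=64)
```

After the first window, each scored token has at least `ctx_len - stride` tokens of preceding context, which is why a smaller stride typically gives a lower (better) val_bpb at higher evaluation cost.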
Compression
zstd
level: 22
Novel Contributions
- Added gated attention with per-head learned FP32 scalar gates.
- Added value residual with per-block learned FP32 scalar injection from the initial embedding.
- Kept control tensors in FP32 to bypass GPTQ quantization.
- Applied legal test-time training (TTT) under the Case 3 interpretation.
- Improved val_bpb from the baseline HedgeMixer stack's 1.1078 to a 3-seed mean of 1.08964536.
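The point about keeping control tensors in FP32 can be sketched as a name-based split applied before quantization. This is an illustration under stated assumptions: the parameter names, the suffix list, and the helper `split_for_quantization` are all hypothetical, and the submission's actual selection logic is not shown in the source.

```python
# Illustrative sketch only: decide which parameters skip quantization.
# The suffixes below are hypothetical names for the gate and residual scalars.

CONTROL_SUFFIXES = ("attn_gate", "value_resid_scale")


def split_for_quantization(param_names):
    """Partition parameter names into (keep_fp32, quantize) lists."""
    keep_fp32, quantize = [], []
    for name in param_names:
        if name.endswith(CONTROL_SUFFIXES):
            keep_fp32.append(name)   # tiny control scalars stay in FP32
        else:
            quantize.append(name)    # large weight matrices go to the quantizer
    return keep_fp32, quantize


names = [
    "blocks.0.attn.qkv.weight",
    "blocks.0.attn_gate",
    "blocks.0.value_resid_scale",
]
keep_fp32, quantize = split_for_quantization(names)
```

Because the control tensors are a handful of scalars, keeping them in FP32 adds very little to the artifact size while avoiding quantization error on values that scale entire heads or blocks.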