PR #1970
openAblation: WiderGate32, RoPE dims, activation slopes, hparam stack (8xH100)
by bsisduck
val_bpb: 1.0674
Architecture: Transformer
Optimizer: —
Artifact Size: 15.89 MB
Training Techniques
Architecture
Gated Attention
Widened AttnOutGate input from 12 to 32 dimensions for per-head gating.
parameters: {"gate_width":32}
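A minimal sketch of what the widened per-head output gate could look like; the module name `AttnOutGate`, the two-layer projection, and the tensor layout are assumptions, while only the gate width values (12 baseline, 32 here) come from the card.

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    """Per-head output gate: each head's output is scaled by a sigmoid gate
    computed from a low-dimensional projection of the hidden state.
    gate_width is the ablated parameter (12 in the baseline, 32 in this PR)."""
    def __init__(self, dim: int, n_heads: int, gate_width: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, gate_width, bias=False)    # compress hidden state
        self.up = nn.Linear(gate_width, n_heads, bias=False)  # one gate logit per head

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim), attn_out: (B, T, n_heads, head_dim)
        gates = torch.sigmoid(self.up(self.down(x)))   # (B, T, n_heads)
        return attn_out * gates.unsqueeze(-1)           # scale each head's output
```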
RoPE
Ablated rotary position embedding dimensionality.
parameters: {"dimensions":24}
RoPE
Ablated rotary position embedding dimensionality.
parameters: {"dimensions":32}
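A rough sketch of what varying the RoPE dimensionality means here: only the first `dimensions` channels of each head are rotated and the rest pass through unchanged. The function name, layout, and base frequency are assumptions; only the 24 and 32 settings come from the card.

```python
import torch

def apply_rope(x: torch.Tensor, dimensions: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate the first `dimensions` channels of each head; leave the rest untouched.
    x: (B, T, n_heads, head_dim). `dimensions` is the ablated parameter (24 or 32)."""
    B, T, H, D = x.shape
    rot, rest = x[..., :dimensions], x[..., dimensions:]
    half = dimensions // 2
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device) / half))   # (half,)
    angles = torch.arange(T, device=x.device)[:, None] * freqs[None, :]    # (T, half)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)
```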
LeakyReLU
Changed activation slope to 0.3.
parameters: {"slope":0.3}
ReLU²
Changed activation to pure ReLU squared with zero leaky slope.
parameters: {"slope":0}
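A sketch of the activation variants being ablated, assuming the MLP activation is a ReLU²-style nonlinearity with an optional leaky slope on the negative side; the exact functional form is an assumption, only the slope values 0.3 and 0 come from the card.

```python
import torch

def relu2_leaky(x: torch.Tensor, slope: float = 0.0) -> torch.Tensor:
    """ReLU² with an optional leaky negative branch (assumed form).
    slope=0 is pure ReLU squared; slope=0.3 is the LeakyReLU-style variant tested."""
    return torch.where(x > 0, x * x, slope * x)
```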
Quantization
int8
bits: 8
scope: embeddings
int6
bits: 6
scope: embeddings
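A sketch of symmetric per-tensor quantization of the embedding weights at 8 or 6 bits; the PR's actual scheme (per-row scales, rounding mode, packing) may differ.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization to signed integers of width `bits`.
    bits=8 -> range [-127, 127]; bits=6 -> [-31, 31] (stored in int8 here)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```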
Compression
brotli
level: null
lzma
level: null
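A sketch of how the two compressors might be compared on the serialized artifact; the file name is a placeholder and the quality/preset settings are assumptions (the card lists level as unspecified).

```python
import brotli
import lzma

with open("artifact.bin", "rb") as f:   # placeholder path for the serialized checkpoint
    raw = f.read()

brotli_bytes = brotli.compress(raw, quality=11)   # assumed max quality
lzma_bytes = lzma.compress(raw, preset=9)         # assumed max preset

print(f"raw:    {len(raw) / 2**20:.2f} MB")
print(f"brotli: {len(brotli_bytes) / 2**20:.2f} MB")
print(f"lzma:   {len(lzma_bytes) / 2**20:.2f} MB")
```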
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
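A sketch of a warmdown schedule where the last `warmdown_frac` of training linearly decays the LR to zero; the linear shape is an assumption, only warmdown_frac=0.85 comes from the card.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Constant LR, then linear decay to 0 over the final `warmdown_frac` of steps."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    remaining = total_steps - step
    return max(remaining / (total_steps - warmdown_start), 0.0)
```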
Regularization
clip sigmas
parameters: {"embed_clip_sigmas":14,"mlp_clip_sigmas":11.5}
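A sketch of sigma-based weight clipping, assuming each parameter tensor is clipped to a multiple of its own standard deviation after each optimizer step; the function name and where it is applied are assumptions, only the 14 and 11.5 values come from the card.

```python
import torch

@torch.no_grad()
def clip_to_sigmas(weight: torch.Tensor, n_sigmas: float) -> None:
    """Clip a weight tensor in place to +/- n_sigmas standard deviations."""
    limit = (n_sigmas * weight.std()).item()
    weight.clamp_(-limit, limit)

# Hypothetical usage under the same assumptions:
# clip_to_sigmas(model.embed.weight, 14)      # embed_clip_sigmas
# clip_to_sigmas(block.mlp.fc.weight, 11.5)   # mlp_clip_sigmas
```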
Test-Time Training
TTT
parameters: {"beta2":0.999}
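A sketch of a test-time training loop with the stated beta2; the optimizer choice, learning rate, step count, and loss interface are assumptions, only beta2=0.999 comes from the card.

```python
import torch

def test_time_train(model, batches, lr=1e-4, beta2=0.999, steps=1):
    """Briefly fine-tune the model on the evaluation stream itself (TTT).
    beta2 is the ablated parameter from the card."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, beta2))
    model.train()
    for _ in range(steps):
        for x, y in batches:
            loss = model(x, y)   # assumes the model returns its LM loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    model.eval()
```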
Novel Contributions
- Systematic ablation of 10 configurations on the PR #1693 architecture with CaseOps SP8192.
- Found that widening the attention gate to 32 dimensions improves both pre-quantization and post-TTT performance.
- Showed that increasing RoPE dimensions hurts quantization robustness, widening the gap between pre- and post-quantization performance.
- Evaluated activation slope variants and found the default slope remains best on this stack.
- Tested the PR #1855 hyperparameter stack and found it does not transfer to this architecture.
- Demonstrated that int6 embeddings are required to fit under the 16MB limit without LQER.
- Compared artifact compressors and found brotli better than LZMA for this submission.