PR #1970
openAblation: WiderGate32, RoPE dims, activation slopes, hparam stack (8xH100)
by bsisduck
val_bpb: 1.0674
Architecture: Transformer
Optimizer: —
Artifact Size: 15.89 MB
Training Techniques
Architecture
Gated Attention
Widened AttnOutGate input from 12 to 32 dimensions for per-head gating.
parameters: {"gate_width":32}
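A minimal sketch of what the widened per-head output gate could look like; the module name `AttnOutGate`, the two-layer projection, and the tensor layout are assumptions, while only the gate width values (12 baseline, 32 here) come from the card.

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    """Per-head output gate: each head's output is scaled by a sigmoid gate
    computed from a low-dimensional projection of the hidden state.
    gate_width is the ablated parameter (12 in the baseline, 32 in this PR)."""
    def __init__(self, dim: int, n_heads: int, gate_width: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, gate_width, bias=False)    # compress hidden state
        self.up = nn.Linear(gate_width, n_heads, bias=False)  # one gate logit per head

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim), attn_out: (B, T, n_heads, head_dim)
        gates = torch.sigmoid(self.up(self.down(x)))   # (B, T, n_heads)
        return attn_out * gates.unsqueeze(-1)           # scale each head's output
```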
RoPE
Ablated rotary position embedding dimensionality.
parameters: {"dimensions":24}
RoPE
Ablated rotary position embedding dimensionality.
parameters: {"dimensions":32}
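A rough sketch of what varying the RoPE dimensionality means here: only the first `dimensions` channels of each head are rotated and the rest pass through unchanged. The function name, layout, and base frequency are assumptions; only the 24 and 32 settings come from the card.

```python
import torch

def apply_rope(x: torch.Tensor, dimensions: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate the first `dimensions` channels of each head; leave the rest untouched.
    x: (B, T, n_heads, head_dim). `dimensions` is the ablated parameter (24 or 32)."""
    B, T, H, D = x.shape
    rot, rest = x[..., :dimensions], x[..., dimensions:]
    half = dimensions // 2
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device) / half))   # (half,)
    angles = torch.arange(T, device=x.device)[:, None] * freqs[None, :]    # (T, half)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)
```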
LeakyReLU
Changed activation slope to 0.3.
parameters: {"slope":0.3}
ReLU²
Changed activation to pure ReLU squared with zero leaky slope.
parameters: {"slope":0}
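A sketch of the activation variants being ablated, assuming the MLP activation is a ReLU²-style nonlinearity with an optional leaky slope on the negative side; the exact functional form is an assumption, only the slope values 0.3 and 0 come from the card.

```python
import torch

def relu2_leaky(x: torch.Tensor, slope: float = 0.0) -> torch.Tensor:
    """ReLU² with an optional leaky negative branch (assumed form).
    slope=0 is pure ReLU squared; slope=0.3 is the LeakyReLU-style variant tested."""
    return torch.where(x > 0, x * x, slope * x)
```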
Quantization
int8
bits: 8
scope: embeddings
int6
bits: 6
scope: embeddings
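A sketch of symmetric per-tensor quantization of the embedding weights at 8 or 6 bits; the PR's actual scheme (per-row scales, rounding mode, packing) may differ.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization to signed integers of width `bits`.
    bits=8 -> range [-127, 127]; bits=6 -> [-31, 31] (stored in int8 here)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```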
Compression
brotli
level: null
lzma
level: null
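A sketch of how the two compressors might be compared on the serialized artifact; the file name is a placeholder and the quality/preset settings are assumptions (the card lists level as unspecified).

```python
import brotli
import lzma

with open("artifact.bin", "rb") as f:   # placeholder path for the serialized checkpoint
    raw = f.read()

brotli_bytes = brotli.compress(raw, quality=11)   # assumed max quality
lzma_bytes = lzma.compress(raw, preset=9)         # assumed max preset

print(f"raw:    {len(raw) / 2**20:.2f} MB")
print(f"brotli: {len(brotli_bytes) / 2**20:.2f} MB")
print(f"lzma:   {len(lzma_bytes) / 2**20:.2f} MB")
```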
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
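A sketch of a warmdown schedule where the last `warmdown_frac` of training linearly decays the LR to zero; the linear shape is an assumption, only warmdown_frac=0.85 comes from the card.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Constant LR, then linear decay to 0 over the final `warmdown_frac` of steps."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    remaining = total_steps - step
    return max(remaining / (total_steps - warmdown_start), 0.0)
```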
Regularization
clip sigmas
parameters: {"embed_clip_sigmas":14,"mlp_clip_sigmas":11.5}
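A sketch of sigma-based weight clipping, assuming each parameter tensor is clipped to a multiple of its own standard deviation after each optimizer step; the function name and where it is applied are assumptions, only the 14 and 11.5 values come from the card.

```python
import torch

@torch.no_grad()
def clip_to_sigmas(weight: torch.Tensor, n_sigmas: float) -> None:
    """Clip a weight tensor in place to +/- n_sigmas standard deviations."""
    limit = (n_sigmas * weight.std()).item()
    weight.clamp_(-limit, limit)

# Hypothetical usage under the same assumptions:
# clip_to_sigmas(model.embed.weight, 14)      # embed_clip_sigmas
# clip_to_sigmas(block.mlp.fc.weight, 11.5)   # mlp_clip_sigmas
```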
Test-Time Training
TTT
parameters: {"beta2":0.999}
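A sketch of a test-time training loop with the stated beta2; the optimizer choice, learning rate, step count, and loss interface are assumptions, only beta2=0.999 comes from the card.

```python
import torch

def test_time_train(model, batches, lr=1e-4, beta2=0.999, steps=1):
    """Briefly fine-tune the model on the evaluation stream itself (TTT).
    beta2 is the ablated parameter from the card."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, beta2))
    model.train()
    for _ in range(steps):
        for x, y in batches:
            loss = model(x, y)   # assumes the model returns its LM loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    model.eval()
```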
Novel Contributions
- Systematic ablation of 10 configurations on the PR #1693 architecture with CaseOps SP8192.
- Found that widening the attention gate to 32 dimensions improves both pre-quantization and post-TTT performance.
- Showed that increasing RoPE dimensions hurts quantization robustness, widening the gap between pre- and post-quantization performance.
- Evaluated activation slope variants and found the default slope remains best on this stack.
- Tested the PR #1855 hyperparameter stack and found it does not transfer to this architecture.
- Demonstrated that int6 embeddings are required to fit under the 16MB limit without LQER.
- Compared artifact compressors and found brotli better than LZMA for this submission.