PR #1948
openRecord: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938 (val_bpb = 1.06242)
by TimS-ml
val_bpb: 1.0624
Architecture: Transformer
Optimizer: Adam
Artifact Size: ~15.95 MB
Training Techniques
Architecture
LeakyReLU
Uses a Leaky ReLU squared activation with a slope of 0.3 instead of the previous 0.5.
parameters: {"slope":0.3}
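The record does not spell out the exact functional form; a minimal sketch, assuming the positive branch is squared while the slope acts as a linear leak on the negative branch (one plausible formulation, not necessarily the PR's):

```python
def leaky_relu_squared(x: float, slope: float = 0.3) -> float:
    """Leaky-ReLU-squared activation (assumed form): square the positive
    branch, keep a linear leak of `slope` on the negative branch."""
    return x * x if x >= 0.0 else slope * x
```

Tuning the slope only changes the negative branch, which is why the change is cheap to sweep.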
weight tying
The token embedding matrix is tied (shared) with the output projection over the 8192-token vocabulary.
parameters: {"vocab_size":8192}
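Weight tying means one matrix serves both as the embedding lookup and, transposed, as the unembedding head. A small pure-Python sketch (d_model is illustrative; only vocab_size comes from the record):

```python
import random

vocab_size, d_model = 8192, 4  # vocab_size from the record; d_model illustrative
random.seed(0)

# One shared matrix W (vocab_size x d_model) backs both ends of the model.
W = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_id):
    """Input side: look up the row for a token."""
    return W[token_id]

def logits(hidden):
    """Output side: project back with the SAME weights (W^T), one dot
    product per vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]
```

Tying halves the embedding-related parameter count, which also shrinks the compressed artifact.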
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":"16/64"}
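With "16/64", only the first 16 of each 64-dim head are rotated and the rest pass through untouched. A sketch under that assumption (the pair layout, interleaved here, varies between implementations):

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Apply RoPE to the first `rot_dims` entries of a head vector `q`
    at position `pos`; leave the remaining dimensions unchanged.
    Interleaved (x, y) pairing is an assumption."""
    out = list(q)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[2 * i], q[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```

The unrotated dimensions carry position-independent content, which is the usual motivation for partial RoPE.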
depth recurrence
Runs layers 3-5 twice per forward pass; the recurrence switches on once training reaches fraction 0.35.
parameters: {"layers":[3,4,5],"repeat":2,"activation_frac":0.35}
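A minimal sketch of the forward pass under these parameters (whether each layer repeats individually or the whole block repeats as a unit is an assumption; per-layer repetition is shown):

```python
def forward(x, layers, step_frac, loop=(3, 5), repeat=2, activation_frac=0.35):
    """Run the layer stack; once training progress `step_frac` passes
    `activation_frac`, layers in the [lo, hi] index range execute
    `repeat` times each."""
    lo, hi = loop
    for i, layer in enumerate(layers):
        reps = repeat if (step_frac >= activation_frac and lo <= i <= hi) else 1
        for _ in range(reps):
            x = layer(x)
    return x
```

Deferring activation lets the early part of training proceed at the cheaper, non-recurrent depth.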
SmearGate
Uses SmearGate with BOS mask.
parameters: null
Gated Attention
Uses sparse attention gates, with the gate weights (attn_gate_w) quantized separately (see Quantization).
parameters: null
Quantization
GPTQ
bits: 6
scope: all attn and MLP weights
GPTQ
bits: 7
scope: tok_emb.weight
int8
bits: 8
scope: attn_gate_w
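The record only states that attn_gate_w is stored in int8; a generic symmetric per-tensor int8 scheme as a sketch:

```python
def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: one scale maps the
    largest-magnitude value to +/-127."""
    scale = max(abs(v) for v in w) / 127.0 or 1.0  # avoid div-by-zero on all-zeros
    q = [max(-127, min(127, round(v / scale))) for v in w]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in q]
```

Per-channel scales or asymmetric zero points are common refinements, but the record gives no such detail.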
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
Test-Time Training
score-first TTT
parameters: {"phases":3,"batch_size":16,"prefix_docs_per_phase":2000,"optimizer":"Adam","learning_rate_peak":0.0001,"lora_rank":96}
LR Schedule
cosine decay
parameters: {"peak_lr":0.0001}
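A standard cosine-decay schedule with this peak LR looks as follows (warmup and floor LR are not described in the record and are omitted/zeroed here):

```python
import math

def cosine_lr(step, total_steps, peak_lr=1e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    t = min(step / max(total_steps, 1), 1.0)  # training progress in [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The same peak value (1e-4) appears in the TTT configuration above, which may or may not share this schedule.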
Compression
Brotli
level: 11
Novel Contributions
- Leaky ReLU squared slope tuned from 0.5 to 0.3 for a free validation BPB gain.
- Reverse-Cholesky plus triangular solve replaces the standard GPTQ Hinv path for a significant speedup.
- Builds on PR #1938 with compliance-tuned defaults including smaller TTT batch size, more TTT phases, and GPTQ reserve time.
- Uses GPTQ int6 for most weights, GPTQ int7 + LQER for token embeddings, and int8 quantization for attn_gate_w.
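The reverse-Cholesky trick in the second bullet can be sketched as follows; this is a reconstruction from the description, not the PR's code. Factoring the row/column-reversed Hessian H gives an upper-triangular V with H = V V^T, and one triangular solve then yields the same upper factor U of inv(H) that GPTQ's standard invert-then-Cholesky path produces, without ever forming inv(H):

```python
import numpy as np

def hinv_factor_standard(H):
    """Reference GPTQ path: explicitly invert H, then take the upper
    factor U with inv(H) = U.T @ U."""
    return np.linalg.cholesky(np.linalg.inv(H)).T

def hinv_factor_reverse(H):
    """Reverse-Cholesky shortcut (reconstructed): Cholesky of the
    reversed H, un-reverse to get upper V with H = V @ V.T, then a
    triangular solve gives U = inv(V), so inv(H) = U.T @ U."""
    n = H.shape[0]
    Lr = np.linalg.cholesky(H[::-1, ::-1])  # lower factor of reversed H
    V = Lr[::-1, ::-1]                      # upper-triangular, H = V V^T
    # np.linalg.solve stands in for a dedicated triangular solve here.
    return np.linalg.solve(V, np.eye(n))
```

Both factors have positive diagonals, so by uniqueness of the Cholesky-type factorization they coincide; the shortcut skips the explicit inverse and the second Cholesky.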