PR #1948
openRecord: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938 (val_bpb = 1.06242)
by TimS-ml
val_bpb: 1.0624
Architecture: Transformer
Optimizer: Adam
Artifact Size: ~15.95 MB
Training Techniques
Architecture
LeakyReLU
Uses a Leaky ReLU squared activation with a slope of 0.3 instead of the previous 0.5.
parameters: {"slope":0.3}
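The record does not spell out the exact functional form; a minimal sketch, assuming the positive branch is squared while the slope acts as a linear leak on the negative branch (one plausible formulation, not necessarily the PR's):

```python
def leaky_relu_squared(x: float, slope: float = 0.3) -> float:
    """Leaky-ReLU-squared activation (assumed form): square the positive
    branch, keep a linear leak of `slope` on the negative branch."""
    return x * x if x >= 0.0 else slope * x
```

Tuning the slope only changes the negative branch, which is why the change is cheap to sweep.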
weight tying
The token embedding matrix is tied (shared) with the output projection over the 8192-token vocabulary.
parameters: {"vocab_size":8192}
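Weight tying means one matrix serves both as the embedding lookup and, transposed, as the unembedding head. A small pure-Python sketch (d_model is illustrative; only vocab_size comes from the record):

```python
import random

vocab_size, d_model = 8192, 4  # vocab_size from the record; d_model illustrative
random.seed(0)

# One shared matrix W (vocab_size x d_model) backs both ends of the model.
W = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_id):
    """Input side: look up the row for a token."""
    return W[token_id]

def logits(hidden):
    """Output side: project back with the SAME weights (W^T), one dot
    product per vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]
```

Tying halves the embedding-related parameter count, which also shrinks the compressed artifact.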
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":"16/64"}
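With "16/64", only the first 16 of each 64-dim head are rotated and the rest pass through untouched. A sketch under that assumption (the pair layout, interleaved here, varies between implementations):

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Apply RoPE to the first `rot_dims` entries of a head vector `q`
    at position `pos`; leave the remaining dimensions unchanged.
    Interleaved (x, y) pairing is an assumption."""
    out = list(q)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[2 * i], q[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```

The unrotated dimensions carry position-independent content, which is the usual motivation for partial RoPE.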
depth recurrence
Runs layers 3-5 twice per forward pass; the recurrence switches on once training reaches fraction 0.35.
parameters: {"layers":[3,4,5],"repeat":2,"activation_frac":0.35}
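A minimal sketch of the forward pass under these parameters (whether each layer repeats individually or the whole block repeats as a unit is an assumption; per-layer repetition is shown):

```python
def forward(x, layers, step_frac, loop=(3, 5), repeat=2, activation_frac=0.35):
    """Run the layer stack; once training progress `step_frac` passes
    `activation_frac`, layers in the [lo, hi] index range execute
    `repeat` times each."""
    lo, hi = loop
    for i, layer in enumerate(layers):
        reps = repeat if (step_frac >= activation_frac and lo <= i <= hi) else 1
        for _ in range(reps):
            x = layer(x)
    return x
```

Deferring activation lets the early part of training proceed at the cheaper, non-recurrent depth.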
SmearGate
Uses SmearGate with BOS mask.
parameters: null
Gated Attention
Uses sparse attention gates, with the gate weights (attn_gate_w) quantized separately (see Quantization).
parameters: null
Quantization
GPTQ
bits: 6
scope: all attn and MLP weights
GPTQ
bits: 7
scope: tok_emb.weight
int8
bits: 8
scope: attn_gate_w
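The record only states that attn_gate_w is stored in int8; a generic symmetric per-tensor int8 scheme as a sketch:

```python
def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: one scale maps the
    largest-magnitude value to +/-127."""
    scale = max(abs(v) for v in w) / 127.0 or 1.0  # avoid div-by-zero on all-zeros
    q = [max(-127, min(127, round(v / scale))) for v in w]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in q]
```

Per-channel scales or asymmetric zero points are common refinements, but the record gives no such detail.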
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
Test-Time Training
score-first TTT
parameters: {"phases":3,"batch_size":16,"prefix_docs_per_phase":2000,"optimizer":"Adam","learning_rate_peak":0.0001,"lora_rank":96}
LR Schedule
cosine decay
parameters: {"peak_lr":0.0001}
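A standard cosine-decay schedule with this peak LR looks as follows (warmup and floor LR are not described in the record and are omitted/zeroed here):

```python
import math

def cosine_lr(step, total_steps, peak_lr=1e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    t = min(step / max(total_steps, 1), 1.0)  # training progress in [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The same peak value (1e-4) appears in the TTT configuration above, which may or may not share this schedule.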
Compression
Brotli
level: 11
Novel Contributions
- Leaky ReLU squared slope tuned from 0.5 to 0.3 for a free validation BPB gain.
- Reverse-Cholesky plus triangular solve replaces the standard GPTQ Hinv path for a significant speedup.
- Builds on PR #1938 with compliance-tuned defaults including smaller TTT batch size, more TTT phases, and GPTQ reserve time.
- Uses GPTQ int6 for most weights, GPTQ int7 + LQER for token embeddings, and int8 quantization for attn_gate_w.
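The reverse-Cholesky trick in the second bullet can be sketched as follows; this is a reconstruction from the description, not the PR's code. Factoring the row/column-reversed Hessian H gives an upper-triangular V with H = V V^T, and one triangular solve then yields the same upper factor U of inv(H) that GPTQ's standard invert-then-Cholesky path produces, without ever forming inv(H):

```python
import numpy as np

def hinv_factor_standard(H):
    """Reference GPTQ path: explicitly invert H, then take the upper
    factor U with inv(H) = U.T @ U."""
    return np.linalg.cholesky(np.linalg.inv(H)).T

def hinv_factor_reverse(H):
    """Reverse-Cholesky shortcut (reconstructed): Cholesky of the
    reversed H, un-reverse to get upper V with H = V @ V.T, then a
    triangular solve gives U = inv(V), so inv(H) = U.T @ U."""
    n = H.shape[0]
    Lr = np.linalg.cholesky(H[::-1, ::-1])  # lower factor of reversed H
    V = Lr[::-1, ::-1]                      # upper-triangular, H = V V^T
    # np.linalg.solve stands in for a dedicated triangular solve here.
    return np.linalg.solve(V, np.eye(n))
```

Both factors have positive diagonals, so by uniqueness of the Cholesky-type factorization they coincide; the shortcut skips the explicit inverse and the second Cholesky.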