PR #1799 (open)

Record: SP8192 + Headwise Gated Attention + LeakyReLU2 + Legal TTT (val_bpb 1.2073)

by jamesEmerson112
val_bpb: 1.2073
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.34 MB

Training Techniques

Architecture
Gated Attention
Per-head sigmoid gate applied after SDPA to suppress or pass through each attention head's output dynamically.
parameters: {"type":"headwise","gates_per_head":1}
LeakyReLU
Uses LeakyReLU(0.5)^2 in the MLP instead of ReLU^2.
parameters: {"negative_slope":0.5,"squared":true}
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
weight tying
Tied input embeddings and output embeddings.
parameters: null
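Tying is a single parameter assignment; a sketch with illustrative sizes:

```python
import torch.nn as nn

vocab_size, dim = 8192, 512  # illustrative; vocab matches SP8192
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight  # one shared matrix, so the artifact stores it once
```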
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_used_for":"scalars/embeddings"}
Quantization
int8
bits: 8
scope: all
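The record only states int8 with scope "all"; a plausible symmetric per-tensor scheme (the exact scaling granularity is an assumption):

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor int8: store int8 values plus one float scale.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```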
Compression
zlib
level: null
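With level null, zlib's default compression level applies; a sketch of packing the serialized weights (function and path names are placeholders):

```python
import io
import zlib
import torch

def pack_artifact(state_dict: dict, path: str) -> None:
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    # zlib's default level is used, matching "level: null" in the record.
    with open(path, "wb") as f:
        f.write(zlib.compress(buf.getvalue()))
```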
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"chunk_tokens":32768,"grad_clip":1}
Sequence Length
sequence_length
train_length: 1024
eval_length: 32768
Other
other
Uses SP8192 SentencePiece BPE tokenizer/vocabulary.
parameters: {"vocab_size":8192}

Novel Contributions

  • Headwise gated attention as an original lightweight per-head gating mechanism
  • SP8192 tokenizer/vocabulary integration
  • LeakyReLU(0.5)^2 activation replacement
  • Legal score-first test-time training on already-scored chunks
  • Combination of SP8192, headwise gated attention, LeakyReLU2, QK-Gain 5.0, and TTT under the 16 MB budget