PR #1799 (open)

Record: SP8192 + Headwise Gated Attention + LeakyReLU2 + Legal TTT (val_bpb 1.2073)

by jamesEmerson112
val_bpb: 1.2073
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.34 MB

Training Techniques

Architecture
Gated Attention
Per-head sigmoid gate applied after SDPA to suppress or pass through each attention head's output dynamically.
parameters: {"type":"headwise","gates_per_head":1}
LeakyReLU
Uses LeakyReLU(0.5)^2 in the MLP instead of ReLU^2.
parameters: {"negative_slope":0.5,"squared":true}
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
weight tying
Tied input embeddings and output embeddings.
parameters: null
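Tying is a single parameter assignment; a sketch with illustrative sizes:

```python
import torch.nn as nn

vocab_size, dim = 8192, 512  # illustrative; vocab matches SP8192
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight  # one shared matrix, so the artifact stores it once
```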
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_used_for":"scalars/embeddings"}
Quantization
int8
bits: 8
scope: all
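The record only states int8 with scope "all"; a plausible symmetric per-tensor scheme (the exact scaling granularity is an assumption):

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor int8: store int8 values plus one float scale.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```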
Compression
zlib
level: null
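With level null, zlib's default compression level applies; a sketch of packing the serialized weights (function and path names are placeholders):

```python
import io
import zlib
import torch

def pack_artifact(state_dict: dict, path: str) -> None:
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    # zlib's default level is used, matching "level: null" in the record.
    with open(path, "wb") as f:
        f.write(zlib.compress(buf.getvalue()))
```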
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"chunk_tokens":32768,"grad_clip":1}
Sequence Length
sequence_length
train_length: 1024
eval_length: 32768
Other
other
Uses SP8192 SentencePiece BPE tokenizer/vocabulary.
parameters: {"vocab_size":8192}

Novel Contributions

  • Headwise gated attention as an original lightweight per-head gating mechanism
  • SP8192 tokenizer/vocabulary integration
  • LeakyReLU(0.5)^2 activation replacement
  • Legal score-first test-time training on already-scored chunks
  • Combination of SP8192, headwise gated attention, LeakyReLU2, QK-Gain 5.0, and TTT under the 16 MB budget