PR #1667
openRECORD: SmearGate + Attention Output Gate + Legal TTT | val_bpb=1.07139
by MarioPaerle
val_bpb
1.0714
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.927 MB
Training Techniques
Architecture
SmearGate
Input-dependent SmearGate reintroduced in modded-nanoGPT style.
parameters: {"width":12}
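The record does not spell out the SmearGate equation; a common form in the modded-nanoGPT lineage blends each token with its predecessor under a per-token sigmoid gate. A minimal sketch under that assumption (the gate projection `w` is a hypothetical learned parameter):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w):
    """Input-dependent smear: blend each token with the previous token.

    x: (T, D) token activations; w: (D,) gate projection (assumed shape).
    Returns x_t + sigmoid(x_t . w) * x_{t-1}, with x_{-1} taken as zero.
    """
    g = sigmoid(x @ w)                                  # (T,) per-token gate
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return x + g[:, None] * prev
```

Because the gate depends on `x`, the model can decide per token how much of the previous token to mix in, unlike a fixed smear coefficient.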
Gated Attention
Per-head multiplicative gate applied to the attention outputs.
parameters: {"width":12,"layers":11}
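A per-head output gate multiplies each head's attention output by a sigmoid computed from the layer input. A sketch, assuming the gate is a learned linear projection `Wg` from the residual stream to one logit per head (the record fixes only that the gate is per-head and multiplicative):

```python
import numpy as np

def gated_attention_output(attn_out, x, Wg):
    """Per-head multiplicative gate on attention outputs.

    attn_out: (T, H, Dh) per-head attention outputs
    x:        (T, D) layer input used to compute the gates
    Wg:       (D, H) hypothetical learned gate projection
    """
    g = 1.0 / (1.0 + np.exp(-(x @ Wg)))   # (T, H) sigmoid gate per head
    return attn_out * g[:, :, None]
```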
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
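With 16 of 64 head dimensions rotated, only the first quarter of each head carries positional information and the rest passes through unchanged. A sketch of partial RoPE for a single head (base frequency 10000 is a conventional assumption, not stated in the record):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first rot_dims of each head dim.

    x: (T, Dh) activations for one head; dims beyond rot_dims are untouched.
    """
    T, Dh = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq[None, :]       # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Each rotated pair is a 2-D rotation, so norms of the rotated slice are preserved and position 0 is left unchanged.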
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
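With 8 query heads over 4 KV heads, each pair of query heads shares one KV head. The sharing can be implemented by repeating the KV heads along the head axis, as in this score-computation sketch:

```python
import numpy as np

def gqa_scores(q, k, n_heads=8, n_kv=4):
    """Grouped-query attention scores.

    q: (H, T, Dh) queries; k: (Hkv, T, Dh) keys. Each group of
    n_heads // n_kv query heads attends to one shared KV head.
    """
    group = n_heads // n_kv
    k_rep = np.repeat(k, group, axis=0)                   # (H, T, Dh)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
```

Halving the KV heads halves the KV cache while leaving query capacity intact, which is the usual GQA trade-off.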
LeakyReLU
MLP uses a squared LeakyReLU activation, LeakyReLU(x; 0.5)^2.
parameters: {"negative_slope":0.5}
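The activation squares a LeakyReLU with negative slope 0.5, so negative inputs are attenuated before squaring rather than zeroed as in the plain squared-ReLU used by many modded-nanoGPT records:

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    """Squared LeakyReLU: (LeakyReLU(x; slope))^2.

    Positive inputs give x^2; negative inputs give (slope * x)^2,
    so the output is nonnegative but still input-dependent for x < 0.
    """
    y = np.where(x >= 0, x, slope * x)
    return y * y
```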
U-Net skip connections
Sigmoid-gated U-Net-style skip connections.
parameters: null
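A U-Net-style skip carries an early layer's activation to a matching late layer; here the mix is controlled by a sigmoid gate. A sketch, assuming a learned scalar gate logit per connection (the record does not state the gate's granularity):

```python
import numpy as np

def unet_skip(x_deep, x_early, gate_logit):
    """Sigmoid-gated U-Net-style skip connection.

    x_deep: activation at the late layer; x_early: saved activation from
    the paired early layer; gate_logit: hypothetical learned parameter.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return x_deep + g * x_early
```

Initializing `gate_logit` near a large negative value would start the skip nearly closed, letting training open it gradually.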
depth recurrence
Three-layer depth recurrence applied to layers 3-5.
parameters: {"layers":[3,4,5],"activated_frac":0.35}
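Depth recurrence reuses a block of layers more than once per forward pass. A sketch under the assumption that layers 3-5 are run as a repeated block (the loop count of 2 is illustrative, and how `activated_frac: 0.35` selects tokens or steps is not specified in the record):

```python
def depth_recurrent_forward(x, layers, recur_idx=(3, 4, 5), n_loops=2):
    """Run a layer stack, applying the recurrent block n_loops times.

    layers: list of callables; layers at recur_idx form the shared block
    that is iterated instead of being applied once.
    """
    i = 0
    while i < len(layers):
        if i == recur_idx[0]:
            for _ in range(n_loops):          # repeat the shared block
                for j in recur_idx:
                    x = layers[j](x)
            i = recur_idx[-1] + 1             # skip past the block
        else:
            x = layers[i](x)
            i += 1
    return x
```

Weight sharing across the repeated passes adds effective depth at zero parameter cost, which matters under the artifact-size constraint.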
parallel residuals
Parallel residual pathway used in later layers.
parameters: {"start_layer":7}
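From layer 7 onward the record uses parallel residuals: attention and MLP read the same input and their outputs are summed, instead of the MLP consuming the attention output. A minimal sketch:

```python
def parallel_block(x, attn, mlp):
    """Parallel residual block: both sublayers branch off the same input.

    Sequential form would be x + mlp(x + attn(x)); the parallel form
    computes both branches independently and sums them.
    """
    return x + attn(x) + mlp(x)
```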
Regularization
logit softcap
parameters: {"value":30}
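Logit softcapping with value 30 bounds the logits smoothly via tanh while staying near-identity for small values:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```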
layerwise LN scale
parameters: {"formula":"1/sqrt(layer_idx+1)"}
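The layerwise scale multiplies each layer's normalization output by 1/sqrt(layer_idx + 1), shrinking contributions from deeper layers. A sketch assuming it wraps a standard LayerNorm (whether the base norm is LayerNorm or RMSNorm is not stated):

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-6):
    """LayerNorm output scaled by 1/sqrt(layer_idx + 1).

    Layer 0 gets scale 1; layer 3 gets scale 1/2, and so on, damping
    deeper layers' residual contributions.
    """
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)
```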
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"mlr":0.026,"ema":0.9965}
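Muon's core step orthogonalizes the (momentum-averaged) gradient of each 2-D weight matrix with a Newton-Schulz iteration before applying it. A sketch of that iteration, using the quintic coefficients from the published Muon implementation (the surrounding momentum/EMA bookkeeping implied by `mlr` and `ema` above is omitted):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize a gradient matrix (Muon's core step).

    Normalizes G by its Frobenius norm, then runs a quintic
    Newton-Schulz iteration driving all singular values toward 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The orthogonalized update equalizes the scale of the gradient's singular directions, which is the intuition behind Muon's fast convergence on matrix parameters.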
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 7
scope: embeddings
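GPTQ quantizes weights column by column with Hessian-based error compensation; that machinery is beyond a sketch, but the 6-bit grid it targets can be illustrated with plain per-channel round-to-nearest quantization (this shows only the bit budget, not GPTQ's update rule):

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """Per-channel symmetric round-to-nearest quantization.

    For bits=6 the integer grid is [-32, 31]; real GPTQ additionally
    redistributes rounding error across remaining columns.
    """
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q * scale
```

Giving embeddings an extra bit (int7) is a common concession, since embedding rows are used in isolation and tolerate rounding error less gracefully than matmul weights.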
Compression
Brotli
level: 11
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.005,"epochs_per_chunk":3}
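"Score-first" TTT evaluates each chunk under the current weights before adapting on it, so the reported loss never sees the weights that were trained on that chunk. A toy sketch on a linear least-squares model (the model and loss are stand-ins; the record only fixes SGD, lr=0.005, and 3 epochs per chunk):

```python
import numpy as np

def score_first_ttt(chunks, w, lr=0.005, epochs=3):
    """Score-first test-time training loop (toy linear model).

    For each (X, y) chunk: record the MSE under the current weights
    FIRST, then run `epochs` full-batch SGD passes on that chunk.
    """
    losses = []
    for X, y in chunks:
        pred = X @ w
        losses.append(float(np.mean((pred - y) ** 2)))   # score first
        for _ in range(epochs):                          # then adapt
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w = w - lr * grad
    return losses, w
```

The ordering is what makes the TTT "legal": each chunk contributes to the score before its contents influence the weights.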
Novel Contributions
- Reintroduced SmearGate with input dependence
- Added per-head attention output gating
- Combined SmearGate, attention output gating, and score-first TTT in a single record
- Used GPTQ quantization with int6 matrices and int7 embeddings
- Applied Brotli-11 compression with byte-shuffle