PR #1324 (open)

Record: val_bpb 0.8275 (3-seed mean) — SLOT-28 + VRL + QK-Gain 4.0 + XSA-11

by yahya010
val_bpb: 0.8275
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.5-15.8 MB

Training Techniques

Architecture
Value Residual
Adds value residual learning with sigmoid-gated interpolation.
parameters: {"init":-1.5}
XSA
Cross-head subtraction attention applied to all 11 layers.
parameters: {"layers":11}
BigramHash
Uses bigram hash embeddings.
parameters: {"dimensions":[1024,128]}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"train":16,"eval":64}
LeakyReLU
Uses LeakyReLU squared MLP activation.
parameters: {"squared":true,"negative_slope":0.5}
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
VE128
Uses VE128 on layers 9 and 10.
parameters: {"layers":[9,10]}
U-Net skip connections
Adds U-Net style skip connections.
parameters: null
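
A generic sketch of the U-Net skip pattern: blocks in the first half stash their inputs, and the mirrored blocks in the second half add them back through learnable gains. The gain parameterization is an assumption; the PR lists no parameters for this item:

```python
import torch
import torch.nn as nn

class UNetSkipStack(nn.Module):
    """Wraps a stack of transformer blocks with U-Net style skip connections."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.n_skip = len(blocks) // 2
        # One learnable gain per skip, zero-initialized so skips are a no-op at init.
        self.skip_gains = nn.Parameter(torch.zeros(self.n_skip))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        saved = []
        for i, block in enumerate(self.blocks):
            if i < self.n_skip:
                saved.append(x)                       # encoder half: remember input
            elif i >= len(self.blocks) - self.n_skip:
                j = len(self.blocks) - 1 - i          # mirror index into saved stack
                x = x + self.skip_gains[j] * saved[j]
            x = block(x)
        return x
```
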
Weight Averaging
EMA
Exponential moving average of weights.
parameters: {"decay":0.997}
SWA
Stochastic weight averaging.
parameters: null
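
A minimal sketch of the EMA update with decay 0.997; how the SWA checkpoints are combined with it is not specified here, so only the EMA side is shown:

```python
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.997) -> None:
    # Exponential moving average of weights: ema <- decay * ema + (1 - decay) * current.
    # Call after every optimizer step; evaluate with ema_model's weights.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)
```
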
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 6
scope: all
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"slot_steps":28}
LR Schedule
cosine decay
parameters: {"start":0.012,"end":0.001}
Evaluation
sliding window eval
parameters: {"stride":96}
Regularization
LN scale
parameters: null
Compression
lzma
level: 9
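
Final artifact compression at level 9 corresponds to something like the following, using Python's standard lzma module; the exact container format is an assumption:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # LZMA at the highest standard preset (level 9); PRESET_EXTREME would be
    # a further option but is not indicated by the listed settings.
    return lzma.compress(raw, preset=9)
```
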

Novel Contributions

  • SLOT-28 eval-time optimization with 28 AdamW steps
  • Value Residual Learning with sigmoid-gated interpolation
  • Per-head learnable query scaling (QK-Gain 4.0)
  • Cross-head subtraction attention across all 11 layers
  • Frozen-model SLOT evaluation with a per-window throwaway delta and logit bias (see the sketch after this list)
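
A hypothetical sketch of the frozen-model SLOT evaluation loop: for each eval window the base weights stay frozen, a throwaway logit-bias delta is fitted with 28 AdamW steps, used to score the window, and then discarded. The objective split, learning rate, and parameterizing the delta purely as a logit bias are assumptions; only the step count and the frozen-model, throwaway-delta structure come from the PR description:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def frozen_logits(model, ids: torch.Tensor) -> torch.Tensor:
    return model(ids)  # (1, T, vocab); base weights stay frozen throughout

def slot_eval_window(model, ids: torch.Tensor, n_ctx: int, vocab: int,
                     steps: int = 28, lr: float = 1e-2) -> float:
    """Fit a per-window logit bias on the first n_ctx tokens, score the rest,
    then discard the bias (hypothetical sketch)."""
    logits = frozen_logits(model, ids)[0]                       # (T, vocab), constant
    targets = ids[0, 1:]
    bias = torch.zeros(vocab, device=logits.device, requires_grad=True)
    opt = torch.optim.AdamW([bias], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Only the bias receives gradients; the model itself is never updated.
        loss = F.cross_entropy(logits[:n_ctx - 1] + bias, targets[:n_ctx - 1])
        loss.backward()
        opt.step()
    with torch.no_grad():
        nll = F.cross_entropy(logits[n_ctx - 1:-1] + bias, targets[n_ctx - 1:],
                              reduction="sum")
    return nll.item()                                           # bias is discarded here
```
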