PR #1851

RECORD (open)

Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT

by aquariouseworkman
val_bpb
1.0613
Architecture
Transformer
Optimizer
Artifact Size
15,952,086 bytes

Training Techniques

Architecture
SmearGate
SmearGate attention mechanism with a BOS masking fix to prevent cross-document leakage.
parameters: {"window":12}
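The entry above does not spell out the SmearGate formulation. The following is a minimal sketch, assuming the common form in which each position receives a sigmoid-gated copy of the previous position's embedding, and assuming the BOS fix zeroes the gate wherever the current token is BOS, so the last token of one document cannot smear into the start of the next. The scalar per-position gate, `BOS_ID`, and the function name are hypothetical, and the `window` parameter is not modeled here.

```python
import math

BOS_ID = 0  # hypothetical BOS token id, for illustration only


def smear_with_bos_mask(token_ids, embeddings, gate_logits):
    """Mix each position with its predecessor, except across document starts.

    token_ids:   list[int], length T
    embeddings:  list[list[float]], shape T x D
    gate_logits: list[float], length T (assumed: one scalar gate per position)
    """
    out = [list(embeddings[0])]  # position 0 has no predecessor
    for t in range(1, len(token_ids)):
        gate = 1.0 / (1.0 + math.exp(-gate_logits[t]))  # sigmoid
        if token_ids[t] == BOS_ID:
            gate = 0.0  # BOS starts a new document: block the previous token
        prev, cur = embeddings[t - 1], embeddings[t]
        out.append([c + gate * p for c, p in zip(cur, prev)])
    return out
```

Without the mask, the BOS position of every document after the first would carry information from the tail of the preceding document.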
weight tying
Tied input and output embeddings.
parameters: null
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
ReLU²
Uses squared ReLU-style activation in the MLP stack.
parameters: null
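For reference, the two activations named above can be written as scalar functions. This is only a sketch of the standard definitions; the source does not say how the two are combined in the MLP, so no combination is shown.

```python
def leaky_relu(x, slope=0.5):
    """Identity for non-negative inputs, scaled by `slope` for negative ones."""
    return x if x >= 0 else slope * x


def relu_squared(x):
    """Square of the ReLU output: zero for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2
```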
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
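A minimal sketch of partial RoPE, assuming "16/64" means the first 16 of 64 dimensions in each head are rotated and the remaining 48 pass through unchanged. The adjacent-pair layout and the base of 10000 are assumptions taken from the usual RoPE formulation, not from this PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector; keep the rest."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # Each adjacent pair (i, i+1) gets its own rotation frequency.
        theta = position * base ** (-i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out
```

The rotation preserves the norm of the rotated slice, and the untouched dimensions carry no positional signal.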
depth recurrence
Layers 3-5 are looped twice, with looping activated at a fractional depth of 0.35.
parameters: {"layers":[3,4,5],"loops":2,"activated_at_frac":0.35}
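The resulting layer order can be sketched as below. Several things here are assumptions: that the model has 11 layers (taken from the XSA entry), that `loops: 2` means two passes in total over the block, that the block repeats as a unit, and that `activated_at_frac` is a threshold on some progress value below which the loop is off. The source does not say what the fraction is measured against, so `progress` is a hypothetical argument.

```python
def layer_schedule(num_layers=11, loop_layers=(3, 4, 5), loops=2,
                   progress=1.0, activated_at_frac=0.35):
    """Return the order in which layer indices run for one forward pass."""
    # Before the threshold the looped block runs once, like any other layer.
    repeats = loops if progress >= activated_at_frac else 1
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(0, first))
    for _ in range(repeats):
        order.extend(loop_layers)
    order.extend(range(last + 1, num_layers))
    return order
```

Under these assumptions the looped model runs 14 layer applications while storing the weights of only 11 layers.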
XSA
XSA is applied across all layers.
parameters: {"layers":11}
KV head count
Uses grouped-query-style attention with fewer KV heads (4) than query heads (8).
parameters: {"heads":8,"kv_heads":4}
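With 8 query heads and 4 KV heads, each KV head serves a group of two query heads. A minimal sketch of the mapping, assuming the common convention of contiguous groups (the PR does not state the grouping order):

```python
def kv_head_for_query_head(q_head, num_heads=8, num_kv_heads=4):
    """Return the index of the KV head that a given query head attends with."""
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size
```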
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
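The source gives only the cap value. A minimal sketch, assuming the common tanh form of logit softcapping, which is close to the identity for small logits and bounds every logit's magnitude by the cap:

```python
import math


def softcap(logits, cap=30.0):
    """Squash logits smoothly so that no magnitude exceeds `cap`."""
    return [cap * math.tanh(z / cap) for z in logits]
```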
Quantization
GPTQ
bits: null
scope: full model
Compression
Brotli
level: null
Test-Time Training
score-first TTT
parameters: {"phases":3}
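The score-first ordering can be illustrated with a toy model. This sketch swaps in an add-one-smoothed unigram byte model for the transformer and a count update for the gradient step, purely to show the ordering: every chunk is scored with the model as it stood before adapting to that chunk. The three phases are not modeled, since the source does not describe them.

```python
import math
from collections import Counter


def score_first_ttt(chunks, vocab_size=256):
    """Return total NLL in bits; each chunk is scored BEFORE the model adapts to it."""
    counts = Counter()
    total = 0
    bits = 0.0
    for chunk in chunks:
        # 1) Score with the model as it stood before seeing this chunk.
        for sym in chunk:
            p = (counts[sym] + 1) / (total + vocab_size)  # add-one smoothing
            bits -= math.log2(p)
        # 2) Only then adapt on the chunk that was just scored.
        counts.update(chunk)
        total += len(chunk)
    return bits
```

Because adaptation always follows scoring, no token contributes to the reported loss after the model has trained on it.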

Novel Contributions

  • SmearGate BOS document boundary fix to prevent cross-document leakage
  • Combination of PR #1787 base stack with SmearGate and asymmetric LQER
  • Phased score-first test-time training
  • Record validation score of 1.06128 bpb
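For readers unfamiliar with the metric, bits per byte is the summed negative log-likelihood converted from nats to bits and divided by the number of raw bytes in the evaluated text. A minimal sketch of that conversion follows; the function and argument names are hypothetical, and the PR's own evaluation code is not shown in the source.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL in nats into bits per byte of the underlying text."""
    return total_nll_nats / math.log(2) / total_bytes
```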