PR #2072

open

Record: SP8192 + BOS-Fix SmearGate + LQER Asym + Phased TTT (10L) — val_bpb 1.07171

val_bpb: 1.07171
Architecture: Transformer
Optimizer:
Artifact Size: 15.37 MB

Training Techniques

Architecture

  • SmearGate: BOS-boundary-fixed SmearGate attention/gating modification (parameters: null)
  • Weight tying: layer looping / tied-layer reuse across encoder-decoder-style paths (parameters: {"layers": 10})
  • KV head count: Transformer configuration with reduced KV heads (parameters: {"heads": 8, "kv_heads": 4})
  • Gated Attention: sparse attention gate used in the model (parameters: {"scale": 0.5})
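The BOS fix is the load-bearing change in the SmearGate entry above. Assuming SmearGate mixes each position's activation with its predecessor through a learned gate (the entry does not spell out the exact formulation), a minimal sketch of the boundary-safe version could look like this; the function name, shapes, and shift-and-mix form are all illustrative assumptions:

```python
import numpy as np

def smear_gate(x, gate, bos_mask):
    """Gated 'smear' of each position with its predecessor, with the BOS
    fix: positions flagged as BOS never mix in the previous sequence's
    activations. The shift-and-mix form and shapes here are assumptions;
    the entry does not specify SmearGate's exact formulation.

    x        : (T, D) activations
    gate     : (T,) learned gate values in [0, 1]
    bos_mask : (T,) bool, True at BOS positions
    """
    prev = np.vstack([x[:1], x[:-1]])       # predecessor activations (shift right)
    g = np.where(bos_mask, 0.0, gate)       # BOS fix: zero the gate at boundaries
    g[0] = 0.0                              # first position has no predecessor
    return (1.0 - g)[:, None] * x + g[:, None] * prev
```

Without the mask, a packed-sequence batch would smear the last token of one document into the first token of the next; zeroing the gate at BOS positions is the minimal repair.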
Quantization

  • GPTQ (bits: 6, scope: all)
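GPTQ proper chooses roundings with second-order (Hessian-based) error correction, which is beyond a few lines. As a hedged illustration of what the 6-bit storage format alone implies, here is a plain per-row round-to-nearest sketch, not the GPTQ algorithm itself:

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Per-row symmetric round-to-nearest quantization to `bits` bits.
    GPTQ additionally corrects rounding error column-by-column using
    second-order statistics; this sketch only shows the storage format
    implied by the 6-bit budget."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for signed 6-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)       # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At 6 bits per weight plus one scale per row, this is the kind of accounting that lets the checkpoint fit the 16 MB artifact limit.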
Other

  • LQER Asymmetric quantization refinement (parameters: null)
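LQER (Low-rank Quantization Error Reconstruction) refines a quantized weight by fitting a low-rank factorization to the quantization residual, so inference uses W_q + A @ B. A minimal sketch of the plain variant via truncated SVD; the asymmetric refinement named above additionally weights the residual by activation statistics, which is omitted here:

```python
import numpy as np

def lqer_refine(w, w_q, rank=4):
    """LQER-style refinement (sketch): fit a rank-r factorization to the
    quantization residual w - w_q via truncated SVD, so inference can use
    w_q + a @ b. The asymmetric variant additionally weights the residual
    by activation statistics before the SVD; that weighting is omitted."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # (m, r), singular values folded into a
    b = vt[:rank]                  # (r, n)
    return a, b
```

Because truncated SVD is the best rank-r approximation of the residual, the refined weight is never farther from the original than the raw quantized one.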
Test-Time Training

  • Score-first TTT (parameters: {"rank": 96, "phases": 1})
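As a sketch of what a rank-constrained, single-phase test-time update can look like: a frozen weight receives an additive rank-r adapter fitted by a few gradient steps on the observed test data. The squared-error objective, initialization, and step counts below are illustrative assumptions; the entry's actual score-first objective is not specified:

```python
import numpy as np

def ttt_lowrank(W, X, Y, rank=4, steps=100, lr=0.02):
    """Test-time training sketch: keep W frozen and fit a rank-r additive
    update A @ B by full-batch gradient descent on observed test pairs
    (X, Y). One fitting pass corresponds to one 'phase' ({"phases": 1});
    `rank` plays the role of the entry's "rank": 96. The squared-error
    loss is a stand-in for the unspecified score-first objective."""
    d_out, d_in = W.shape
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(d_out, rank))  # small random init
    B = np.zeros((rank, d_in))                     # zero init: start exactly at W
    n = X.shape[1]
    for _ in range(steps):
        R = (W + A @ B) @ X - Y                    # residual on the test data
        gA = 2.0 / n * R @ (B @ X).T
        gB = 2.0 / n * A.T @ R @ X.T
        A -= lr * gA
        B -= lr * gB
    return W + A @ B
```

The zero-initialized B means the adapted weight starts exactly at W, so a phase can only move it if the test data provides signal.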
Compression

  • brotli (level: 11)
Sequence Length

  • train_length: null, eval_length: null
LR Schedule

  • warmdown (parameters: {"warmdown_frac": 0.85, "min_lr": 0.1})
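The warmdown schedule reads as: hold the base LR constant, then decay linearly to a floor. A sketch under two assumptions the entry leaves implicit: that warmdown_frac 0.85 is the fraction of training spent in the decay phase, and that min_lr 0.1 is a fraction of the base LR rather than an absolute value:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.85, min_lr_frac=0.1):
    """Constant LR followed by linear decay. Assumptions (the entry does
    not spell out either convention): the warmdown covers the final
    `warmdown_frac` of training, and `min_lr_frac` scales the base LR."""
    start = (1.0 - warmdown_frac) * total_steps   # decay begins here
    if step < start:
        return base_lr
    t = (step - start) / max(total_steps - start, 1e-9)   # 0 -> 1 over the warmdown
    return base_lr * (1.0 - t * (1.0 - min_lr_frac))
```

With these parameters the LR is flat for the first 15% of steps, then ramps linearly down to 10% of the base LR by the final step.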

Novel Contributions

  • SP8192 tokenizer in place of SP1024
  • Reduced to 10 transformer layers to fit the larger embedding table under the 16 MB artifact limit
  • Combined stack: BOS-fixed SmearGate, LQER asymmetric quantization refinement, and phased TTT
  • Layer looping with sparse attention gating
  • Achieved val_bpb 1.07171 with a 15.37 MB artifact