PR #2072
Record: SP8192 + BOS-Fix SmearGate + LQER Asym + Phased TTT (10L) — val_bpb 1.07171
by wfproc
val_bpb: 1.0717
Architecture: Transformer
Optimizer: —
Artifact Size: 15.37 MB
Training Techniques
Architecture
SmearGate
BOS-boundary-fixed SmearGate attention/gating modification
parameters: null
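A minimal sketch of the presumed mechanism, assuming SmearGate blends each token's embedding with its predecessor's through a learned sigmoid gate, and the BOS fix zeroes that blend at document starts so nothing smears across a BOS boundary. The module name, shapes, and gate parameterization are illustrative, not the PR's code:

```python
import torch
import torch.nn as nn

class BOSFixedSmearGate(nn.Module):
    """Assumed form: y_t = x_t + g_t * x_{t-1}, g_t = sigmoid(w . x_t),
    with g_t forced to 0 at BOS positions so no state crosses a
    document boundary. Hypothetical reconstruction, not the PR's code."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor, bos_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); bos_mask: (batch, seq), True at BOS tokens.
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = 0.0                                  # position 0 has no predecessor
        gate = torch.sigmoid(self.gate_proj(x))           # (batch, seq, 1)
        gate = gate * (~bos_mask).unsqueeze(-1).to(gate.dtype)  # BOS fix: no smear into a new doc
        return x + gate * prev

layer = BOSFixedSmearGate(dim=256)
x = torch.randn(2, 8, 256)
bos = torch.zeros(2, 8, dtype=torch.bool)
bos[:, 0] = True
print(layer(x, bos).shape)  # torch.Size([2, 8, 256])
```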
weight tying
Layer looping / tied layer reuse across encoder-decoder-style paths
parameters: {"layers":10}
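The record only gives layers: 10. A sketch under the assumption that "layer looping" means the same 10-block stack is traversed more than once, so effective depth grows with no extra parameters; the n_loops knob is hypothetical and not stated in the record:

```python
import torch
import torch.nn as nn

class LoopedStack(nn.Module):
    """Apply one shared stack of blocks several times (tied layer reuse).
    Assumption: the same 10 parameterized blocks are run n_loops times;
    n_loops is a hypothetical knob, not a value from the record."""

    def __init__(self, dim: int = 256, n_layers: int = 10, n_loops: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_loops):  # reuse the same weights on every pass
            for block in self.blocks:
                x = block(x)
        return x

x = torch.randn(2, 16, 256)
print(LoopedStack()(x).shape)  # torch.Size([2, 16, 256])
```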
KV head count
Transformer configuration with reduced KV heads
parameters: {"heads":8,"kv_heads":4}
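This is standard grouped-query attention with the record's 8 query heads sharing 4 KV heads, which halves the KV projections and the KV cache. A self-contained sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """8 query heads share 4 KV heads (the record's heads/kv_heads config)."""

    def __init__(self, dim: int = 256, heads: int = 8, kv_heads: int = 4):
        super().__init__()
        assert heads % kv_heads == 0
        self.heads, self.kv_heads = heads, kv_heads
        self.head_dim = dim // heads
        self.q_proj = nn.Linear(dim, heads * self.head_dim, bias=False)
        self.kv_proj = nn.Linear(dim, 2 * kv_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.heads, self.head_dim).transpose(1, 2)
        kv = self.kv_proj(x).view(b, t, 2, self.kv_heads, self.head_dim)
        k, v = kv.unbind(dim=2)
        # Each KV head serves heads // kv_heads query heads.
        k = k.transpose(1, 2).repeat_interleave(self.heads // self.kv_heads, dim=1)
        v = v.transpose(1, 2).repeat_interleave(self.heads // self.kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 256)
print(GroupedQueryAttention()(x).shape)  # torch.Size([2, 16, 256])
```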
Gated Attention
Sparse attention gate used in the model
parameters: {"scale":0.5}
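A sketch assuming the gate is a per-token sigmoid applied to the attention output, with scale: 0.5 read as a multiplier on the gate pre-activation; the record does not say where the scale actually enters:

```python
import torch
import torch.nn as nn

class AttentionOutputGate(nn.Module):
    """Elementwise sigmoid gate on the attention output.
    Assumption: g = sigmoid(scale * W x) multiplied into the attention
    result; sigmoid gates push many outputs toward zero, giving the
    sparsity the entry mentions. How 'scale' is used in the PR is not stated."""

    def __init__(self, dim: int, scale: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.scale = scale

    def forward(self, attn_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.scale * self.proj(x))
        return gate * attn_out

gate = AttentionOutputGate(dim=256)
x = torch.randn(2, 16, 256)
print(gate(torch.randn(2, 16, 256), x).shape)  # torch.Size([2, 16, 256])
```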
Quantization
GPTQ
bits: 6
scope: all
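GPTQ proper quantizes weight columns one at a time, folding each column's rounding error into the not-yet-quantized columns via the inverse Hessian of the layer inputs. As a stand-in, this sketch shows only the 6-bit asymmetric grid and storage the entry implies, with plain round-to-nearest in place of GPTQ's error compensation:

```python
import torch

def quantize_6bit_per_channel(w: torch.Tensor):
    """Uniform asymmetric 6-bit quantization, one scale/zero per output row.
    Round-to-nearest stand-in: GPTQ uses the same 6-bit grid but adds
    Hessian-based error compensation across columns."""
    qmax = 2**6 - 1                                   # 6 bits -> levels 0..63
    wmin = w.min(dim=1, keepdim=True).values
    wmax = w.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return q.to(torch.uint8), scale, zero

def dequantize(q, scale, zero):
    return (q.float() - zero) * scale

w = torch.randn(64, 64)
q, s, z = quantize_6bit_per_channel(w)
print((w - dequantize(q, s, z)).abs().max())  # per-entry error is about scale/2
```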
Other
LQER
LQER asymmetric quantization refinement
parameters: null
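LQER approximates the quantization residual W − Q(W) with a truncated SVD and keeps the low-rank factors next to the quantized weights. A sketch of that idea; the rank is illustrative (the record gives none), and the activation-weighted variant of the SVD from the LQER paper is omitted:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Per-row asymmetric quantize-dequantize (same grid as the GPTQ sketch)."""
    qmax = 2**bits - 1
    wmin = w.min(dim=1, keepdim=True).values
    wmax = w.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale

def lqer_refine(w: torch.Tensor, rank: int = 16):
    """LQER-style low-rank reconstruction of the quantization error.
    Effective weight at inference: W_q + A @ B, where A, B come from a
    truncated SVD of W - W_q. rank=16 is illustrative, not from the record."""
    w_q = fake_quantize(w)
    u, s, vh = torch.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold singular values into A
    b = vh[:rank]
    return w_q, a, b

w = torch.randn(128, 128)
w_q, a, b = lqer_refine(w)
print((w - w_q).norm(), (w - (w_q + a @ b)).norm())  # error shrinks with rank
```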
Test-Time Training
score-first TTT
parameters: {"rank":96,"phases":1}
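One reading of "score-first" TTT, treated here as prequential evaluation: each chunk of the eval stream is scored under the current weights before the model takes a gradient step on it, so the reported loss never sees leaked information, and phases: 1 becomes a single score-then-train pass. Per the record, the optimizer would be restricted to rank-96 adapter parameters; the toy model and chunk size below are placeholders:

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, opt, stream: torch.Tensor, chunk: int = 512) -> float:
    """Prequential ('score-first') test-time training over a token stream.
    Scores each chunk BEFORE training on it; assumes `opt` updates only
    adapter parameters (e.g. rank-96 LoRA), one pass = phases: 1."""
    total_bits, total_tokens = 0.0, 0
    for i in range(0, stream.numel() - 1, chunk):
        x = stream[i : i + chunk].unsqueeze(0)
        y = stream[i + 1 : i + chunk + 1].unsqueeze(0)
        logits = model(x)[:, : y.shape[1]]
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        total_bits += loss.item() / math.log(2) * y.numel()   # score first...
        total_tokens += y.numel()
        opt.zero_grad()
        loss.backward()                                       # ...then train
        opt.step()
    return total_bits / total_tokens  # bits per token (bpb needs bytes per token)

class TinyLM(torch.nn.Module):
    def __init__(self, vocab: int = 8192, dim: int = 64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, x):
        return self.head(self.emb(x))

lm = TinyLM()
opt = torch.optim.SGD(lm.parameters(), lr=1e-2)
print(score_first_ttt(lm, opt, torch.randint(0, 8192, (4096,))))
```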
Compression
brotli
level: 11
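Brotli at quality 11 (its maximum) is the final packing step for the artifact. A sketch using the standard brotli package; the file name is a placeholder, and the 16 MB check matches the limit cited under Novel Contributions:

```python
import io
import brotli  # pip install brotli
import torch

def pack_artifact(state_dict, path: str = "artifact.bin") -> float:
    """Serialize weights and compress with brotli at max quality (level 11)."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    blob = brotli.compress(buf.getvalue(), quality=11)
    with open(path, "wb") as f:
        f.write(blob)
    mb = len(blob) / 2**20
    assert mb < 16, f"artifact {mb:.2f} MB exceeds the 16 MB limit"
    return mb
```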
Sequence Length
sequence_length
train_length: null
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"min_lr":0.1}
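Reading warmdown_frac: 0.85 as the fraction of training spent decaying and min_lr: 0.1 as a fraction of the peak LR (neither convention is spelled out in the record), the schedule holds the peak and then decays linearly:

```python
def warmdown_lr(step: int, total_steps: int, peak_lr: float,
                warmdown_frac: float = 0.85, min_lr_frac: float = 0.1) -> float:
    """Hold peak LR, then decay linearly over the last warmdown_frac of training.
    Assumes min_lr: 0.1 is a fraction of peak_lr and warmdown_frac: 0.85 is
    the fraction of total steps spent decaying."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return peak_lr
    t = (step - start) / max(total_steps - start, 1)  # 0 -> 1 over the warmdown
    return peak_lr * (1 - t * (1 - min_lr_frac))

# peak for the first 15% of steps, then linear decay toward 0.1 * peak
print([round(warmdown_lr(s, 100, 1.0), 2) for s in (0, 15, 50, 99)])
```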
Novel Contributions
- SP8192 tokenizer instead of SP1024 (see the tokenizer sketch after this list)
- Reduced to 10 transformer layers to fit the larger embedding table under the 16MB artifact limit
- BOS-fixed SmearGate stacked with LQER asymmetric quantization refinement and phased TTT
- Layer looping with sparse attention gating
- Achieved val_bpb 1.07171 with a 15.37 MB artifact
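Training the larger tokenizer referenced in the first bullet is a short SentencePiece call. The input path, model prefix, and BPE model type are assumptions; the record only gives the vocabulary size:

```python
import sentencepiece as spm

# Train an 8192-symbol SentencePiece model (up from SP1024). Input path and
# model prefix are placeholders, not the PR's actual files.
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="sp8192",
    vocab_size=8192,
    model_type="bpe",  # assumption: BPE; the record only states the size
)

sp = spm.SentencePieceProcessor(model_file="sp8192.model")
print(sp.encode("hello world", out_type=str))
```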