PR #2072

open

Record: SP8192 + BOS-Fix SmearGate + LQER Asym + Phased TTT (10L) — val_bpb 1.07171

val_bpb: 1.07171
Architecture: Transformer
Optimizer:
Artifact Size: 15.37 MB

Training Techniques

Architecture

  • SmearGate: BOS-boundary-fixed SmearGate attention/gating modification (parameters: null)
  • Weight tying: layer looping / tied-layer reuse across encoder-decoder-style paths (parameters: {"layers": 10})
  • KV head count: Transformer configuration with reduced KV heads (parameters: {"heads": 8, "kv_heads": 4})
  • Gated Attention: sparse attention gate used in the model (parameters: {"scale": 0.5})
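The BOS fix is the load-bearing change in the SmearGate entry above. Assuming SmearGate mixes each position's activation with its predecessor through a learned gate (the entry does not spell out the exact formulation), a minimal sketch of the boundary-safe version could look like this; the function name, shapes, and shift-and-mix form are all illustrative assumptions:

```python
import numpy as np

def smear_gate(x, gate, bos_mask):
    """Gated 'smear' of each position with its predecessor, with the BOS
    fix: positions flagged as BOS never mix in the previous sequence's
    activations. The shift-and-mix form and shapes here are assumptions;
    the entry does not specify SmearGate's exact formulation.

    x        : (T, D) activations
    gate     : (T,) learned gate values in [0, 1]
    bos_mask : (T,) bool, True at BOS positions
    """
    prev = np.vstack([x[:1], x[:-1]])       # predecessor activations (shift right)
    g = np.where(bos_mask, 0.0, gate)       # BOS fix: zero the gate at boundaries
    g[0] = 0.0                              # first position has no predecessor
    return (1.0 - g)[:, None] * x + g[:, None] * prev
```

Without the mask, a packed-sequence batch would smear the last token of one document into the first token of the next; zeroing the gate at BOS positions is the minimal repair.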
Quantization

  • GPTQ (bits: 6, scope: all)
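GPTQ proper chooses roundings with second-order (Hessian-based) error correction, which is beyond a few lines. As a hedged illustration of what the 6-bit storage format alone implies, here is a plain per-row round-to-nearest sketch, not the GPTQ algorithm itself:

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Per-row symmetric round-to-nearest quantization to `bits` bits.
    GPTQ additionally corrects rounding error column-by-column using
    second-order statistics; this sketch only shows the storage format
    implied by the 6-bit budget."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for signed 6-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)       # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At 6 bits per weight plus one scale per row, this is the kind of accounting that lets the checkpoint fit the 16 MB artifact limit.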
Other

  • LQER Asymmetric quantization refinement (parameters: null)
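LQER (Low-rank Quantization Error Reconstruction) refines a quantized weight by fitting a low-rank factorization to the quantization residual, so inference uses W_q + A @ B. A minimal sketch of the plain variant via truncated SVD; the asymmetric refinement named above additionally weights the residual by activation statistics, which is omitted here:

```python
import numpy as np

def lqer_refine(w, w_q, rank=4):
    """LQER-style refinement (sketch): fit a rank-r factorization to the
    quantization residual w - w_q via truncated SVD, so inference can use
    w_q + a @ b. The asymmetric variant additionally weights the residual
    by activation statistics before the SVD; that weighting is omitted."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # (m, r), singular values folded into a
    b = vt[:rank]                  # (r, n)
    return a, b
```

Because truncated SVD is the best rank-r approximation of the residual, the refined weight is never farther from the original than the raw quantized one.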
Test-Time Training

  • Score-first TTT (parameters: {"rank": 96, "phases": 1})
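As a sketch of what a rank-constrained, single-phase test-time update can look like: a frozen weight receives an additive rank-r adapter fitted by a few gradient steps on the observed test data. The squared-error objective, initialization, and step counts below are illustrative assumptions; the entry's actual score-first objective is not specified:

```python
import numpy as np

def ttt_lowrank(W, X, Y, rank=4, steps=100, lr=0.02):
    """Test-time training sketch: keep W frozen and fit a rank-r additive
    update A @ B by full-batch gradient descent on observed test pairs
    (X, Y). One fitting pass corresponds to one 'phase' ({"phases": 1});
    `rank` plays the role of the entry's "rank": 96. The squared-error
    loss is a stand-in for the unspecified score-first objective."""
    d_out, d_in = W.shape
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(d_out, rank))  # small random init
    B = np.zeros((rank, d_in))                     # zero init: start exactly at W
    n = X.shape[1]
    for _ in range(steps):
        R = (W + A @ B) @ X - Y                    # residual on the test data
        gA = 2.0 / n * R @ (B @ X).T
        gB = 2.0 / n * A.T @ R @ X.T
        A -= lr * gA
        B -= lr * gB
    return W + A @ B
```

The zero-initialized B means the adapted weight starts exactly at W, so a phase can only move it if the test data provides signal.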
Compression

  • brotli (level: 11)
Sequence Length

  • train_length: null, eval_length: null
LR Schedule

  • warmdown (parameters: {"warmdown_frac": 0.85, "min_lr": 0.1})
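The warmdown schedule reads as: hold the base LR constant, then decay linearly to a floor. A sketch under two assumptions the entry leaves implicit: that warmdown_frac 0.85 is the fraction of training spent in the decay phase, and that min_lr 0.1 is a fraction of the base LR rather than an absolute value:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.85, min_lr_frac=0.1):
    """Constant LR followed by linear decay. Assumptions (the entry does
    not spell out either convention): the warmdown covers the final
    `warmdown_frac` of training, and `min_lr_frac` scales the base LR."""
    start = (1.0 - warmdown_frac) * total_steps   # decay begins here
    if step < start:
        return base_lr
    t = (step - start) / max(total_steps - start, 1e-9)   # 0 -> 1 over the warmdown
    return base_lr * (1.0 - t * (1.0 - min_lr_frac))
```

With these parameters the LR is flat for the first 15% of steps, then ramps linearly down to 10% of the base LR by the final step.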

Novel Contributions

  • SP8192 tokenizer in place of SP1024
  • Reduced to 10 transformer layers to fit the larger embedding table under the 16 MB artifact limit
  • Combined stack: BOS-fixed SmearGate, LQER asymmetric quantization refinement, and phased TTT
  • Layer looping with sparse attention gating
  • Achieved val_bpb 1.07171 with a 15.37 MB artifact