PR #1974
openSP8192 + Depth Recurrence + Parallel Residuals + TTT + SDCLIP + GPTQ-Brotli — 1.2192 BPB (LLMAdvisor.ai)
by harborglowvintage-oss
val_bpb: 1.2193
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,457,746 bytes
Training Techniques
Architecture
depth recurrence
Layers 3–5 use residual unrolling with NUM_LOOPS=2.
parameters: {"layers":[3,4,5],"num_loops":2}
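The recurrence above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `forward` and the per-layer callables are hypothetical stand-ins for transformer blocks, and the residual connection is assumed to live inside each block.

```python
# Depth recurrence via residual unrolling: layers 3-5 are each applied
# NUM_LOOPS times, reusing the same weights (per the PR's parameters).
NUM_LOOPS = 2
RECURRENT_LAYERS = {3, 4, 5}

def forward(x, layers):
    """Run a stack of layers, looping the recurrent ones NUM_LOOPS times."""
    for i, layer in enumerate(layers):
        loops = NUM_LOOPS if i in RECURRENT_LAYERS else 1
        for _ in range(loops):
            x = layer(x)  # residual connection assumed inside the layer
    return x
```

With 8 layers and loops of 2 on layers 3–5, the model executes 11 block applications per token while storing only 8 blocks' worth of weights.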
parallel residuals
Parallel residual bypass applied to layers 7+.
parameters: {"layers_start":7}
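A sketch of the parallel-residual form (the GPT-J/PaLM-style block) versus the standard sequential block, assuming pre-norm blocks; `attn`, `mlp`, and `norm` are hypothetical stand-ins:

```python
LAYERS_START = 7  # parallel residual applies to layers 7 and above

def sequential_block(x, attn, mlp, norm):
    # Standard pre-norm block: the MLP sees the attention output.
    x = x + attn(norm(x))
    x = x + mlp(norm(x))
    return x

def parallel_block(x, attn, mlp, norm):
    # Parallel residual: attention and MLP both read the same input,
    # and their outputs are summed into a single residual update.
    return x + attn(norm(x)) + mlp(norm(x))

def block(x, layer_idx, attn, mlp, norm):
    fn = parallel_block if layer_idx >= LAYERS_START else sequential_block
    return fn(x, attn, mlp, norm)
```

The parallel form lets the attention and MLP computations run concurrently at the cost of the MLP no longer conditioning on the attention output.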
GQA
Transformer uses 8 attention heads with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
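With 8 query heads and 4 KV heads, grouped-query attention shares each KV head between 2 query heads, halving the KV cache. The head-to-group mapping is just integer division:

```python
N_HEADS, N_KV_HEADS = 8, 4
GROUP_SIZE = N_HEADS // N_KV_HEADS  # 2 query heads share each KV head

def kv_head_for(query_head: int) -> int:
    """Index of the KV head a given query head attends with."""
    return query_head // GROUP_SIZE
```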
Quantization
GPTQ
bits: 6
scope: all
Compression
Brotli
level: null
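A sketch of the artifact pipeline: 6-bit quantization followed by Brotli compression of the packed weights. This uses naive round-to-nearest as a simplification (real GPTQ additionally applies Hessian-based error correction), packs one byte per value rather than true 6-bit packing, and falls back to zlib if the third-party `brotli` package is unavailable:

```python
try:
    import brotli  # the PR's artifact uses Brotli; pip install brotli
    compress = brotli.compress
except ImportError:
    import zlib
    compress = zlib.compress  # stdlib stand-in so the sketch stays runnable

def quantize_6bit(weights):
    """Map floats to signed 6-bit integers in [-32, 31] with one scale."""
    scale = max(abs(w) for w in weights) / 31 or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def pack_and_compress(q):
    # One byte per value for clarity; a real artifact would bit-pack 6 bits.
    raw = bytes((v + 32) for v in q)
    return compress(raw)
```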
Evaluation
sliding window eval
parameters: {"stride":64}
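Sliding-window evaluation with a short stride scores each token once with near-full context by re-encoding overlapping windows and only counting the not-yet-scored tail of each. A minimal sketch of the window schedule (the exact bookkeeping in the PR is an assumption):

```python
def sliding_windows(seq_len, window, stride=64):
    """Yield (begin, end, score_from): each window spans [begin, end),
    but only tokens from score_from onward contribute to the loss,
    so every token is scored exactly once."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        yield begin, end, prev_end
        prev_end = end
        if end == seq_len:
            break
```

A stride of 64 with a long window means almost every scored token sees close to a full window of left context, at the cost of many overlapping forward passes.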
Test-Time Training
full TTT
parameters: {"epochs":1,"learning_rate":0.005,"momentum":0.9,"chunk_tokens":32768}
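The stated parameters describe one epoch of SGD with momentum over 32,768-token chunks of the evaluation stream. A minimal sketch with a hypothetical `grads_fn` standing in for backprop through a chunk's language-modeling loss:

```python
LR, MOMENTUM, EPOCHS = 0.005, 0.9, 1  # chunk_tokens=32768 per the PR

def ttt_update(params, grads_fn, chunks):
    """One epoch of test-time training: SGD with momentum,
    one step per chunk of the evaluation stream."""
    velocity = [0.0] * len(params)
    for _ in range(EPOCHS):
        for chunk in chunks:
            grads = grads_fn(params, chunk)
            for i, g in enumerate(grads):
                velocity[i] = MOMENTUM * velocity[i] + g
                params[i] -= LR * velocity[i]
    return params
```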
Other
other
SDCLIP (Stable Divergence Clipping) stabilizes TTT inference updates by clipping steps when KL divergence exceeds a threshold.
parameters: {"steps":20}
Sequence Length
sequence_length
train_length: null
eval_length: 32768
LR Schedule
cosine decay
parameters: {"warmdown_fraction":0.72}
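One reading of `warmdown_fraction: 0.72` (an assumption; the PR does not spell out the schedule shape): the learning rate stays flat for the first 28% of training, then cosine-decays to zero over the final 72%:

```python
import math

WARMDOWN_FRACTION = 0.72  # final 72% of steps decay; first 28% stay flat

def lr_at(step, total_steps, base_lr):
    """Constant LR, then cosine decay to zero over the warmdown window."""
    warmdown_start = (1 - WARMDOWN_FRACTION) * total_steps
    if step < warmdown_start:
        return base_lr
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```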
Weight Averaging
EMA
parameters: {"decay":0.995}
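The EMA of the weights with decay 0.995 is the standard exponential moving average, evaluated in place of the raw weights:

```python
DECAY = 0.995

def ema_update(ema_params, params):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current."""
    for i, p in enumerate(params):
        ema_params[i] = DECAY * ema_params[i] + (1 - DECAY) * p
    return ema_params
```

With decay 0.995 the average has an effective horizon of roughly 1 / (1 - 0.995) = 200 recent checkpoints.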
Regularization
logit softcap
parameters: {"value":20}
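Logit softcapping with value 20 squashes logits smoothly into (-20, 20) using the common `cap * tanh(x / cap)` form, which is near-identity for small logits:

```python
import math

SOFTCAP = 20.0

def softcap(logits):
    """Soft-cap logits into (-SOFTCAP, SOFTCAP) via cap * tanh(x / cap)."""
    return [SOFTCAP * math.tanh(x / SOFTCAP) for x in logits]
```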
Novel Contributions
- SP8192 bespoke SentencePiece BPE tokenizer
- Depth recurrence in layers 3–5 with residual unrolling
- Parallel residuals applied from layer 7 onward
- Test-time training with SDCLIP stabilization
- GPTQ int6 quantization combined with Brotli compression
- Sliding-window evaluation with stride 64