PR #1974
openSP8192 + Depth Recurrence + Parallel Residuals + TTT + SDCLIP + GPTQ-Brotli — 1.2192 BPB (LLMAdvisor.ai)
by harborglowvintage-oss
val_bpb: 1.2193
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,457,746 bytes
Training Techniques
Architecture
depth recurrence
Layers 3–5 use residual unrolling with NUM_LOOPS=2.
parameters: {"layers":[3,4,5],"num_loops":2}
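The recurrence above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `forward` and the per-layer callables are hypothetical stand-ins for transformer blocks, and the residual connection is assumed to live inside each block.

```python
# Depth recurrence via residual unrolling: layers 3-5 are each applied
# NUM_LOOPS times, reusing the same weights (per the PR's parameters).
NUM_LOOPS = 2
RECURRENT_LAYERS = {3, 4, 5}

def forward(x, layers):
    """Run a stack of layers, looping the recurrent ones NUM_LOOPS times."""
    for i, layer in enumerate(layers):
        loops = NUM_LOOPS if i in RECURRENT_LAYERS else 1
        for _ in range(loops):
            x = layer(x)  # residual connection assumed inside the layer
    return x
```

With 8 layers and loops of 2 on layers 3–5, the model executes 11 block applications per token while storing only 8 blocks' worth of weights.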
parallel residuals
Parallel residual bypass applied to layers 7+.
parameters: {"layers_start":7}
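A sketch of the parallel-residual form (the GPT-J/PaLM-style block) versus the standard sequential block, assuming pre-norm blocks; `attn`, `mlp`, and `norm` are hypothetical stand-ins:

```python
LAYERS_START = 7  # parallel residual applies to layers 7 and above

def sequential_block(x, attn, mlp, norm):
    # Standard pre-norm block: the MLP sees the attention output.
    x = x + attn(norm(x))
    x = x + mlp(norm(x))
    return x

def parallel_block(x, attn, mlp, norm):
    # Parallel residual: attention and MLP both read the same input,
    # and their outputs are summed into a single residual update.
    return x + attn(norm(x)) + mlp(norm(x))

def block(x, layer_idx, attn, mlp, norm):
    fn = parallel_block if layer_idx >= LAYERS_START else sequential_block
    return fn(x, attn, mlp, norm)
```

The parallel form lets the attention and MLP computations run concurrently at the cost of the MLP no longer conditioning on the attention output.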
GQA
Transformer uses 8 attention heads with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
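With 8 query heads and 4 KV heads, grouped-query attention shares each KV head between 2 query heads, halving the KV cache. The head-to-group mapping is just integer division:

```python
N_HEADS, N_KV_HEADS = 8, 4
GROUP_SIZE = N_HEADS // N_KV_HEADS  # 2 query heads share each KV head

def kv_head_for(query_head: int) -> int:
    """Index of the KV head a given query head attends with."""
    return query_head // GROUP_SIZE
```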
Quantization
GPTQ
bits: 6
scope: all
Compression
Brotli
level: null
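A sketch of the artifact pipeline: 6-bit quantization followed by Brotli compression of the packed weights. This uses naive round-to-nearest as a simplification (real GPTQ additionally applies Hessian-based error correction), packs one byte per value rather than true 6-bit packing, and falls back to zlib if the third-party `brotli` package is unavailable:

```python
try:
    import brotli  # the PR's artifact uses Brotli; pip install brotli
    compress = brotli.compress
except ImportError:
    import zlib
    compress = zlib.compress  # stdlib stand-in so the sketch stays runnable

def quantize_6bit(weights):
    """Map floats to signed 6-bit integers in [-32, 31] with one scale."""
    scale = max(abs(w) for w in weights) / 31 or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def pack_and_compress(q):
    # One byte per value for clarity; a real artifact would bit-pack 6 bits.
    raw = bytes((v + 32) for v in q)
    return compress(raw)
```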
Evaluation
sliding window eval
parameters: {"stride":64}
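Sliding-window evaluation with a short stride scores each token once with near-full context by re-encoding overlapping windows and only counting the not-yet-scored tail of each. A minimal sketch of the window schedule (the exact bookkeeping in the PR is an assumption):

```python
def sliding_windows(seq_len, window, stride=64):
    """Yield (begin, end, score_from): each window spans [begin, end),
    but only tokens from score_from onward contribute to the loss,
    so every token is scored exactly once."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        yield begin, end, prev_end
        prev_end = end
        if end == seq_len:
            break
```

A stride of 64 with a long window means almost every scored token sees close to a full window of left context, at the cost of many overlapping forward passes.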
Test-Time Training
full TTT
parameters: {"epochs":1,"learning_rate":0.005,"momentum":0.9,"chunk_tokens":32768}
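The stated parameters describe one epoch of SGD with momentum over 32,768-token chunks of the evaluation stream. A minimal sketch with a hypothetical `grads_fn` standing in for backprop through a chunk's language-modeling loss:

```python
LR, MOMENTUM, EPOCHS = 0.005, 0.9, 1  # chunk_tokens=32768 per the PR

def ttt_update(params, grads_fn, chunks):
    """One epoch of test-time training: SGD with momentum,
    one step per chunk of the evaluation stream."""
    velocity = [0.0] * len(params)
    for _ in range(EPOCHS):
        for chunk in chunks:
            grads = grads_fn(params, chunk)
            for i, g in enumerate(grads):
                velocity[i] = MOMENTUM * velocity[i] + g
                params[i] -= LR * velocity[i]
    return params
```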
Other
other
SDCLIP (Stable Divergence Clipping) stabilizes TTT inference updates by clipping steps when KL divergence exceeds a threshold.
parameters: {"steps":20}
Sequence Length
sequence_length
train_length: null
eval_length: 32768
LR Schedule
cosine decay
parameters: {"warmdown_fraction":0.72}
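One reading of `warmdown_fraction: 0.72` (an assumption; the PR does not spell out the schedule shape): the learning rate stays flat for the first 28% of training, then cosine-decays to zero over the final 72%:

```python
import math

WARMDOWN_FRACTION = 0.72  # final 72% of steps decay; first 28% stay flat

def lr_at(step, total_steps, base_lr):
    """Constant LR, then cosine decay to zero over the warmdown window."""
    warmdown_start = (1 - WARMDOWN_FRACTION) * total_steps
    if step < warmdown_start:
        return base_lr
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```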
Weight Averaging
EMA
parameters: {"decay":0.995}
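The EMA of the weights with decay 0.995 is the standard exponential moving average, evaluated in place of the raw weights:

```python
DECAY = 0.995

def ema_update(ema_params, params):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current."""
    for i, p in enumerate(params):
        ema_params[i] = DECAY * ema_params[i] + (1 - DECAY) * p
    return ema_params
```

With decay 0.995 the average has an effective horizon of roughly 1 / (1 - 0.995) = 200 recent checkpoints.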
Regularization
logit softcap
parameters: {"value":20}
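Logit softcapping with value 20 squashes logits smoothly into (-20, 20) using the common `cap * tanh(x / cap)` form, which is near-identity for small logits:

```python
import math

SOFTCAP = 20.0

def softcap(logits):
    """Soft-cap logits into (-SOFTCAP, SOFTCAP) via cap * tanh(x / cap)."""
    return [SOFTCAP * math.tanh(x / SOFTCAP) for x in logits]
```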
Novel Contributions
- SP8192 bespoke SentencePiece BPE tokenizer
- Depth recurrence in layers 3–5 with residual unrolling
- Parallel residuals applied from layer 7 onward
- Test-time training with SDCLIP stabilization
- GPTQ int6 quantization combined with Brotli compression
- Sliding-window evaluation with stride 64