PR #1973 (closed)
SP8192 + Depth Recurrence + Parallel Residuals + TTT + SDCLIP + GPTQ-Brotli — 1.2192 BPB (LLMAdvisor.ai) [SUPERSEDED]
by harborglowvintage-oss
val_bpb
1.2193
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,457,746 bytes
Training Techniques
Architecture
SP8192
Bespoke SentencePiece BPE vocabulary with 8192 tokens.
parameters: {"vocab_size":8192}
depth recurrence
Residual unrolling across selected layers.
parameters: {"layers":[3,4,5],"num_loops":2}
Parallel Residuals
Attention and MLP branches computed in parallel from the same input, rather than sequentially, in later layers.
parameters: {"layers_start":7}
weight tying
Tied input/output embeddings.
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
ReLU²
Squared ReLU activation in the MLP.
parameters: null
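Squared ReLU is a one-liner; a scalar sketch for clarity (the real activation is applied elementwise to the MLP's hidden tensor):

```python
def relu2(x):
    """Squared ReLU: max(x, 0) ** 2 — zero for negative inputs, x^2 otherwise."""
    return max(x, 0.0) ** 2
```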
U-Net skip connections
U-Net style skip connections in the transformer.
parameters: null
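One common way to realize U-Net skips in a transformer is to save the residual stream in the first half of the layers and add it back, last-in-first-out, in the second half. The PR does not spell out its pairing scheme, so this is a sketch of that common pattern, not the PR's exact wiring:

```python
def forward_unet(x, layers):
    """Residual blocks with U-Net style skips pairing layer i with layer n-1-i."""
    n = len(layers)
    stack = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            x = x + layer(x)
            stack.append(x)      # save encoder-half activation
        else:
            x = x + stack.pop()  # skip connection from the mirrored layer
            x = x + layer(x)
    return x
```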
Test-Time Training
Test-time training: the model continues to be updated on chunks of the evaluation stream during inference.
parameters: {"epochs":1,"learning_rate":0.005,"momentum":0.9,"chunk_size":32000}
Other
SDCLIP
Stable Divergence Clipping: prevents divergent TTT inference updates by clipping gradient steps when KL divergence exceeds a threshold.
parameters: {"steps":20}
Quantization
GPTQ
bits: 6
scope: model weights
Compression
Brotli
level: null
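GPTQ itself solves a layer-wise least-squares quantization problem; the listing only fixes the bit-width, so this sketch shows just the uniform 6-bit grid (2^6 = 64 levels) that the weight codes land on. Brotli would then be applied to the packed integer codes; that step is omitted here. Names are illustrative.

```python
def quantize_6bit(w, w_min, w_max):
    """Map a weight onto a uniform 6-bit grid over [w_min, w_max]."""
    levels = 2 ** 6 - 1                 # 63 intervals, 64 representable values
    scale = (w_max - w_min) / levels
    q = round((w - w_min) / scale)      # integer code in [0, 63]
    return q, w_min + q * scale         # code and dequantized value
```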
Initialization
OrthoInit
Orthogonal initialization with muP-scaled outputs.
Weight Averaging
SWA
parameters: {"every":30,"start_frac":0.5}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"embed_scalar_optimizer":"AdamW","embed_scalar_lr":0.02}
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"magnitude_pruning":"3%"}
Novel Contributions
- SP8192 bespoke SentencePiece vocabulary
- Depth recurrence across layers 3-5
- Parallel residual bypass in later layers
- TTT with SDCLIP stabilization
- GPTQ int6 quantization with Brotli compression
- SWA-based checkpoint averaging