PR #1973 (closed)
SP8192 + Depth Recurrence + Parallel Residuals + TTT + SDCLIP + GPTQ-Brotli — 1.2192 BPB (LLMAdvisor.ai) [SUPERSEDED]
by harborglowvintage-oss
val_bpb
1.2193
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,457,746 bytes
Training Techniques
Architecture
SP8192
Bespoke SentencePiece BPE vocabulary with 8192 tokens.
parameters: {"vocab_size":8192}
depth recurrence
Residual unrolling across selected layers.
parameters: {"layers":[3,4,5],"num_loops":2}
Parallel Residuals
Attention and MLP branches computed in parallel from the same input, rather than sequentially, in later layers.
parameters: {"layers_start":7}
weight tying
Tied input/output embeddings.
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
ReLU²
Squared ReLU activation in the MLP.
parameters: null
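Squared ReLU is a one-liner; a scalar sketch for clarity (the real activation is applied elementwise to the MLP's hidden tensor):

```python
def relu2(x):
    """Squared ReLU: max(x, 0) ** 2 — zero for negative inputs, x^2 otherwise."""
    return max(x, 0.0) ** 2
```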
U-Net skip connections
U-Net style skip connections in the transformer.
parameters: null
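One common way to realize U-Net skips in a transformer is to save the residual stream in the first half of the layers and add it back, last-in-first-out, in the second half. The PR does not spell out its pairing scheme, so this is a sketch of that common pattern, not the PR's exact wiring:

```python
def forward_unet(x, layers):
    """Residual blocks with U-Net style skips pairing layer i with layer n-1-i."""
    n = len(layers)
    stack = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            x = x + layer(x)
            stack.append(x)      # save encoder-half activation
        else:
            x = x + stack.pop()  # skip connection from the mirrored layer
            x = x + layer(x)
    return x
```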
Test-Time Training
Test-time training: the model continues to be updated on chunks of the evaluation stream during inference.
parameters: {"epochs":1,"learning_rate":0.005,"momentum":0.9,"chunk_size":32000}
Other
SDCLIP
Stable Divergence Clipping: prevents divergent TTT inference updates by clipping gradient steps when KL divergence exceeds a threshold.
parameters: {"steps":20}
Quantization
GPTQ
bits: 6
scope: model weights
Compression
Brotli
level: null
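GPTQ itself solves a layer-wise least-squares quantization problem; the listing only fixes the bit-width, so this sketch shows just the uniform 6-bit grid (2^6 = 64 levels) that the weight codes land on. Brotli would then be applied to the packed integer codes; that step is omitted here. Names are illustrative.

```python
def quantize_6bit(w, w_min, w_max):
    """Map a weight onto a uniform 6-bit grid over [w_min, w_max]."""
    levels = 2 ** 6 - 1                 # 63 intervals, 64 representable values
    scale = (w_max - w_min) / levels
    q = round((w - w_min) / scale)      # integer code in [0, 63]
    return q, w_min + q * scale         # code and dequantized value
```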
Initialization
OrthoInit
Orthogonal initialization with muP-scaled outputs.
Weight Averaging
SWA
parameters: {"every":30,"start_frac":0.5}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"embed_scalar_optimizer":"AdamW","embed_scalar_lr":0.02}
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"magnitude_pruning":"3%"}
Novel Contributions
- SP8192 bespoke SentencePiece vocabulary
- Depth recurrence across layers 3-5
- Parallel residual bypass in later layers
- TTT with SDCLIP stabilization
- GPTQ int6 quantization with Brotli compression
- SWA-based checkpoint averaging