PR #960

open

Preliminary: 11L VRL + Full GPTQ + Parallel Muon + Legal TTT — val_bpb 1.1882 (ADIITJ)

val_bpb
1.1882
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
18,816,038 bytes

Training Techniques

Architecture
Value Residual
Layer 0 V output is blended into all subsequent layers via learned per-layer sigmoid gates.
parameters: {"layers":11}
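A minimal numpy sketch of the gated value-residual blend described above. The parameterization (one learned scalar gate logit per layer, passed through a sigmoid) follows the description; shapes and the gate value are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blend_value(v_layer, v0, gate_logit):
    """Blend this layer's V with layer 0's V via a learned sigmoid gate.

    gate_logit is a learned scalar (one per layer); g = sigmoid(gate_logit).
    """
    g = sigmoid(gate_logit)
    return g * v_layer + (1.0 - g) * v0

# toy usage: seq_len=4, head_dim=8 (illustrative sizes)
rng = np.random.default_rng(0)
v0 = rng.standard_normal((4, 8))   # layer-0 value output
v3 = rng.standard_normal((4, 8))   # current layer's value output
mixed = blend_value(v3, v0, gate_logit=0.5)
```

At gate_logit → +∞ the layer uses only its own V; at 0 it averages the two equally.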
BigramHash
Bigram hash embedding expanded from 1536 to 3072.
parameters: {"dimensions":3072}
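A sketch of a hashed bigram embedding at the expanded width of 3072. The hash-table size, multiplier, and BOS padding are illustrative assumptions; only the embedding dimension comes from the listing above.

```python
import numpy as np

TABLE = 1 << 16   # assumed hash-table size
DIM = 3072        # expanded embedding width (was 1536)

rng = np.random.default_rng(0)
table = (rng.standard_normal((TABLE, DIM)) * 0.02).astype(np.float32)

def bigram_embed(tokens):
    """Look up a hashed (prev, current) token-pair embedding per position.

    Position 0 is paired with an assumed BOS id of 0; the multiplicative
    hash constant 1000003 is an illustrative choice.
    """
    prev = np.concatenate([[0], tokens[:-1]])
    idx = (prev * 1000003 + tokens) % TABLE
    return table[idx]

emb = bigram_embed(np.array([5, 17, 42]))
```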
Quantization
GPTQ
bits: 6
scope: all
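A minimal sketch of the GPTQ core at 6 bits: columns are quantized in order, and each column's quantization error is propagated to the not-yet-quantized columns through the Cholesky factor of the inverse Hessian H = XᵀX. Per-row symmetric scales and the damping value are illustrative assumptions; the PR's actual calibration setup is not shown here.

```python
import numpy as np

def quant6(w, scale):
    """Symmetric 6-bit round-to-grid: 64 integer levels in [-32, 31]."""
    return np.clip(np.round(w / scale), -32, 31) * scale

def gptq_int6(W, H, damp=0.01):
    """Minimal GPTQ sketch for a weight matrix W (out_dim x in_dim)."""
    W = W.copy()
    n = W.shape[1]
    # damp the Hessian for numerical stability, then take the upper
    # Cholesky factor U of H^-1 (so H^-1 = U^T U)
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(n)
    U = np.linalg.cholesky(np.linalg.inv(Hd)).T
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0 + 1e-12
    Q = np.zeros_like(W)
    for i in range(n):
        q = quant6(W[:, i:i+1], scale)
        Q[:, i:i+1] = q
        # propagate this column's error into the remaining columns
        err = (W[:, i:i+1] - q) / U[i, i]
        W[:, i+1:] -= err @ U[i:i+1, i+1:]
    return Q, scale
```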
Weight Averaging
Tight SWA
parameters: null
EMA
parameters: {"decay":0.999,"every_steps":10}
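A sketch of the EMA branch with the parameters listed above (decay 0.999, updated every 10 steps). The dict-of-arrays parameter representation is an illustrative assumption; the tight-SWA snapshot average is not shown.

```python
import numpy as np

class EMA:
    """Exponential moving average of weights, blended every `every_steps` steps."""

    def __init__(self, params, decay=0.999, every_steps=10):
        self.shadow = {k: v.copy() for k, v in params.items()}
        self.decay, self.every, self.step = decay, every_steps, 0

    def update(self, params):
        self.step += 1
        if self.step % self.every:
            return  # only blend on every 10th step
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v
```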
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
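A single-device sketch of the Muon core: momentum followed by Newton–Schulz orthogonalization of the update. The quintic coefficients are those from the Muon reference implementation; the learning rate and momentum here are illustrative assumptions, and the "parallel" part (sharding the orthogonalization across devices) is omitted.

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Odd quintic Newton-Schulz iteration that pushes G toward an
    orthogonal matrix (approximately unit singular values)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # keep the Gram matrix on the small side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, G, momentum_buf, lr=0.02, beta=0.95):
    """One Muon step: accumulate momentum, orthogonalize it, apply."""
    momentum_buf = beta * momentum_buf + G
    return W - lr * newton_schulz5(momentum_buf), momentum_buf
```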
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.008,"epochs":3,"chunk_size":256,"eval_seq_len":1024,"batch_size":64,"min_doc_len":512}
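A sketch of the rank-8 LoRA adapter trained at test time: the frozen base weight W gets a low-rank correction BA, and only A and B are updated during TTT. The rank comes from the listing above; the layer dimensions and which layers get adapters are illustrative assumptions.

```python
import numpy as np

rank, d_in, d_out = 8, 64, 64            # rank=8 from the listing; dims illustrative
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable during TTT
B = np.zeros((d_out, rank))                    # zero-init: adapter starts as a no-op

def lora_forward(x):
    """y = x W^T + (x A^T) B^T -- only A and B receive TTT gradients."""
    return x @ W.T + (x @ A.T) @ B.T
```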
LR Schedule
cosine decay
parameters: {"start_lr":0.008,"end_lr":0.00001}
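The schedule above can be sketched directly from its two endpoints; start_lr and end_lr are the listed values, and the step/epoch granularity is an assumption.

```python
import math

def cosine_lr(step, total_steps, start_lr=0.008, end_lr=1e-5):
    """Cosine decay from start_lr to end_lr over the TTT run."""
    t = min(step / max(total_steps, 1), 1.0)
    return end_lr + 0.5 * (start_lr - end_lr) * (1 + math.cos(math.pi * t))
```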
Regularization
logit softcap
parameters: {"value":30}
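Logit softcapping with the listed value of 30 is the standard tanh bound, which is linear for small logits and saturates smoothly at ±30:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```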
magnitude pruning
parameters: {"sparsity":0.03}
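A sketch of magnitude pruning at the listed 3% sparsity. A single global threshold over all weights is assumed here; the PR may prune per-tensor instead.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.03):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    # threshold = k-th smallest absolute value
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    out = W.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```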
Compression
zstd
level: 22
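The artifact compression above corresponds to zstd at its maximum level; note that levels above 19 require the `--ultra` flag. The filename is a placeholder.

```shell
# compress at level 22 (max); --ultra unlocks levels 20-22
zstd --ultra -22 model_artifact.bin -o model_artifact.bin.zst

# decompress for evaluation
zstd -d model_artifact.bin.zst -o model_artifact_restored.bin
```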
Initialization
OrthoInit
Orthogonal weight initialization, with scaled initialization for the projection layers.
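A sketch of orthogonal initialization via QR of a Gaussian matrix; the gain factor stands in for the "scaled projection" part, whose exact per-layer value is an assumption.

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal init: QR-decompose a Gaussian matrix, rescale by gain."""
    rows, cols = shape
    rng = np.random.default_rng(seed)
    flat = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(flat)
    q = q * np.sign(np.diag(r))   # fix signs so the decomposition is unique
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]
```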

Novel Contributions

  • Value Residual Learning (VRL) with learned per-layer sigmoid gates
  • Full GPTQ quantization with Hessian Cholesky int6 calibration and error propagation
  • BigramHash expansion from 1536 to 3072
  • Tight SWA over EMA when snapshots exist
  • Cosine LR annealing for LoRA TTT
  • Lower TTT base learning rate and a reduced minimum document length for adaptation