PR #960

open

Preliminary: 11L VRL + Full GPTQ + Parallel Muon + Legal TTT — val_bpb 1.1882 (ADIITJ)

val_bpb
1.1882
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
18,816,038 bytes

Training Techniques

Architecture
Value Residual
Layer 0 V output is blended into all subsequent layers via learned per-layer sigmoid gates.
parameters: {"layers":11}
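A minimal numpy sketch of the gated value-residual blend described above. The parameterization (one learned scalar gate logit per layer, passed through a sigmoid) follows the description; shapes and the gate value are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blend_value(v_layer, v0, gate_logit):
    """Blend this layer's V with layer 0's V via a learned sigmoid gate.

    gate_logit is a learned scalar (one per layer); g = sigmoid(gate_logit).
    """
    g = sigmoid(gate_logit)
    return g * v_layer + (1.0 - g) * v0

# toy usage: seq_len=4, head_dim=8 (illustrative sizes)
rng = np.random.default_rng(0)
v0 = rng.standard_normal((4, 8))   # layer-0 value output
v3 = rng.standard_normal((4, 8))   # current layer's value output
mixed = blend_value(v3, v0, gate_logit=0.5)
```

At gate_logit → +∞ the layer uses only its own V; at 0 it averages the two equally.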
BigramHash
Bigram hash embedding expanded from 1536 to 3072.
parameters: {"dimensions":3072}
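A sketch of a hashed bigram embedding at the expanded width of 3072. The hash-table size, multiplier, and BOS padding are illustrative assumptions; only the embedding dimension comes from the listing above.

```python
import numpy as np

TABLE = 1 << 16   # assumed hash-table size
DIM = 3072        # expanded embedding width (was 1536)

rng = np.random.default_rng(0)
table = (rng.standard_normal((TABLE, DIM)) * 0.02).astype(np.float32)

def bigram_embed(tokens):
    """Look up a hashed (prev, current) token-pair embedding per position.

    Position 0 is paired with an assumed BOS id of 0; the multiplicative
    hash constant 1000003 is an illustrative choice.
    """
    prev = np.concatenate([[0], tokens[:-1]])
    idx = (prev * 1000003 + tokens) % TABLE
    return table[idx]

emb = bigram_embed(np.array([5, 17, 42]))
```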
Quantization
GPTQ
bits: 6
scope: all
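A minimal sketch of the GPTQ core at 6 bits: columns are quantized in order, and each column's quantization error is propagated to the not-yet-quantized columns through the Cholesky factor of the inverse Hessian H = XᵀX. Per-row symmetric scales and the damping value are illustrative assumptions; the PR's actual calibration setup is not shown here.

```python
import numpy as np

def quant6(w, scale):
    """Symmetric 6-bit round-to-grid: 64 integer levels in [-32, 31]."""
    return np.clip(np.round(w / scale), -32, 31) * scale

def gptq_int6(W, H, damp=0.01):
    """Minimal GPTQ sketch for a weight matrix W (out_dim x in_dim)."""
    W = W.copy()
    n = W.shape[1]
    # damp the Hessian for numerical stability, then take the upper
    # Cholesky factor U of H^-1 (so H^-1 = U^T U)
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(n)
    U = np.linalg.cholesky(np.linalg.inv(Hd)).T
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0 + 1e-12
    Q = np.zeros_like(W)
    for i in range(n):
        q = quant6(W[:, i:i+1], scale)
        Q[:, i:i+1] = q
        # propagate this column's error into the remaining columns
        err = (W[:, i:i+1] - q) / U[i, i]
        W[:, i+1:] -= err @ U[i:i+1, i+1:]
    return Q, scale
```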
Weight Averaging
Tight SWA
parameters: null
EMA
parameters: {"decay":0.999,"every_steps":10}
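A sketch of the EMA branch with the parameters listed above (decay 0.999, updated every 10 steps). The dict-of-arrays parameter representation is an illustrative assumption; the tight-SWA snapshot average is not shown.

```python
import numpy as np

class EMA:
    """Exponential moving average of weights, blended every `every_steps` steps."""

    def __init__(self, params, decay=0.999, every_steps=10):
        self.shadow = {k: v.copy() for k, v in params.items()}
        self.decay, self.every, self.step = decay, every_steps, 0

    def update(self, params):
        self.step += 1
        if self.step % self.every:
            return  # only blend on every 10th step
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v
```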
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
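A single-device sketch of the Muon core: momentum followed by Newton–Schulz orthogonalization of the update. The quintic coefficients are those from the Muon reference implementation; the learning rate and momentum here are illustrative assumptions, and the "parallel" part (sharding the orthogonalization across devices) is omitted.

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Odd quintic Newton-Schulz iteration that pushes G toward an
    orthogonal matrix (approximately unit singular values)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # keep the Gram matrix on the small side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, G, momentum_buf, lr=0.02, beta=0.95):
    """One Muon step: accumulate momentum, orthogonalize it, apply."""
    momentum_buf = beta * momentum_buf + G
    return W - lr * newton_schulz5(momentum_buf), momentum_buf
```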
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.008,"epochs":3,"chunk_size":256,"eval_seq_len":1024,"batch_size":64,"min_doc_len":512}
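A sketch of the rank-8 LoRA adapter trained at test time: the frozen base weight W gets a low-rank correction BA, and only A and B are updated during TTT. The rank comes from the listing above; the layer dimensions and which layers get adapters are illustrative assumptions.

```python
import numpy as np

rank, d_in, d_out = 8, 64, 64            # rank=8 from the listing; dims illustrative
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable during TTT
B = np.zeros((d_out, rank))                    # zero-init: adapter starts as a no-op

def lora_forward(x):
    """y = x W^T + (x A^T) B^T -- only A and B receive TTT gradients."""
    return x @ W.T + (x @ A.T) @ B.T
```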
LR Schedule
cosine decay
parameters: {"start_lr":0.008,"end_lr":0.00001}
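The schedule above can be sketched directly from its two endpoints; start_lr and end_lr are the listed values, and the step/epoch granularity is an assumption.

```python
import math

def cosine_lr(step, total_steps, start_lr=0.008, end_lr=1e-5):
    """Cosine decay from start_lr to end_lr over the TTT run."""
    t = min(step / max(total_steps, 1), 1.0)
    return end_lr + 0.5 * (start_lr - end_lr) * (1 + math.cos(math.pi * t))
```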
Regularization
logit softcap
parameters: {"value":30}
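Logit softcapping with the listed value of 30 is the standard tanh bound, which is linear for small logits and saturates smoothly at ±30:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```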
magnitude pruning
parameters: {"sparsity":0.03}
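A sketch of magnitude pruning at the listed 3% sparsity. A single global threshold over all weights is assumed here; the PR may prune per-tensor instead.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.03):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    # threshold = k-th smallest absolute value
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    out = W.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```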
Compression
zstd
level: 22
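The artifact compression above corresponds to zstd at its maximum level; note that levels above 19 require the `--ultra` flag. The filename is a placeholder.

```shell
# compress at level 22 (max); --ultra unlocks levels 20-22
zstd --ultra -22 model_artifact.bin -o model_artifact.bin.zst

# decompress for evaluation
zstd -d model_artifact.bin.zst -o model_artifact_restored.bin
```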
Initialization
OrthoInit
Orthogonal weight initialization, with scaled initialization for the projection layers.
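A sketch of orthogonal initialization via QR of a Gaussian matrix; the gain factor stands in for the "scaled projection" part, whose exact per-layer value is an assumption.

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal init: QR-decompose a Gaussian matrix, rescale by gain."""
    rows, cols = shape
    rng = np.random.default_rng(seed)
    flat = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(flat)
    q = q * np.sign(np.diag(r))   # fix signs so the decomposition is unique
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]
```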

Novel Contributions

  • Value Residual Learning (VRL) with learned per-layer sigmoid gates
  • Full GPTQ quantization with Hessian Cholesky int6 calibration and error propagation
  • BigramHash expansion from 1536 to 3072
  • Tight SWA over EMA when snapshots exist
  • Cosine LR annealing for LoRA TTT
  • Lower TTT base learning rate and a reduced minimum document length for adaptation