PR #2027
openRecord: SP8192 QRescue + JEPA-Lite + LQER + Pergroup/lrzip + Legal TTT — val_bpb 1.08064
by H1cSuNtDr4C0n3S
val_bpb
1.0806
Architecture
Transformer
Optimizer
—
Artifact Size
15.70 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
GPTQ
bits: 6
scope: block weights
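The GPTQ entries above quantize different weight groups at 6 or 8 bits. As a much-simplified illustration of per-group k-bit weight quantization (round-to-nearest with a per-group scale; real GPTQ additionally applies Hessian-based error compensation, which is omitted here):

```python
import numpy as np

def quantize_groups(w, bits=6, group=64):
    """Per-group round-to-nearest k-bit quantization (simplified sketch;
    actual GPTQ also corrects error using second-order information)."""
    flat = w.reshape(-1, group)                     # one scale per group
    qmax = 2 ** (bits - 1) - 1                      # e.g. 31 for 6-bit
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # avoid divide-by-zero
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)             # dequantized weights
```

With 6 bits the per-weight error is bounded by half a quantization step per group.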
Architecture
depth recurrence
Uses recurrent depth layers in the base model lineage.
parameters: {"layers":[3,5]}
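The `{"layers":[3,5]}` parameter is terse; one plausible reading (assumed here, not confirmed by the record) is that layers in the index range [3, 5] are applied more than once per forward pass with shared weights:

```python
def forward(x, layers, recur=(3, 5), loops=2):
    """Depth-recurrence sketch (interpretation assumed): layers whose
    index falls in `recur` are applied `loops` times, reusing the same
    weights; all other layers run once."""
    for i, layer in enumerate(layers):
        reps = loops if recur[0] <= i <= recur[1] else 1
        for _ in range(reps):
            x = layer(x)
    return x
```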
weight tying
Tied input and output embeddings.
parameters: null
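Tied embeddings reuse one matrix for both the input lookup and the output projection, halving that parameter count. A minimal sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(16, 4))   # shared table: vocab 16, d_model 4

def embed(token_id):
    return emb[token_id]         # input side: row lookup

def logits(h):
    return emb @ h               # output head reuses the same matrix
```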
Partial RoPE
Applies rotary position embeddings to only a fraction of each head's dimensions.
parameters: {"ratio":"16/64"}
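With ratio 16/64, only the first 16 of each head's 64 channels are rotated and the rest pass through unchanged. A sketch of that partial rotation:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Partial RoPE sketch: rotate only `rot_dims` of each head's
    channels (16 of 64 per the ratio above). x: (seq, head_dim)."""
    seq, _ = x.shape
    half = rot_dims // 2
    inv = base ** (-np.arange(half) / half)       # per-pair frequencies
    ang = np.outer(np.arange(seq), inv)           # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```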
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
KV head count
Uses fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
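With 8 query heads over 4 KV heads, this is grouped-query attention: each KV head serves 8 // 4 = 2 query heads. A minimal numpy sketch:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv=4):
    """GQA sketch: KV heads are repeated to cover the query heads,
    shrinking the KV cache by n_heads / n_kv.
    q: (seq, n_heads, d); k, v: (seq, n_kv, d)."""
    group = n_heads // n_kv
    k = np.repeat(k, group, axis=1)      # (seq, n_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)        # softmax over key positions
    return np.einsum("hqk,khd->qhd", w, v)
```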
MLP3x
Uses widened MLP blocks in the base lineage.
parameters: {"multiplier":4}
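Combining the two MLP entries above (hidden width = multiplier × d_model, LeakyReLU with slope 0.5), a block sketch looks like:

```python
import numpy as np

def mlp(x, w_in, w_out, slope=0.5):
    """Widened MLP sketch: w_in projects d -> multiplier * d (4x above),
    LeakyReLU(slope=0.5) activation, w_out projects back to d."""
    h = x @ w_in                        # (..., 4 * d)
    h = np.where(h > 0, h, slope * h)   # LeakyReLU
    return h @ w_out
```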
Weight Averaging
EMA
parameters: {"decay":0.9965}
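The EMA update with the listed decay is the standard exponential moving average of the weights:

```python
def ema_update(avg, params, decay=0.9965):
    """EMA weight averaging: avg <- decay * avg + (1 - decay) * params,
    applied per tensor (dict-of-scalars here for illustration)."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in avg}
```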
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Other
other
QRescue/Hessian SDClip layer-group multipliers for GPTQ threshold selection.
parameters: null
other
Training-side JEPA-Lite predictor removed before serialization.
parameters: null
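Removing the training-only predictor before serialization keeps the JEPA-Lite auxiliary weights out of the scored artifact. A generic sketch (the `predictor.` prefix is a hypothetical naming choice, not taken from the record):

```python
def strip_predictor(state_dict, prefix="predictor."):
    """Drop training-only predictor tensors before saving the artifact
    (prefix is hypothetical); everything else is serialized as-is."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}
```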
other
LQER rank-4 residuals on selected projection matrices.
parameters: {"rank":4,"targets":["loop_mlp_proj","late_mlp_proj","attn_proj"]}
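LQER stores a low-rank factorization of the quantization error alongside the quantized matrix, so the effective weight is Q(W) + A·B. A sketch under that reading, using a truncated SVD of the residual:

```python
import numpy as np

def lqer_residual(w, quantize, rank=4):
    """LQER sketch: factor the quantization error W - Q(W) into a
    rank-`rank` SVD approximation kept next to Q(W)."""
    wq = quantize(w)
    u, s, vt = np.linalg.svd(w - wq, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (m, rank)
    b = vt[:rank]                     # (rank, n)
    return wq, a, b                   # apply weights as wq + a @ b
```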
Compression
lrzip
level: 9
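The per-group compression with roundtrip verification described below could look like the following command sequence (file names are hypothetical; requires the system `lrzip` binary):

```shell
# Level-9 lrzip compression of one weight group, then a lossless
# roundtrip check before the compressed artifact is kept.
lrzip -L 9 -o group0.bin.lrz group0.bin
lrzip -d -o group0.roundtrip group0.bin.lrz
cmp group0.bin group0.roundtrip && echo "lossless roundtrip OK"
```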
Test-Time Training
full TTT
parameters: {"protocol":"chunkwise_score_first_full_sgd","score_before_update":true,"no_rescore":true}
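The TTT protocol scores each chunk with the current weights before taking the SGD step on it, and never rescores earlier chunks. A toy numpy sketch of that loop (linear model and MSE are illustrative stand-ins):

```python
import numpy as np

def ttt_score_first(w, chunks, lr=0.1):
    """Chunkwise score-before-update TTT sketch: each chunk (x, y) is
    scored with the *current* weights, then one full-SGD step is taken;
    scored chunks are never revisited (no-rescore)."""
    total = 0.0
    for x, y in chunks:
        total += float(np.mean((x @ w - y) ** 2))   # score first
        grad = 2 * x.T @ (x @ w - y) / len(x)       # then update
        w = w - lr * grad
    return total / len(chunks), w
```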
Evaluation
sliding window eval
parameters: null
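Sliding-window evaluation scores each new stride of tokens with a bounded left context, counting each token once. A sketch with a pluggable scorer (`nll_fn` is a hypothetical interface; true bpb would divide total nats by the byte count rather than the token count):

```python
import math

def sliding_window_bits(tokens, nll_fn, window=8, stride=4):
    """Sliding-window eval sketch: advance `stride` tokens at a time,
    give the scorer up to `window` tokens of context, and count only
    the newly scored tokens."""
    total, counted = 0.0, 0
    for pos in range(0, len(tokens), stride):
        start = max(0, pos + stride - window)
        ctx = tokens[start:pos + stride]
        new = min(stride, len(tokens) - pos)
        total += nll_fn(ctx, n_score=new)   # nats for the last `new` tokens
        counted += new
    return total / counted / math.log(2)    # nats/token -> bits/token
```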
Novel Contributions
- Per-group artifact compression using system lrzip with lossless roundtrip verification
- Training-side JEPA-Lite with predictor removed before serialization
- LQER rank-4 residuals on selected projection matrices
- Legal score-first full-SGD TTT with chunkwise score-before-update and no-rescore protocol
- QRescue/Hessian SDClip multipliers for GPTQ threshold selection