PR #2027
openRecord: SP8192 QRescue + JEPA-Lite + LQER + Pergroup/lrzip + Legal TTT — val_bpb 1.08064
by H1cSuNtDr4C0n3S
val_bpb
1.0806
Architecture
Transformer
Optimizer
—
Artifact Size
15.70 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
GPTQ
bits: 6
scope: block weights
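The GPTQ entries above quantize different weight groups at 6 or 8 bits. As a much-simplified illustration of per-group k-bit weight quantization (round-to-nearest with a per-group scale; real GPTQ additionally applies Hessian-based error compensation, which is omitted here):

```python
import numpy as np

def quantize_groups(w, bits=6, group=64):
    """Per-group round-to-nearest k-bit quantization (simplified sketch;
    actual GPTQ also corrects error using second-order information)."""
    flat = w.reshape(-1, group)                     # one scale per group
    qmax = 2 ** (bits - 1) - 1                      # e.g. 31 for 6-bit
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # avoid divide-by-zero
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)             # dequantized weights
```

With 6 bits the per-weight error is bounded by half a quantization step per group.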
Architecture
depth recurrence
Uses recurrent depth layers in the base model lineage.
parameters: {"layers":[3,5]}
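The `{"layers":[3,5]}` parameter is terse; one plausible reading (assumed here, not confirmed by the record) is that layers in the index range [3, 5] are applied more than once per forward pass with shared weights:

```python
def forward(x, layers, recur=(3, 5), loops=2):
    """Depth-recurrence sketch (interpretation assumed): layers whose
    index falls in `recur` are applied `loops` times, reusing the same
    weights; all other layers run once."""
    for i, layer in enumerate(layers):
        reps = loops if recur[0] <= i <= recur[1] else 1
        for _ in range(reps):
            x = layer(x)
    return x
```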
weight tying
Tied input and output embeddings.
parameters: null
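Tied embeddings reuse one matrix for both the input lookup and the output projection, halving that parameter count. A minimal sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(16, 4))   # shared table: vocab 16, d_model 4

def embed(token_id):
    return emb[token_id]         # input side: row lookup

def logits(h):
    return emb @ h               # output head reuses the same matrix
```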
Partial RoPE
Applies rotary position embeddings to only a fraction of each head's dimensions.
parameters: {"ratio":"16/64"}
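With ratio 16/64, only the first 16 of each head's 64 channels are rotated and the rest pass through unchanged. A sketch of that partial rotation:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Partial RoPE sketch: rotate only `rot_dims` of each head's
    channels (16 of 64 per the ratio above). x: (seq, head_dim)."""
    seq, _ = x.shape
    half = rot_dims // 2
    inv = base ** (-np.arange(half) / half)       # per-pair frequencies
    ang = np.outer(np.arange(seq), inv)           # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```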
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
KV head count
Uses fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
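With 8 query heads over 4 KV heads, this is grouped-query attention: each KV head serves 8 // 4 = 2 query heads. A minimal numpy sketch:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv=4):
    """GQA sketch: KV heads are repeated to cover the query heads,
    shrinking the KV cache by n_heads / n_kv.
    q: (seq, n_heads, d); k, v: (seq, n_kv, d)."""
    group = n_heads // n_kv
    k = np.repeat(k, group, axis=1)      # (seq, n_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)        # softmax over key positions
    return np.einsum("hqk,khd->qhd", w, v)
```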
MLP3x
Uses widened MLP blocks in the base lineage.
parameters: {"multiplier":4}
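Combining the two MLP entries above (hidden width = multiplier × d_model, LeakyReLU with slope 0.5), a block sketch looks like:

```python
import numpy as np

def mlp(x, w_in, w_out, slope=0.5):
    """Widened MLP sketch: w_in projects d -> multiplier * d (4x above),
    LeakyReLU(slope=0.5) activation, w_out projects back to d."""
    h = x @ w_in                        # (..., 4 * d)
    h = np.where(h > 0, h, slope * h)   # LeakyReLU
    return h @ w_out
```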
Weight Averaging
EMA
parameters: {"decay":0.9965}
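The EMA update with the listed decay is the standard exponential moving average of the weights:

```python
def ema_update(avg, params, decay=0.9965):
    """EMA weight averaging: avg <- decay * avg + (1 - decay) * params,
    applied per tensor (dict-of-scalars here for illustration)."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in avg}
```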
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Other
other
QRescue/Hessian SDClip layer-group multipliers for GPTQ threshold selection.
parameters: null
other
Training-side JEPA-Lite predictor removed before serialization.
parameters: null
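Removing the training-only predictor before serialization keeps the JEPA-Lite auxiliary weights out of the scored artifact. A generic sketch (the `predictor.` prefix is a hypothetical naming choice, not taken from the record):

```python
def strip_predictor(state_dict, prefix="predictor."):
    """Drop training-only predictor tensors before saving the artifact
    (prefix is hypothetical); everything else is serialized as-is."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}
```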
other
LQER rank-4 residuals on selected projection matrices.
parameters: {"rank":4,"targets":["loop_mlp_proj","late_mlp_proj","attn_proj"]}
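LQER stores a low-rank factorization of the quantization error alongside the quantized matrix, so the effective weight is Q(W) + A·B. A sketch under that reading, using a truncated SVD of the residual:

```python
import numpy as np

def lqer_residual(w, quantize, rank=4):
    """LQER sketch: factor the quantization error W - Q(W) into a
    rank-`rank` SVD approximation kept next to Q(W)."""
    wq = quantize(w)
    u, s, vt = np.linalg.svd(w - wq, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (m, rank)
    b = vt[:rank]                     # (rank, n)
    return wq, a, b                   # apply weights as wq + a @ b
```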
Compression
lrzip
level: 9
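The per-group compression with roundtrip verification described below could look like the following command sequence (file names are hypothetical; requires the system `lrzip` binary):

```shell
# Level-9 lrzip compression of one weight group, then a lossless
# roundtrip check before the compressed artifact is kept.
lrzip -L 9 -o group0.bin.lrz group0.bin
lrzip -d -o group0.roundtrip group0.bin.lrz
cmp group0.bin group0.roundtrip && echo "lossless roundtrip OK"
```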
Test-Time Training
full TTT
parameters: {"protocol":"chunkwise_score_first_full_sgd","score_before_update":true,"no_rescore":true}
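The TTT protocol scores each chunk with the current weights before taking the SGD step on it, and never rescores earlier chunks. A toy numpy sketch of that loop (linear model and MSE are illustrative stand-ins):

```python
import numpy as np

def ttt_score_first(w, chunks, lr=0.1):
    """Chunkwise score-before-update TTT sketch: each chunk (x, y) is
    scored with the *current* weights, then one full-SGD step is taken;
    scored chunks are never revisited (no-rescore)."""
    total = 0.0
    for x, y in chunks:
        total += float(np.mean((x @ w - y) ** 2))   # score first
        grad = 2 * x.T @ (x @ w - y) / len(x)       # then update
        w = w - lr * grad
    return total / len(chunks), w
```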
Evaluation
sliding window eval
parameters: null
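Sliding-window evaluation scores each new stride of tokens with a bounded left context, counting each token once. A sketch with a pluggable scorer (`nll_fn` is a hypothetical interface; true bpb would divide total nats by the byte count rather than the token count):

```python
import math

def sliding_window_bits(tokens, nll_fn, window=8, stride=4):
    """Sliding-window eval sketch: advance `stride` tokens at a time,
    give the scorer up to `window` tokens of context, and count only
    the newly scored tokens."""
    total, counted = 0.0, 0
    for pos in range(0, len(tokens), stride):
        start = max(0, pos + stride - window)
        ctx = tokens[start:pos + stride]
        new = min(stride, len(tokens) - pos)
        total += nll_fn(ctx, n_score=new)   # nats for the last `new` tokens
        counted += new
    return total / counted / math.log(2)    # nats/token -> bits/token
```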
Novel Contributions
- Per-group artifact compression using system lrzip with lossless roundtrip verification
- Training-side JEPA-Lite with predictor removed before serialization
- LQER rank-4 residuals on selected projection matrices
- Legal score-first full-SGD TTT with chunkwise score-before-update and no-rescore protocol
- QRescue/Hessian SDClip multipliers for GPTQ threshold selection