PR #2006
Closed
Record LongCtx No-QV QK5.25 + AsymLogit — 1.05899 BPB 3-seed mean
by Elubrazione
val_bpb
1.0590
Architecture
Transformer
Optimizer
—
Artifact Size
15,992,777 bytes
Training Techniques
Architecture
SmearGate
BOS-fixed SmearGate used in sparse attention gating.
parameters: null
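The SmearGate entry is terse, so here is a minimal PyTorch sketch of one plausible reading: a sigmoid gate that blends each position with its predecessor, pinned to zero at the BOS position so nothing smears across the sequence start. The gating form and module shape are assumptions; the PR does not spell out the exact formulation.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Hypothetical BOS-fixed smear gate: each position mixes in a learned
    fraction of the previous position's representation, with the gate forced
    to zero at position 0 (BOS). The PR's exact formulation may differ."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        gate = torch.sigmoid(self.gate_proj(x))      # per-position gate in (0, 1)
        mask = torch.ones_like(gate)
        mask[:, 0, :] = 0.0                          # BOS-fixed: gate pinned to 0
        prev = torch.roll(x, shifts=1, dims=1)       # wrap-around row is masked out above
        return x + gate * mask * prev

x = torch.randn(2, 8, 64)
print(SmearGate(64)(x).shape)  # torch.Size([2, 8, 64])
```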
weight tying
No explicit weight tying is mentioned in the PR, so none is recorded here.
parameters: null
attention modifications
Sparse attention gating with skip gates and LQER correction.
parameters: null
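On the LQER correction mentioned above: LQER-style methods approximate a weight matrix's quantization error with a low-rank factor pair that is added back at inference. A generic sketch follows; the PR's rank, quantizer, and any activation weighting are unspecified, and `fake_quant` is purely illustrative.

```python
import torch

def lqer_correct(w: torch.Tensor, quantize, rank: int = 16):
    """Approximate the quantization error of `w` with a rank-`rank`
    factorization, so inference can use w_q + A @ B instead of w.
    `quantize` is any weight quantizer (hypothetical here)."""
    w_q = quantize(w)
    err = w - w_q                                  # quantization error
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]                     # (out, rank)
    b = vh[:rank, :]                               # (rank, in)
    return w_q, a, b

# Toy rounding quantizer, purely for demonstration.
def fake_quant(w, step=0.05):
    return torch.round(w / step) * step

w = torch.randn(256, 256)
w_q, a, b = lqer_correct(w, fake_quant, rank=32)
print((w - w_q).norm().item(), (w - (w_q + a @ b)).norm().item())  # error shrinks
```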
attention modifications
No-QV masking disables Q/V adaptation while keeping K/O/MLP adaptation active.
parameters: null
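A sketch of how No-QV masking could be wired: LoRA adapters attach only to the K and O projections and the MLP, while Q and V stay frozen and unadapted. The module names and plumbing below are assumptions; only the rank (80, from the TTT parameters further down) comes from the card.

```python
import torch.nn as nn

ADAPT = ("k_proj", "o_proj", "mlp_fc1", "mlp_fc2")   # adapted under No-QV (assumed names)
SKIP  = ("q_proj", "v_proj")                          # No-QV: left frozen, never adapted

class LoRALinear(nn.Module):
    """Plain LoRA wrapper: y = W x + B(A(x)), with B zero-initialised so the
    adapter starts as a zero delta."""
    def __init__(self, base: nn.Linear, rank: int = 80):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # frozen backbone
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)

    def forward(self, x):
        return self.base(x) + self.B(self.A(x))

def apply_no_qv(model: nn.Module, rank: int = 80):
    """Attach rank-80 adapters (the card's lora_rank) to K/O/MLP only."""
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if isinstance(child, nn.Linear) and name in ADAPT:
                setattr(parent, name, LoRALinear(child, rank))
            # names in SKIP stay untouched: Q/V receive no adapters
    return model
```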
Quantization
GPTQ
bits: null
scope: mixed-precision block weights
int7
bits: 7
scope: embeddings
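The card gives int7 for embeddings but no grouping or scales, so the round-trip below assumes symmetric per-row 7-bit quantization; the GPTQ side of the mix is not sketched since its block settings are unspecified.

```python
import torch

def quantize_embeddings_int7(emb: torch.Tensor):
    """Symmetric per-row 7-bit quantization of an embedding table.
    Signed int7 covers [-64, 63]; we use +/-63 for a symmetric grid.
    Per-row scaling is an assumption; the PR does not state the grouping."""
    qmax = 63
    scale = emb.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(emb / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

emb = torch.randn(8192, 512)   # 8192 rows, guessing at the SP8192 vocab size
q, scale = quantize_embeddings_int7(emb)
print((emb - dequantize(q, scale)).abs().max().item())  # small round-trip error
```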
Evaluation
long context eval
parameters: {"context_length":2560}
Sequence Length
sequence_length
train_length: 2560
eval_length: 2560
Test-Time Training
score-first TTT
parameters: {"masking":"No-QV","lora_rank":80,"local_lr_mult":0.75}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
Regularization
logit softcap
parameters: {"asymmetric_logit_rescale":true}
Compression
lrzip
level: null
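Per-group lrzip compression with an artifact-size check could be scripted as below. The group file layout is an assumption, the 16 MB budget is inferred from the listed artifact size, and level 9 is a guess since the card records level: null.

```python
import subprocess
from pathlib import Path

BUDGET = 16 * 1024 * 1024   # sub-16MB target implied by the 15,992,777-byte artifact

def compress_groups(group_files, out_dir="artifacts", level=9):
    """Compress each parameter-group file with lrzip and verify the total
    stays under budget. Uses lrzip's -L (level), -o (output), -f (force)."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    total = 0
    for f in map(Path, group_files):
        dst = out / (f.name + ".lrz")
        subprocess.run(["lrzip", "-f", "-L", str(level),
                        "-o", str(dst), str(f)], check=True)
        total += dst.stat().st_size
    assert total <= BUDGET, f"artifact too large: {total} bytes"
    return total
```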
Novel Contributions
- Long-context No-QV QK5.25 configuration
- Asymmetric logit rescale at evaluation time
- CaseOps/SP8192 model with byte-sidecar BPB accounting (see the sketch after this list)
- Legal score-first TTT with K/O/MLP adaptation active
- Size-aware mixed-precision GPTQ and AWQ-lite protected quantization
- Locally tuned TTT and logit-rescale constants
- Per-group lrzip compression with artifact-size checks
- Three-seed clean rerun with sub-16MB artifacts
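On the byte-sidecar BPB accounting referenced above: BPB divides total negative log-likelihood, converted to bits, by the raw byte count of the evaluated text. A sidecar holding each sequence's original byte length keeps that denominator independent of the tokenizer (here presumably an 8192-entry SentencePiece vocabulary, going by the SP8192 name). A minimal version with toy numbers:

```python
import math

def bpb(total_nll_nats: float, sidecar_byte_counts) -> float:
    """Bits per byte from summed NLL (in nats) and a byte sidecar: a list of
    each evaluated sequence's ORIGINAL byte length, stored alongside the
    token ids so tokenizer choice cannot distort the denominator."""
    total_bytes = sum(sidecar_byte_counts)
    return total_nll_nats / (math.log(2) * total_bytes)

# Toy numbers chosen to land near the headline BPB; not the PR's real data.
print(round(bpb(2.0e6, [1_000_000, 1_725_000]), 4))   # 1.0589
```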