PR #2007

open

Record: LongCtx No-QV QK5.25 + AsymLogit — 1.05899 BPB 3-seed mean

by ElubrazioneView on GitHub

val_bpb

1.0590

Architecture

Transformer

Optimizer

—

Artifact Size

15,992,777 bytes

Training Techniques

Architecture

SmearGate

BOS-fixed SmearGate used with sparse attention gating and skip gates.

parameters: {"bos_fixed":true}

weight tying

The PR does not explicitly state whether the CaseOps/SP8192 model uses tied (shared) embeddings; weight tying is not clearly mentioned.

parameters: null

Quantization

GPTQ

bits: null

scope: mixed precision

mixed int7/int8

bits: null

scope: embeddings and model weights
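
To make the int7/int8 distinction concrete, here is a minimal sketch of plain symmetric round-to-nearest quantization at a chosen bit width. This is not GPTQ and not necessarily what the PR does: the function names, the per-group scale, and the choice of which tensors get 7 or 8 bits are all assumptions made for illustration.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization of one group of weights.

    Hypothetical helper: the PR's GPTQ-based pipeline is not reproduced here.
    """
    qmax = 2 ** (bits - 1) - 1          # 63 for int7, 127 for int8
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp so rounding can never leave the signed range [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

Dropping a group from int8 to int7 halves the number of representable levels, which tends to make the codes more compressible at the cost of a coarser grid; a size-aware scheme would trade these off per group.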

Evaluation

long context eval

parameters: {"context_length":2560}

Test-Time Training

score-first TTT

parameters: {"masking":"No-QV","lora_rank":80,"local_lr_mult":0.75}
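
The "score-first" ordering can be sketched as a loop in which each chunk is scored before the model is allowed to adapt on it, so no token contributes to its own prediction. The sketch below assumes hypothetical `score_fn` and `adapt_fn` callables; it does not model the No-QV masking, the rank-80 LoRA adapters, or the local learning-rate multiplier listed above.

```python
def score_first_ttt(chunks, score_fn, adapt_fn):
    """Evaluate with test-time training, scoring each chunk before adapting.

    score_fn(chunk) -> summed loss over the chunk (hypothetical callable).
    adapt_fn(chunk) -> updates the adapter weights in place (hypothetical).
    Returns the mean loss per token.
    """
    total_loss = 0.0
    total_tokens = 0
    for chunk in chunks:
        # 1. Score with the weights as they are *before* seeing this chunk.
        total_loss += score_fn(chunk)
        total_tokens += len(chunk)
        # 2. Only then adapt on the chunk that was just scored.
        adapt_fn(chunk)
    if total_tokens == 0:
        raise ValueError("no tokens to score")
    return total_loss / total_tokens
```

Swapping the two steps inside the loop would leak each chunk into its own score, which is the case the "score-first" label rules out.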

Sequence Length

sequence_length

train_length: 2560

eval_length: 2560

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85,"min_lr":0.1}
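
One way to read these parameters is sketched below: the learning rate holds at its peak for the first 15% of training, then decays over the final 85% of steps down to 10% of the peak. The linear shape and the interpretation of `min_lr` as a fraction of the peak rate are assumptions; the PR summary gives only the two values.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85, min_lr=0.1):
    """Multiplier applied to the peak learning rate at a given step.

    Assumes a linear warmdown and that min_lr is a fraction of the peak.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmdown_steps = int(round(warmdown_frac * total_steps))
    warmdown_start = total_steps - warmdown_steps
    if warmdown_steps == 0 or step < warmdown_start:
        return 1.0                      # constant phase before the warmdown
    progress = min(1.0, (step - warmdown_start) / warmdown_steps)
    return 1.0 - progress * (1.0 - min_lr)
```

With 1000 total steps, this gives a multiplier of 1.0 through step 150, 0.55 at step 575, and 0.1 at the final step.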

Regularization

logit softcap

parameters: {"asymmetric_logit_rescale":true}
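
For orientation, a common form of logit softcapping squashes each logit through a scaled tanh so its magnitude never exceeds a cap. The summary does not define the asymmetric rescale, so the second function below is only one possible reading (separate scales for positive and negative logits); the cap value, both scale parameters, and both function names are hypothetical.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap] using tanh (cap is illustrative)."""
    return cap * math.tanh(logit / cap)


def asymmetric_rescale(logit, pos_scale=1.0, neg_scale=1.0):
    """Assumed form: scale positive and negative logits by different factors."""
    return logit * (pos_scale if logit >= 0 else neg_scale)
```

Near zero the softcap is close to the identity, so small logits pass through almost unchanged while large ones saturate toward the cap.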

Compression

lrzip

level: null

Other

other

CaseOps/SP8192 tokenization with byte-sidecar BPB accounting.

parameters: null
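
Because the tokenizer changes how many tokens cover a given text, bits per byte (BPB) is normalized by the original byte count rather than the token count. A minimal sketch of that accounting follows, assuming the "byte sidecar" is a per-token list of how many raw bytes each token covers; the function and argument names are hypothetical.

```python
import math


def bits_per_byte(token_nlls_nats, token_byte_counts):
    """Total negative log-likelihood in bits divided by total raw bytes.

    token_nlls_nats: per-token loss in nats (natural-log cross-entropy).
    token_byte_counts: bytes of original text covered by each token
                       (the assumed "sidecar").
    """
    if len(token_nlls_nats) != len(token_byte_counts):
        raise ValueError("one byte count is required per token")
    total_bytes = sum(token_byte_counts)
    if total_bytes <= 0:
        raise ValueError("total byte count must be positive")
    total_bits = sum(token_nlls_nats) / math.log(2)   # nats -> bits
    return total_bits / total_bytes
```

Tokens that cover zero raw bytes (for example, hypothetical case-marker tokens) still add their loss to the numerator, so they cannot be used to lower the reported figure.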

other

Per-group lrzip compression with artifact-size checks on every clean seed.

parameters: null

Novel Contributions

Long-context No-QV configuration with QK gain 5.25
Asymmetric logit rescale at evaluation time
Legal score-first TTT with No-QV masking
Size-aware mixed-precision quantization and AWQ-lite protected quantization
Three-seed clean rerun record with mean validation BPB 1.05899193
CaseOps/SP8192 tokenization with byte-sidecar BPB accounting
Per-group lrzip compression and artifact-size checks