PR #1925

open

Record candidate: CaseOps + Matrix-LR 0.028 + Phased TTT 3500

by simon-marcus

val_bpb

1.0611

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.90 MB

Training Techniques

Architecture

CaseOps

CaseOps SP8192 tokenizer and byte-sidecar path with lossless caps reserved tokenizer.

parameters: {"vocab_size":8192}

XSA

11-layer 512d XSA stack with U-Net skips, parallel decoder, depth recurrence, SparseAttnGate, BOS-fixed SmearGate, and LeakyReLU(0.5)^2 MLP.

parameters: {"layers":11,"dimensions":512}

U-Net skip connections

Uses U-Net style skip connections in the stack.

parameters: null

depth recurrence

Includes recurrent depth structure in the model.

parameters: null

SmearGate

BOS-fixed SmearGate is used in the attention/stack design.

parameters: null

LeakyReLU

Uses LeakyReLU(0.5)^2 MLP activation.

parameters: {"negative_slope":0.5}
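A minimal sketch of the activation, assuming LeakyReLU(0.5)^2 means the elementwise square of a LeakyReLU with negative slope 0.5. The function names and the NumPy formulation are illustrative only and are not taken from the PR.

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    """Square of LeakyReLU (assumed reading of LeakyReLU(0.5)^2)."""
    x = np.asarray(x, dtype=np.float64)
    # LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise.
    y = np.where(x >= 0, x, negative_slope * x)
    # Squaring makes the output non-negative on both sides of zero.
    return y * y

def mlp(x, w_in, w_out):
    """Hypothetical two-matrix MLP block using the squared activation."""
    return leaky_relu_sq(x @ w_in) @ w_out
```

Under this reading a negative pre-activation still contributes, but at a quarter of the magnitude a positive input of the same size would give (0.5 squared).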

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"backend_steps":5,"variant":"Polar-Express Newton-Schulz"}
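For orientation, the sketch below shows a 5-step Newton-Schulz orthogonalization of the kind Muon applies to matrix updates. It uses the fixed quintic coefficients from the commonly circulated Muon reference code, not the Polar-Express variant named above, whose coefficients are not given in this summary. The function name and test matrix are hypothetical.

```python
import numpy as np

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration.

    Assumption: fixed coefficients from the widely used Muon reference
    code; this is NOT the Polar-Express schedule used in the PR.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize by the Frobenius norm so all singular values are <= 1.
    x = g / (np.linalg.norm(g) + eps)
    # Work on the wide orientation so the Gram matrix is the small one.
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
update = newton_schulz(rng.standard_normal((16, 64)))
singular_values = np.linalg.svd(update, compute_uv=False)
```

With these coefficients the singular values do not converge exactly to 1; they settle into a band around 1, which is the usual trade-off for needing only a handful of steps.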

Quantization

GPTQ

bits: 6

scope: matrices

mixed int7/int8

bits: null

scope: embeddings and row gate

LQER

bits: null

scope: asymmetric rank-4 correction
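The rank-4 correction can be illustrated as a low-rank reconstruction of the quantization error. The sketch below swaps in plain round-to-nearest 6-bit row quantization for GPTQ and does not model the "asymmetric" aspect, which this summary does not define. All function names are hypothetical.

```python
import numpy as np

def quantize_rows(w, bits=6):
    """Symmetric round-to-nearest quantization per row (stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights

def low_rank_correction(w, w_q, rank=4):
    """Best rank-`rank` approximation of the quantization error w - w_q."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (rows, rank)
    b = vt[:rank]               # shape (rank, cols)
    return a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))
w_q = quantize_rows(w)
a, b = low_rank_correction(w, w_q)
err_plain = np.linalg.norm(w - w_q)
err_corrected = np.linalg.norm(w - (w_q + a @ b))
```

The corrected weight is `w_q + a @ b`. Storing `a` and `b` costs `rank * (rows + cols)` extra values per matrix, which is why a small rank such as 4 suits a tight artifact budget.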

Compression

pergroup lrzip + brotli

level: null

Test-Time Training

score-first TTT

parameters: {"phased":true,"prefix_docs":3500,"num_phases":3,"chunk_size":48,"lora_rank":80}
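The score-first ordering can be sketched with a toy model: each chunk is scored with the current state, and only afterwards is the model allowed to adapt on it. The PR adapts LoRA weights (rank 80) on the post-quant model in phases; the Laplace-smoothed count model below is a hypothetical stand-in that shows only the ordering, with the chunk size of 48 taken from the parameters above.

```python
import math
from collections import Counter

def score_first_ttt(tokens, chunk_size=48, vocab_size=256):
    """Return mean bits per token, scoring every chunk before adapting on it."""
    if not tokens:
        return 0.0
    counts = Counter()  # toy "model state" (stand-in for adapter weights)
    seen = 0
    total_bits = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score: the state is frozen, so no token in this chunk
        #    influences its own prediction.
        for t in chunk:
            p = (counts[t] + 1) / (seen + vocab_size)
            total_bits -= math.log2(p)
        # 2) Adapt: only now does the chunk update the state.
        counts.update(chunk)
        seen += len(chunk)
    return total_bits / len(tokens)

data = list(b"abab" * 48)
bits_adapted = score_first_ttt(data)                       # adapts between chunks
bits_frozen = score_first_ttt(data, chunk_size=len(data))  # one chunk, no adaptation
```

Because every chunk is scored before it is learned from, the reported loss never benefits from seeing its own targets; gains come only from earlier chunks.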

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85,"warmup_steps":20}
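One plausible reading of these parameters is a linear warmup over the first 20 steps followed by a linear decay to zero over the final 85% of training. The shape of both ramps, the function name, and the total step count are assumptions; the summary gives only the two numbers.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_frac=0.85):
    """Multiplier applied to the base learning rate at a given step.

    Assumption: linear warmup, flat plateau, then linear warmdown to zero.
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    warmdown_steps = round(total_steps * warmdown_frac)
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start or warmdown_steps == 0:
        return 1.0
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_steps.
    return max(total_steps - step, 0) / warmdown_steps

# Hypothetical 1000-step run with the matrix learning rate of 0.028.
matrix_lr_midway = 0.028 * lr_scale(575, total_steps=1000)
```

With `warmdown_frac` at 0.85, the plateau at the full learning rate is short: in the hypothetical 1000-step run it spans only steps 20 through 149.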

Regularization

weight decay

parameters: {"value":0.5}

Novel Contributions

Final push on the CaseOps/LQER/SmearGate stack while keeping the #1855 architecture intact.
Raised MATRIX_LR from 0.026 to 0.028.
Increased PHASED_TTT_PREFIX_DOCS to 3500 to use more of the eval budget.
Score-first phased TTT on the post-quant model, evaluating each chunk before adaptation.
Validated a 3-seed record-candidate run with mean val_bpb 1.06109 under the 16 MB artifact limit.