PR #1915 (open)
Add SP8192 CaseOps + legal per-document TTT record (1.0650 BPB)
by AidenGeunGeun
val_bpb
1.0650
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,922,155 bytes
Training Techniques
Architecture
XSA
Frontier transformer architecture component used in the inherited stack.
parameters: null
SparseAttnGate
Sparse attention gating used in the inherited stack.
parameters: null
SmearGate
BOS-fixed smear gate used in the inherited stack.
parameters: null
depth recurrence
Recurrent transformer-style depth recurrence in the inherited stack.
parameters: null
Gated Attention
Attention gating with tuned gate scale and windowing.
parameters: {"gate_scale":0.5,"gate_window":12}
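As a rough illustration of what these two parameters control (the real component lives in the inherited stack; all names and shapes below are assumptions), attention gating with a scaled sigmoid and a local window can be sketched as:

```python
import math

GATE_SCALE = 0.5   # gate_scale from the parameters above
GATE_WINDOW = 12   # gate_window from the parameters above

def window_mask(seq_len: int, window: int = GATE_WINDOW) -> list[list[bool]]:
    """Causal attention mask restricted to the most recent `window` positions."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

def gated_output(attn_out: float, gate_logit: float) -> float:
    """Sigmoid gate on an attention output; the logit is scaled by GATE_SCALE."""
    g = 1.0 / (1.0 + math.exp(-GATE_SCALE * gate_logit))
    return g * attn_out
```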
Quantization
GPTQ
bits: 6
scope: model weights
GPTQ
bits: 7
scope: embeddings
GPTQ
bits: 4
scope: top-k factors
GPTQ-lite
bits: null
scope: probe only
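The listing above assigns a different bit width to each tensor group. A toy round-to-nearest sketch of mixed-precision fake-quantization (not GPTQ's Hessian-aware weight updates; the scope names are assumptions):

```python
# Hypothetical per-scope bit widths mirroring the listing above.
BITS = {"model_weights": 6, "embeddings": 7, "topk_factors": 4}

def fake_quant(x: float, bits: int, scale: float = 1.0) -> float:
    """Symmetric round-to-nearest quantize/dequantize at `bits` bits (toy stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale * qmax)))
    return q * scale / qmax
```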
Compression
lzma
level: null
Brotli
level: null
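Per-group compression of the quantized tensors can be sketched with the standard-library LZMA codec (Brotli is analogous via the third-party `brotli` package; the preset is an assumption, since the listing leaves the levels unspecified):

```python
import lzma

def pack_groups(groups: list[bytes]) -> list[bytes]:
    """Compress each parameter group independently so groups decode in isolation."""
    return [lzma.compress(g, preset=9) for g in groups]

def unpack_groups(blobs: list[bytes]) -> list[bytes]:
    return [lzma.decompress(b) for b in blobs]
```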
Optimizer
Muon
weight_decay: 0.5
momentum: 0.9
other_params: {"beta2":0.99,"warmup_steps":20,"warmdown_frac":0.85}
AdamW
weight_decay: null
momentum: null
other_params: {"min_lr":0.1,"matrix_lr":0.026}
Test-Time Training
LoRA TTT
parameters: {"rank":80,"warm_start_a":0,"score_before_update":true,"per_document_reset":true,"global_sgd":false}
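The "legal" part of this recipe is the control flow: each document is scored before any adapter update, the LoRA adapter is re-initialized per document (with A warm-started at zero), and the base weights never receive a global SGD step. A minimal pure-Python sketch of that loop, with a toy scorer standing in for the model:

```python
def eval_bits(params: dict, doc: str) -> float:
    """Toy scorer: stands in for the model's bit cost of `doc` under `params`."""
    return len(doc) / (1.0 + params["lora_a"])

def ttt_score(docs: list[str]) -> float:
    """Score-first, per-document-reset LoRA TTT with no global SGD."""
    total = 0.0
    for doc in docs:
        params = {"lora_a": 0.0}          # per_document_reset + warm_start_a=0
        total += eval_bits(params, doc)   # score_before_update: count bits first
        params["lora_a"] += 1.0           # local adapter step; discarded next document
    return total
```

Because the adapter is discarded between documents, no information leaks forward: the second identical document costs exactly as many bits as the first.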
Sequence Length
sequence_length
train_length: 48
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"warmup_steps":20}
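A sketch of the trapezoidal schedule these parameters describe (the peak value and exact step arithmetic are assumptions; only `warmup_steps` and `warmdown_frac` come from the listing):

```python
def lr_at(step: int, total_steps: int, peak: float,
          warmup_steps: int = 20, warmdown_frac: float = 0.85) -> float:
    """Linear warmup, flat plateau, then linear warmdown over the last 85% of steps."""
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    down_start = int(total_steps * (1.0 - warmdown_frac))
    if step < down_start:
        return peak
    return peak * (total_steps - step) / (total_steps - down_start)
```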
Regularization
weight decay
parameters: {"value":0.5}
Other
other
SP8192 CaseOps tokenizer with byte sidecar and exact byte-denominator accounting.
parameters: null
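"Exact byte-denominator accounting" means val_bpb divides the model's total bit cost by the raw byte count of the evaluated text, independent of how the tokenizer (or its byte sidecar) split it. A minimal sketch, with the loss values as placeholder assumptions:

```python
import math

def val_bpb(nll_nats: list[float], docs: list[bytes]) -> float:
    """Bits-per-byte: total negative log-likelihood in bits over exact raw bytes."""
    total_bits = sum(nll_nats) / math.log(2.0)
    total_bytes = sum(len(d) for d in docs)
    return total_bits / total_bytes
```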
other
Physical length bucketing used only as a batching optimization across independent documents.
parameters: {"bucketing":"global_length"}
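Because the documents are independent, grouping them by physical length changes only the batch shape, never the probability model. A sketch of global length bucketing (the bucket width is an assumption):

```python
from collections import defaultdict

def length_buckets(docs: list[str], width: int = 64) -> dict[int, list[str]]:
    """Group independent documents by rounded length; a batching optimization only."""
    buckets: dict[int, list[str]] = defaultdict(list)
    for d in docs:
        buckets[len(d) // width].append(d)
    return dict(buckets)
```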
Novel Contributions
- SP8192 CaseOps tokenizer with byte sidecar and exact byte accounting
- Legal per-document score-first LoRA TTT with no global SGD
- TTT_WARM_START_A=0 and per-document LoRA reset
- Stock top-k LQER after GPTQ
- Per-group lrzip/Brotli compressed GPTQ
- Self-extracting train_gpt.py wrapper for packaging within the artifact cap
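The last bullet's packaging trick can be sketched as a script that carries its own compressed payload and inflates it at run time (the payload and names below are illustrative, not the actual train_gpt.py contents):

```python
import base64
import zlib

# Illustrative payload; the real wrapper would embed the compressed model artifact.
PAYLOAD = base64.b85encode(zlib.compress(b"model bytes go here"))

def extract(payload: bytes) -> bytes:
    """Inflate the embedded payload back to its original bytes."""
    return zlib.decompress(base64.b85decode(payload))
```

Base85 keeps the embedded blob printable inside a .py file at ~25% overhead, which is why it suits a self-extracting wrapper under a byte-counted artifact cap.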