PR #1915 (open)
Add SP8192 CaseOps + legal per-document TTT record (1.0650 BPB)
by AidenGeunGeun
val_bpb
1.0650
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,922,155 bytes
Training Techniques
Architecture
XSA
Frontier transformer architecture component used in the inherited stack.
parameters: null
SparseAttnGate
Sparse attention gating used in the inherited stack.
parameters: null
SmearGate
BOS-fixed smear gate used in the inherited stack.
parameters: null
depth recurrence
Recurrent transformer-style depth recurrence in the inherited stack.
parameters: null
Gated Attention
Attention gating with tuned gate scale and windowing.
parameters: {"gate_scale":0.5,"gate_window":12}
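As a rough illustration of what these two parameters control (the real component lives in the inherited stack; all names and shapes below are assumptions), attention gating with a scaled sigmoid and a local window can be sketched as:

```python
import math

GATE_SCALE = 0.5   # gate_scale from the parameters above
GATE_WINDOW = 12   # gate_window from the parameters above

def window_mask(seq_len: int, window: int = GATE_WINDOW) -> list[list[bool]]:
    """Causal attention mask restricted to the most recent `window` positions."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

def gated_output(attn_out: float, gate_logit: float) -> float:
    """Sigmoid gate on an attention output; the logit is scaled by GATE_SCALE."""
    g = 1.0 / (1.0 + math.exp(-GATE_SCALE * gate_logit))
    return g * attn_out
```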
Quantization
GPTQ
bits: 6
scope: model weights
GPTQ
bits: 7
scope: embeddings
GPTQ
bits: 4
scope: top-k factors
GPTQ-lite
bits: null
scope: probe only
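The listing above assigns a different bit width to each tensor group. A toy round-to-nearest sketch of mixed-precision fake-quantization (not GPTQ's Hessian-aware weight updates; the scope names are assumptions):

```python
# Hypothetical per-scope bit widths mirroring the listing above.
BITS = {"model_weights": 6, "embeddings": 7, "topk_factors": 4}

def fake_quant(x: float, bits: int, scale: float = 1.0) -> float:
    """Symmetric round-to-nearest quantize/dequantize at `bits` bits (toy stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale * qmax)))
    return q * scale / qmax
```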
Compression
lzma
level: null
Brotli
level: null
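Per-group compression of the quantized tensors can be sketched with the standard-library LZMA codec (Brotli is analogous via the third-party `brotli` package; the preset is an assumption, since the listing leaves the levels unspecified):

```python
import lzma

def pack_groups(groups: list[bytes]) -> list[bytes]:
    """Compress each parameter group independently so groups decode in isolation."""
    return [lzma.compress(g, preset=9) for g in groups]

def unpack_groups(blobs: list[bytes]) -> list[bytes]:
    return [lzma.decompress(b) for b in blobs]
```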
Optimizer
Muon
weight_decay: 0.5
momentum: 0.9
other_params: {"beta2":0.99,"warmup_steps":20,"warmdown_frac":0.85}
AdamW
weight_decay: null
momentum: null
other_params: {"min_lr":0.1,"matrix_lr":0.026}
Test-Time Training
LoRA TTT
parameters: {"rank":80,"warm_start_a":0,"score_before_update":true,"per_document_reset":true,"global_sgd":false}
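The "legal" part of this recipe is the control flow: each document is scored before any adapter update, the LoRA adapter is re-initialized per document (with A warm-started at zero), and the base weights never receive a global SGD step. A minimal pure-Python sketch of that loop, with a toy scorer standing in for the model:

```python
def eval_bits(params: dict, doc: str) -> float:
    """Toy scorer: stands in for the model's bit cost of `doc` under `params`."""
    return len(doc) / (1.0 + params["lora_a"])

def ttt_score(docs: list[str]) -> float:
    """Score-first, per-document-reset LoRA TTT with no global SGD."""
    total = 0.0
    for doc in docs:
        params = {"lora_a": 0.0}          # per_document_reset + warm_start_a=0
        total += eval_bits(params, doc)   # score_before_update: count bits first
        params["lora_a"] += 1.0           # local adapter step; discarded next document
    return total
```

Because the adapter is discarded between documents, no information leaks forward: the second identical document costs exactly as many bits as the first.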
Sequence Length
sequence_length
train_length: 48
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"warmup_steps":20}
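A sketch of the trapezoidal schedule these parameters describe (the peak value and exact step arithmetic are assumptions; only `warmup_steps` and `warmdown_frac` come from the listing):

```python
def lr_at(step: int, total_steps: int, peak: float,
          warmup_steps: int = 20, warmdown_frac: float = 0.85) -> float:
    """Linear warmup, flat plateau, then linear warmdown over the last 85% of steps."""
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    down_start = int(total_steps * (1.0 - warmdown_frac))
    if step < down_start:
        return peak
    return peak * (total_steps - step) / (total_steps - down_start)
```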
Regularization
weight decay
parameters: {"value":0.5}
Other
other
SP8192 CaseOps tokenizer with byte sidecar and exact byte-denominator accounting.
parameters: null
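"Exact byte-denominator accounting" means val_bpb divides the model's total bit cost by the raw byte count of the evaluated text, independent of how the tokenizer (or its byte sidecar) split it. A minimal sketch, with the loss values as placeholder assumptions:

```python
import math

def val_bpb(nll_nats: list[float], docs: list[bytes]) -> float:
    """Bits-per-byte: total negative log-likelihood in bits over exact raw bytes."""
    total_bits = sum(nll_nats) / math.log(2.0)
    total_bytes = sum(len(d) for d in docs)
    return total_bits / total_bytes
```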
other
Physical length bucketing used only as a batching optimization across independent documents.
parameters: {"bucketing":"global_length"}
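Because the documents are independent, grouping them by physical length changes only the batch shape, never the probability model. A sketch of global length bucketing (the bucket width is an assumption):

```python
from collections import defaultdict

def length_buckets(docs: list[str], width: int = 64) -> dict[int, list[str]]:
    """Group independent documents by rounded length; a batching optimization only."""
    buckets: dict[int, list[str]] = defaultdict(list)
    for d in docs:
        buckets[len(d) // width].append(d)
    return dict(buckets)
```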
Novel Contributions
- SP8192 CaseOps tokenizer with byte sidecar and exact byte accounting
- Legal per-document score-first LoRA TTT with no global SGD
- TTT_WARM_START_A=0 and per-document LoRA reset
- Stock top-k LQER after GPTQ
- Per-group lrzip/Brotli compressed GPTQ
- Self-extracting train_gpt.py wrapper for packaging within the artifact cap
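The last bullet's packaging trick can be sketched as a script that carries its own compressed payload and inflates it at run time (the payload and names below are illustrative, not the actual train_gpt.py contents):

```python
import base64
import zlib

# Illustrative payload; the real wrapper would embed the compressed model artifact.
PAYLOAD = base64.b85encode(zlib.compress(b"model bytes go here"))

def extract(payload: bytes) -> bytes:
    """Inflate the embedded payload back to its original bytes."""
    return zlib.decompress(base64.b85decode(payload))
```

Base85 keeps the embedded blob printable inside a .py file at ~25% overhead, which is why it suits a self-extracting wrapper under a byte-counted artifact cap.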