PR #2018 (open)

Record: Gated XSA + LQER top-1 + strict in-timer n-gram TTT (val_bpb: 1.046)

by simon-marcus
val_bpb: 1.0462
Architecture: Transformer
Optimizer:
Artifact Size: 15,996,490 bytes

Training Techniques

Architecture
  • XSA: Gated XSA with a learned per-head scalar gate multiplying the XSA subtraction coefficient via tanh(xsa_alpha). parameters: {"gated":true}
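The gating itself is simple to sketch. Only the per-head scalar and the tanh(xsa_alpha) form are from this PR; XSA's internals, the base coefficient, and all names below are assumptions:

```python
import math

def gated_subtraction_coeff(base_coeff, xsa_alpha):
    # xsa_alpha: one learned scalar per head. tanh keeps each gate in
    # (-1, 1), so a head can scale, zero out, or flip the sign of its
    # subtraction term. The per-head layout is an assumption.
    return [base_coeff * math.tanh(a) for a in xsa_alpha]

# Hypothetical usage: 4 heads, base subtraction coefficient 1.0.
coeff = gated_subtraction_coeff(1.0, [0.0, 0.5, -0.5, 3.0])
```

A zero-initialized xsa_alpha would start every head with the subtraction disabled and let training open the gates gradually.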
  • LeakyReLU: uses LeakyReLU with slope 0.3 in the base stack. parameters: {"slope":0.3}
Quantization
  • GPTQ-lite: bits: null; scope: model artifact
Test-Time Training
  • score-first TTT: parameters: {"phased":true,"phases":1,"prefix_docs":1000}
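A sketch of one plausible phased, score-first loop. Only phases=1 and prefix_docs=1000 come from the parameters; the selection rule and every name below are assumptions:

```python
def score_first_ttt(docs, score_fn, adapt_fn, prefix_docs=1000, phases=1):
    # Assumed scheme: score candidate documents first with a cheap
    # score_fn, then run test-time training only on the top-scoring
    # prefix of them, once per phase. This is what makes the phased
    # variant cheaper than adapting on the full stream.
    ranked = sorted(docs, key=score_fn, reverse=True)
    prefix = ranked[:prefix_docs]
    for _ in range(phases):
        adapt_fn(prefix)
    return prefix

# Hypothetical usage with toy "documents" and a recording adapt_fn.
seen = []
prefix = score_first_ttt(list(range(10)), score_fn=lambda d: d,
                         adapt_fn=seen.append, prefix_docs=3)
```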
  • LoRA TTT: parameters: {"rank":80,"local_lr_mult":0.75,"mask":"no_qv"}
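The mask and learning-rate parameters suggest which modules get adapters and at what rate. A sketch under the assumption that "no_qv" means skipping the query and value projections (the module-name suffixes are hypothetical):

```python
RANK = 80            # LoRA rank, from the PR parameters
LOCAL_LR_MULT = 0.75 # local learning-rate multiplier, from the PR parameters

def adapts(param_name):
    # Assumed reading of mask="no_qv": attach LoRA adapters everywhere
    # except the query and value projections.
    return not (param_name.endswith("q_proj") or param_name.endswith("v_proj"))

def local_lr(base_lr, param_name):
    # TTT updates run at a reduced local learning rate; masked-out
    # modules get no update at all.
    return base_lr * LOCAL_LR_MULT if adapts(param_name) else 0.0
```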
Evaluation
  • n-gram tilt: parameters: {"precompute_inside_timer":true}
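A sketch of what strict in-timer accounting means here, with a hypothetical count-based tilt (the actual tilt formula is not given; only precompute_inside_timer is from the parameters):

```python
import math
import time
from collections import Counter

def ngram_counts(tokens, n=3):
    # Hint table: counts of every n-gram in the (assumed) prefix stream.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tilted_logprob(base_logprob, context, token, counts, n=3, lam=0.1):
    # Hypothetical tilt: boost a candidate token by lam * log1p(count)
    # of the n-gram it would complete.
    key = tuple(context[-(n - 1):]) + (token,)
    return base_logprob + lam * math.log1p(counts[key])

# Strict in-timer accounting: the hint table is built INSIDE the timed
# region, so precompute cost is charged to evaluation time.
start = time.perf_counter()
counts = ngram_counts(list(b"abcabcab"), n=3)
lp = tilted_logprob(-2.0, list(b"abcab"), ord("c"), counts)
elapsed = time.perf_counter() - start
```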
Sequence Length
  • sequence_length: train_length: null; eval_length: 2560
Regularization
  • logit softcap: parameters: {"asym_logit_rescale":true}
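Logit softcapping conventionally means cap * tanh(logit / cap). A sketch of one plausible reading of asym_logit_rescale, with made-up cap values that differ on the positive and negative sides:

```python
import math

def softcap(logit, cap_pos=30.0, cap_neg=20.0):
    # Standard softcap bounds logits smoothly at +/- cap while staying
    # near-identity for small values. The asymmetric variant (assumed
    # meaning of asym_logit_rescale) uses a different cap per side;
    # the 30/20 values here are illustrative, not from the PR.
    cap = cap_pos if logit >= 0 else cap_neg
    return cap * math.tanh(logit / cap)
```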
Other
  • LQER top-1: keeps only the best LQER correction tensor to reduce artifact size. parameters: {"lqer_rank":4,"top_k":1,"asymmetric":true}
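The top-1 selection can be sketched as follows. LQER fits a low-rank (here rank-4) approximation of the quantization error W - W_q per layer; the ranking criterion (Frobenius norm) and the dict layout below are assumptions, with only top_k=1 taken from the parameters:

```python
def frob_norm_sq(tensor):
    # Squared Frobenius norm of a 2-D list-of-lists tensor.
    return sum(x * x for row in tensor for x in row)

def keep_top_k(corrections, k=1):
    # top_k=1: of the per-layer LQER correction tensors, keep only the
    # one with the largest (assumed) norm-based benefit and drop the
    # rest, shrinking the shipped artifact.
    ranked = sorted(corrections.items(),
                    key=lambda kv: frob_norm_sq(kv[1]), reverse=True)
    return dict(ranked[:k])

# Hypothetical usage: two layers, layer1 has the larger correction.
corrections = {"layer0": [[0.1, 0.0], [0.0, 0.1]],
               "layer1": [[1.0, 0.0], [0.0, 1.0]]}
kept = keep_top_k(corrections, k=1)
```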
  • CaseOps SP8192: lossless-caps tokenizer and byte-sidecar dataset setup. parameters: {"vocab_size":8192}
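One plausible reading of "lossless-caps" with a byte sidecar, sketched for ASCII text (the actual CaseOps scheme is not described in the summary):

```python
def split_caps(text):
    # Assumed scheme: the tokenizer sees lowercased text, while a byte
    # sidecar records which characters were uppercase, so the original
    # string is exactly recoverable. Safe for ASCII; general Unicode
    # casing would need more care.
    return text.lower(), bytes(c.isupper() for c in text)

def restore_caps(lowered, caps_sidecar):
    # Reapply capitalization from the sidecar, byte by byte.
    return "".join(c.upper() if up else c
                   for c, up in zip(lowered, caps_sidecar))

# Hypothetical round trip.
lowered, sidecar = split_caps("Hello World")
restored = restore_caps(lowered, sidecar)
```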

Novel Contributions

  • Gated XSA with a learned per-head gate on the XSA subtraction coefficient
  • LQER top-1 to reduce artifact size while preserving the best correction tensor
  • Strict in-timer n-gram tilt with hint precompute counted inside evaluation time
  • Cheaper phased score-first TTT with a 1,000-document prefix
  • CaseOps SP8192 lossless-caps tokenizer and byte-sidecar setup