PR #2123

closed

Record: CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439

by vaibhavmishra1

val_bpb

1.0593

Architecture

Transformer

Optimizer

—

Artifact Size

15,991,624 B

Training Techniques

Architecture

Gated XSA

Gated XSA transformer stack with zero-init per-head gates, looping layers 3-5, and a parallel final lane from layer 8.

parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}

GQA

Grouped-query attention with 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
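With 8 query heads and 4 KV heads, each key/value head is shared by two query heads. The sketch below shows only that head-to-head mapping. The function name `kv_head_for` is hypothetical, and the contiguous grouping (heads 0-1 share KV head 0, and so on) is an assumed convention, not something this record states.

```python
def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index a given query head reads from.

    Assumes contiguous grouping; the record does not say how heads are grouped.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 8 // 4 = 2 query heads per KV head
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Under this layout the K and V projections produce 4 heads instead of 8, which halves the key/value projection parameters relative to full multi-head attention.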

SmearGate

SmearGate enabled in the attention stack.

parameters: null

Sparse Attention Gate

Sparse attention gating enabled.

parameters: null

Partial RoPE

Partial rotary positional embedding used.

parameters: null
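Partial RoPE rotates only some of the dimensions of each head and leaves the rest position-independent. The record gives no parameters, so the sketch below is illustrative only: the head size of 64 follows from dim 512 and 8 heads, but the rotated width of 16, the base of 10000, the interleaved pairing, and the name `partial_rope` are all assumptions.

```python
import math


def partial_rope(vec, pos, rot_dims, base=10000.0):
    """Rotate the first `rot_dims` entries of one head vector; copy the rest.

    Pairs are taken as (0, 1), (2, 3), ... which is an assumed layout.
    """
    if rot_dims % 2 != 0 or rot_dims > len(vec):
        raise ValueError("rot_dims must be even and no larger than the head size")
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


head = [1.0] * 64                      # 512 dim / 8 heads
rotated = partial_rope(head, pos=5, rot_dims=16)
```

The entries from index 16 onward are identical in `head` and `rotated`, so those dimensions carry no positional signal.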

n-gram tilt

Token-only, prefix-only in-timer n-gram tilt helper inlined into training/evaluation code.

parameters: {"enabled":1,"token_order":16,"token_threshold":0.8,"token_boost":2.625}
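One plausible reading of these parameters is sketched below: count which token followed the current context earlier in the same prefix, and if one continuation accounts for at least `token_threshold` of those occurrences, add `token_boost` to its logit. This reading is an assumption; the record does not describe the helper's actual logic, and the names `build_ngram_table` and `tilt_logits` are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_table(prefix, order):
    """Count next tokens after each (order - 1)-token context seen in the prefix."""
    table = defaultdict(Counter)
    ctx_len = order - 1
    for i in range(ctx_len, len(prefix)):
        table[tuple(prefix[i - ctx_len:i])][prefix[i]] += 1
    return table


def tilt_logits(logits, prefix, order=16, threshold=0.8, boost=2.625):
    """Return a copy of `logits`, boosted for a confident n-gram continuation.

    Uses only tokens already in `prefix` (assumed meaning of "prefix-only").
    """
    tilted = list(logits)
    ctx_len = order - 1
    if len(prefix) < ctx_len:
        return tilted
    counts = build_ngram_table(prefix, order).get(tuple(prefix[-ctx_len:]))
    if not counts:
        return tilted
    token, hits = counts.most_common(1)[0]
    if hits / sum(counts.values()) >= threshold:
        tilted[token] += boost
    return tilted


# Toy run with a short order so the repeat is visible.
demo = tilt_logits([0.0] * 5, prefix=[1, 2, 3, 1, 2, 3, 1, 2], order=3)
print(demo)  # [0.0, 0.0, 0.0, 2.625, 0.0]
```

Rebuilding the table on every call keeps the sketch short; a practical version would update the counts incrementally as tokens arrive.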

Regularization

logit softcap

parameters: null
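Logit softcapping is commonly written as a scaled tanh that bounds logits smoothly to a fixed range. The record lists no cap value, so the 30.0 below is purely illustrative, and the function is a generic sketch of the technique, not this submission's code.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit into [-cap, cap]; near zero it is close to the identity."""
    return cap * math.tanh(logit / cap)


small = softcap(0.5)     # stays very close to 0.5
large = softcap(1.0e6)   # saturates at the cap
```

Small logits pass through almost unchanged, while very large ones are bounded, which limits how extreme the softmax inputs can become.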

Weight Averaging

EMA

parameters: null
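An exponential moving average keeps a slowly updated copy of the weights alongside the trained ones. The sketch below shows the standard update on a flat list of floats. The decay of 0.999 and the name `ema_update` are assumptions, since the record gives no parameters.

```python
def ema_update(ema, weights, decay=0.999):
    """Return the new averaged weights after one training step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Toy loop: the live weights jump around, the average moves slowly.
ema = [0.0, 0.0]
for step_weights in ([1.0, -1.0], [1.2, -0.8], [0.9, -1.1]):
    ema = ema_update(ema, step_weights, decay=0.9)
```

In this record the averaged weights are taken before quantization, so the quantizer would see `ema` and not the last raw training step.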

Quantization

GPTQ/LQER

bits: 6

scope: all

mixed int6/int7

bits: null

scope: embeddings and block weights
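As a rough illustration of 6-bit storage, the sketch below applies plain asymmetric min/max quantization to groups of 32 weights. It is not GPTQ and it omits LQER's low-rank error correction. The group size of 32 assumes that the "asym group 32" setting named elsewhere in this record refers to per-group asymmetric quantization, and all function names are hypothetical.

```python
import math


def quantize_group(values, bits=6):
    """Map floats to integer codes in [0, 2**bits - 1] with a per-group offset."""
    levels = (1 << bits) - 1                      # 63 for 6 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from codes, scale, and offset."""
    return [lo + c * scale for c in codes]


def quantize_weights(weights, bits=6, group_size=32):
    """Quantize a flat weight list one group at a time."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


weights = [math.sin(i) for i in range(64)]
groups = quantize_weights(weights)                # two groups of 32
restored = [v for g in groups for v in dequantize_group(*g)]
```

Each group stores its own scale and offset, so the worst-case rounding error within a group is half of that group's scale.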

Test-Time Training

score-first TTT

parameters: {"rank":80,"learning_rate":0.0001,"chunk_size":48,"prefix_docs":1000}
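"Score-first" is read here as: each chunk is scored with the current weights before the model adapts on it, so no token is evaluated by a model that has already trained on that token. That reading, the function name, and both callbacks are assumptions. The sketch shows only the ordering, not the rank-80 adapter or the optimizer.

```python
def score_first_ttt(tokens, chunk_size, score_chunk, adapt_on_chunk):
    """Walk the tokens in chunks; score each chunk, then adapt on it.

    `score_chunk` returns a loss for a chunk under the current weights.
    `adapt_on_chunk` performs the test-time update (hypothetical callback).
    """
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_chunk(chunk)   # evaluate first
        adapt_on_chunk(chunk)              # update only after scoring
    return total_loss


# Toy callbacks that just record the order of operations.
log = []
total = score_first_ttt(
    tokens=list(range(100)),
    chunk_size=48,
    score_chunk=lambda c: log.append(("score", len(c))) or 1.0,
    adapt_on_chunk=lambda c: log.append(("adapt", len(c))),
)
```

With 100 tokens and a chunk size of 48, the loop visits chunks of 48, 48, and 4 tokens, and every "score" entry in `log` precedes the matching "adapt" entry.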

Sequence Length

sequence_length

train_length: null

eval_length: 2560

Novel Contributions

Gated XSA + CaseOps integrated stack
In-timer token-only n-gram tilt
LQER/GPTQ retune with rank 2, asym group 32, top-k 4
CaseOps SP8192 tokenizer and byte sidecar scoring
Score-first one-phase TTT with 1000 prefix docs
EMA before quantization