PR #2124

open

Record: CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439

by vaibhavmishra1

val_bpb

1.0593

Architecture

Transformer

Optimizer

—

Artifact Size

15,991,624 B

Training Techniques

Architecture

XSA

Gated XSA transformer stack with zero-init per-head gates, looping layers, and a parallel final lane.

parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}

Gated Attention

Gated attention applied across all layers.

parameters: null

SmearGate

SmearGate enabled as part of the attention stack.

parameters: null

Partial RoPE

Partial RoPE used in the transformer.

parameters: null
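
The card does not say what fraction of each head is rotated, so the following is only a minimal sketch of the general idea behind partial RoPE: rotary position encoding is applied to the first few dimensions of a head vector and the remaining dimensions pass through unchanged. The function name `partial_rope`, the value `rot_dims=16`, the base of 10000, and the adjacent-pair layout are all assumptions for illustration; the head size of 64 follows from dim 512 and 8 heads.

```python
import math

def partial_rope(x, pos, rot_dims, base=10000.0):
    """Rotate the first rot_dims entries of one head vector; leave the rest as-is.

    rot_dims and base are hypothetical; the PR card does not state them.
    """
    assert rot_dims % 2 == 0 and rot_dims <= len(x)
    out = list(x)
    for i in range(0, rot_dims, 2):
        # Standard RoPE frequency for the pair (i, i + 1).
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

head_dim = 512 // 8                      # 64, from the card's dim and heads
vec = [float(i + 1) for i in range(head_dim)]
rotated = partial_rope(vec, pos=5, rot_dims=16)
```

Because each pair is rotated, the vector norm is preserved, and positions only influence the rotated slice.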

GQA

Grouped-query attention with 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
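
As an illustration of what 8 query heads sharing 4 KV heads means in practice, the sketch below computes which KV head each query head reads from and how wide the K/V projections become. The helper name `kv_head_for` is hypothetical, and contiguous grouping is the common convention rather than something the card confirms.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Map a query head index to the KV head it shares (contiguous groups)."""
    assert heads % kv_heads == 0
    group_size = heads // kv_heads       # 2 query heads per KV head here
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
# mapping == [0, 0, 1, 1, 2, 2, 3, 3]

head_dim = 512 // 8                      # 64
q_width = 8 * head_dim                   # 512 columns for the Q projection
kv_width = 4 * head_dim                  # 256 columns each for K and V
```

With these settings the K and V projections are half as wide as the Q projection, which reduces both parameters and KV-cache size.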

Weight Averaging

EMA

parameters: null
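
The card gives no EMA parameters, so this is a generic sketch of exponential-moving-average weight averaging, with a hypothetical `decay` value and function name. The contributions list notes that EMA is applied before quantization, which here would mean quantizing the `shadow` weights rather than the raw ones.

```python
def ema_update(shadow, params, decay):
    """Blend the running average toward the current parameters."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

shadow = [1.0, -2.0]                     # running average of the weights
params = [0.0, 0.0]                      # weights after the latest optimizer step
shadow = ema_update(shadow, params, decay=0.9)   # decay is an assumed value
```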

Quantization

GPTQ

bits: 6

scope: all

mixed int6/int7

bits: null

scope: embeddings and block weights
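
To make the int6/int7 bit widths concrete, the sketch below shows plain symmetric round-to-nearest quantization of one weight group. This is not GPTQ, which additionally compensates rounding error using second-order statistics, and the function names and grouping are assumptions rather than the PR's code.

```python
def quantize_group(weights, bits):
    """Symmetric round-to-nearest quantization of one group (not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1           # 31 for int6, 63 for int7
    scale = max(abs(w) for w in weights) / qmax or 1.0   # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]

group = [0.5, -1.0, 0.25, 0.0]
q6, s6 = quantize_group(group, bits=6)
q7, s7 = quantize_group(group, bits=7)
approx6 = dequantize_group(q6, s6)
```

The extra bit in int7 halves the step size, so a mixed scheme can spend it on the tensors that are most sensitive to rounding.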

Regularization

logit softcap

parameters: null
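
A logit softcap is commonly implemented as a scaled tanh that squashes logits into a bounded range while staying close to the identity near zero. The sketch below assumes that form; the cap value of 30.0 is hypothetical, since the card lists no parameters.

```python
import math

def softcap(logit, cap):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

capped = [softcap(v, cap=30.0) for v in (-1000.0, -1.0, 0.0, 1.0, 1000.0)]
```

Small logits are nearly unchanged, while extreme ones saturate at the cap, which limits how confident any single prediction can become.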

Test-Time Training

score-first TTT

parameters: {"rank":80,"learning_rate":0.0001,"chunk_size":48,"prefix_docs":1000}
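
The key property of score-first test-time training is the ordering: each chunk is scored with parameters that have not yet been adapted on it, and only then used for an update. The sketch below shows that loop with a toy scalar model. The names `score_first_ttt`, `toy_score`, and `toy_update` are hypothetical, the toy learning rate is not the card's 0.0001, and nothing here models the rank-80 setting; only the chunk size of 48 is taken from the card.

```python
def chunked(tokens, chunk_size=48):
    """Split a token list into consecutive chunks (48 per the card)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def score_first_ttt(chunks, score_fn, update_fn, state):
    """Score each chunk with the current state, then adapt on that chunk."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(state, chunk)   # 1) score: counts toward the metric
        state = update_fn(state, chunk)        # 2) adapt: affects later chunks only
    return total_loss, state

# Toy stand-ins: the "model" is a single scalar predicting every value.
def toy_score(state, chunk):
    return sum((x - state) ** 2 for x in chunk)

def toy_update(state, chunk, lr=0.1):
    grad = sum(2 * (state - x) for x in chunk)
    return state - lr * grad / len(chunk)

loss, final_state = score_first_ttt([[1.0, 1.0], [1.0, 1.0]],
                                    toy_score, toy_update, state=0.0)
```

Because the update happens after scoring, the reported loss never benefits from having trained on the tokens being evaluated.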

Evaluation

token-only n-gram tilt

parameters: {"enabled":1,"token_order":16,"token_threshold":0.8,"token_boost":2.625,"within_boost":0,"word_boost":0}
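
The card does not define how the tilt is applied, so the following is one plausible reading of the listed parameters, offered only as a sketch: when an n-gram lookup proposes a next token with confidence at or above `token_threshold`, `token_boost` is added to that token's logit before normalization. The function names and the additive form are assumptions; only the values 0.8 and 2.625 come from the card.

```python
import math

def tilt_logits(logits, ngram_token, ngram_confidence,
                threshold=0.8, boost=2.625):
    """Add a fixed boost to the n-gram's proposed token if it is confident enough.

    The additive-logit form is an assumed interpretation, not the PR's code.
    """
    tilted = list(logits)
    if ngram_token is not None and ngram_confidence >= threshold:
        tilted[ngram_token] += boost
    return tilted

def log_softmax(logits):
    """Numerically stable log-probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

base = [0.0, 1.0, 2.0]
confident = log_softmax(tilt_logits(base, ngram_token=0, ngram_confidence=0.9))
unsure = log_softmax(tilt_logits(base, ngram_token=0, ngram_confidence=0.5))
```

Under this reading, a confident n-gram match raises the log-probability of its token and lowers the others, while a match below the threshold leaves the model's distribution untouched.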

Sequence Length

sequence_length

train_length: null

eval_length: 2560

Other

other

CaseOps SP8192 lossless-caps tokenizer and byte sidecar validation scoring.

parameters: null
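
For context on the val_bpb figure, bits per byte is normally the total negative log-likelihood converted to bits and divided by the number of raw bytes scored. The sketch below assumes the byte sidecar supplies a per-token count of original bytes, which is an interpretation rather than something the card states; the function name is hypothetical.

```python
import math

def bits_per_byte(token_nll_nats, token_byte_counts):
    """Total NLL in bits divided by total original bytes."""
    total_bytes = sum(token_byte_counts)
    assert total_bytes > 0
    return sum(token_nll_nats) / (math.log(2) * total_bytes)

# Two tokens, each costing 4 bits (expressed in nats) and covering 4 bytes.
nll = [4 * math.log(2), 4 * math.log(2)]
bpb = bits_per_byte(nll, [4, 4])
```

Normalizing by bytes rather than tokens keeps scores comparable across tokenizers with different vocabularies.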

Novel Contributions

Integrated CaseOps/Gated-XSA stack with in-timer token-only n-gram tilt
LQER/GPTQ retune with g32/top4 settings
CaseOps SP8192 tokenizer and byte sidecar scoring
One-phase score-first TTT with 1000 prefix docs under the 600s cap
EMA before quantization
Self-contained evaluation logic with inlined n-gram helper