PR #2123

closed

Record: CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439

by vaibhavmishra1
val_bpb: 1.0593
Architecture: Transformer
Optimizer: (none listed)
Artifact Size: 15,991,624 B

Training Techniques

Architecture
Gated XSA
Gated XSA transformer stack with zero-init per-head gates, looping layers 3-5, and a parallel final lane from layer 8.
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
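A minimal sketch of why zero-initialized per-head gates matter, assuming each gate is a single scalar that multiplies its head's output. The entry does not say how the gate is applied, so the function name `gated_heads` and the multiplicative form are illustrative, not taken from the PR.

```python
def gated_heads(head_outputs, gates):
    """Scale each head's output vector by that head's scalar gate (assumed form)."""
    if len(head_outputs) != len(gates):
        raise ValueError("need exactly one gate per head")
    return [[g * x for x in head] for head, g in zip(head_outputs, gates)]


num_heads = 8                      # "heads": 8 from the entry above
gates = [0.0] * num_heads          # zero-init: the gated path starts switched off
heads = [[1.0, -2.0, 0.5] for _ in range(num_heads)]  # toy head outputs
initial = gated_heads(heads, gates)
```

With every gate at zero the gated path contributes nothing at step 0, so training starts from the ungated model and each head opens its gate only if that lowers the loss.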
GQA
Grouped-query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
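With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. The sketch below shows that mapping, assuming the conventional contiguous grouping; `kv_head_for` is a hypothetical helper, not a function from the PR.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Return the KV head index a query head reads from (contiguous groups assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads
    return query_head // group_size


mapping = [kv_head_for(q) for q in range(8)]
# mapping == [0, 0, 1, 1, 2, 2, 3, 3]
```

Sharing halves the number of key/value projections relative to full multi-head attention at these settings.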
SmearGate
SmearGate enabled in the attention stack.
parameters: null
Sparse Attention Gate
Sparse attention gating enabled.
parameters: null
Partial RoPE
Partial rotary positional embedding used.
parameters: null
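A sketch of the general idea behind partial RoPE: only the first `rot_dims` entries of a head vector are rotated by position, and the remainder passes through unchanged. The entry lists no parameters, so `rot_dims`, `base`, and the adjacent-pair layout are assumptions for illustration.

```python
import math


def partial_rope(vec, position, rot_dims, base=10000.0):
    """Rotate the first rot_dims entries in adjacent pairs; leave the tail untouched."""
    if rot_dims % 2 != 0 or rot_dims > len(vec):
        raise ValueError("rot_dims must be even and no larger than len(vec)")
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = position * base ** (-i / rot_dims)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


query = [1.0, 0.0, 5.0, 7.0]
rotated = partial_rope(query, position=3, rot_dims=2)  # last two dims stay 5.0, 7.0
```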
n-gram tilt
Token-only, prefix-only in-timer n-gram tilt helper inlined into training/evaluation code.
parameters: {"enabled":1,"token_order":16,"token_threshold":0.8,"token_boost":2.625}
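One plausible reading of these parameters, sketched below: an n-gram table is built only from tokens already seen (prefix-only), and when the longest matching context predicts a continuation with relative frequency of at least `token_threshold`, `token_boost` is added to that token's logit. The class `PrefixNgramTilt`, the longest-match rule, and the additive boost are assumptions; the PR's inlined helper may differ.

```python
from collections import Counter, defaultdict


class PrefixNgramTilt:
    """Toy prefix-only n-gram tilt (assumed semantics, not the PR's code)."""

    def __init__(self, order=16, threshold=0.8, boost=2.625):
        self.order = order
        self.threshold = threshold
        self.boost = boost
        self.counts = defaultdict(Counter)  # context tuple -> next-token counts

    def update(self, history):
        """Record the newest token under every context that precedes it."""
        if len(history) < 2:
            return
        target = history[-1]
        longest = min(self.order - 1, len(history) - 1)
        for n in range(1, longest + 1):
            self.counts[tuple(history[-1 - n:-1])][target] += 1

    def tilt(self, history, logits):
        """Boost the dominant continuation of the longest seen context, if confident."""
        out = list(logits)
        longest = min(self.order - 1, len(history))
        for n in range(longest, 0, -1):
            seen = self.counts.get(tuple(history[-n:]))
            if seen:
                token, count = seen.most_common(1)[0]
                if count / sum(seen.values()) >= self.threshold:
                    out[token] += self.boost
                break
        return out


helper = PrefixNgramTilt()
history = []
for token in [1, 2, 3, 1, 2, 3, 1, 2]:
    history.append(token)
    helper.update(history)
tilted = helper.tilt(history, [0.0] * 5)  # token 3 has always followed "... 1 2"
```

Because the table only ever contains tokens that precede the position being scored, the tilt uses no information from the token it is about to predict.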
Regularization
logit softcap
parameters: null
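A logit softcap is commonly written as `cap * tanh(logit / cap)`, which is close to the identity for small logits and saturates at `±cap`. The entry gives no cap value, so `cap=30.0` below is a placeholder, and the tanh form is the usual one rather than a confirmed detail of this PR.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly squash a logit so its magnitude never exceeds cap."""
    return cap * math.tanh(logit / cap)


small = softcap(0.1)      # nearly unchanged
large = softcap(1000.0)   # saturates at the cap
```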
Weight Averaging
EMA
parameters: null
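For reference, an exponential moving average keeps a shadow copy of the weights that is blended toward the live weights after each step. The decay is not listed here, so `decay=0.9` is a placeholder and `ema_update` is a hypothetical helper.

```python
def ema_update(shadow, weights, decay=0.9):
    """Blend live weights into the shadow copy: s = decay * s + (1 - decay) * w."""
    return {
        name: decay * shadow[name] + (1.0 - decay) * value
        for name, value in weights.items()
    }


shadow = {"w": 0.0}
for live in (1.0, 1.0, 1.0):          # three steps with the live weight held at 1.0
    shadow = ema_update(shadow, {"w": live})
# shadow["w"] is now about 0.271, i.e. 1 - 0.9 ** 3
```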
Quantization
GPTQ/LQER
bits: 6
scope: all
mixed int6/int7
bits: null
scope: embeddings and block weights
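To give a feel for the 6-bit setting, here is a plain round-to-nearest asymmetric group quantizer with a group size of 32 (the Novel Contributions list mentions "asym group 32"). This is a baseline sketch only: it implements neither GPTQ's error compensation nor LQER's low-rank correction, and the function names are hypothetical.

```python
def quantize_group(values, bits=6):
    """Map one group to integer codes in [0, 2**bits - 1] with its own scale and offset."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate float values from codes."""
    return [lo + c * scale for c in codes]


def quantize(weights, group_size=32, bits=6):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


weights = [i / 10 for i in range(-32, 32)]   # 64 toy weights -> 2 groups
groups = quantize(weights)
restored = [v for g in groups for v in dequantize_group(*g)]
```

Per-group scale and offset keep the rounding error of each weight within half a quantization step of that group's range.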
Test-Time Training
score-first TTT
parameters: {"rank":80,"learning_rate":0.0001,"chunk_size":48,"prefix_docs":1000}
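The ordering implied by "score-first" can be sketched with a toy model: each chunk is scored with the state produced by earlier chunks only, and the model adapts on that chunk afterwards. The smoothed unigram counter below stands in for the real adapted model; the role of `rank`, `learning_rate`, and `prefix_docs` is not modeled, and `score_first_ttt` is a hypothetical name.

```python
import math
from collections import Counter


def score_first_ttt(tokens, vocab_size, chunk_size=48):
    """Return mean bits per token when every chunk is scored before it is learned."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score: probabilities depend only on earlier chunks.
        for t in chunk:
            p = (counts[t] + 1) / (seen + vocab_size)  # add-one smoothing
            total_bits -= math.log2(p)
        # 2) Adapt: only now does the model see this chunk.
        counts.update(chunk)
        seen += len(chunk)
    return total_bits / len(tokens)


stream = [0] * 96
adapted = score_first_ttt(stream, vocab_size=4, chunk_size=48)
no_adaptation = score_first_ttt(stream, vocab_size=4, chunk_size=96)
```

Scoring before adapting means no token's loss benefits from an update that already used that token, which keeps the reported number a valid evaluation.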
Sequence Length
sequence_length
train_length: null
eval_length: 2560

Novel Contributions

  • Gated XSA + CaseOps integrated stack
  • In-timer token-only n-gram tilt
  • LQER/GPTQ retune with rank 2, asym group 32, top-k 4
  • CaseOps SP8192 tokenizer and byte sidecar scoring
  • Score-first one-phase TTT with 1000 prefix docs
  • EMA before quantization
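For readers unfamiliar with the headline metric, bits per byte normalizes the summed token loss by the number of raw bytes rather than the number of tokens, which makes results comparable across tokenizers. The sketch assumes per-token losses in nats and a per-token byte count (one reading of "byte sidecar"); both input names are hypothetical.

```python
import math


def bits_per_byte(token_nll_nats, token_byte_counts):
    """Convert summed token losses (nats) into bits per byte of source text."""
    total_bytes = sum(token_byte_counts)
    if total_bytes == 0:
        raise ValueError("no bytes to normalize by")
    return sum(token_nll_nats) / (math.log(2) * total_bytes)


# Two tokens covering 3 and 5 bytes, each costing 4 bits: 8 bits over 8 bytes.
example = bits_per_byte([4 * math.log(2), 4 * math.log(2)], [3, 5])
```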