PR #2124

open

Record : CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439

by vaibhavmishra1
val_bpb: 1.0593
Architecture: Transformer
Optimizer: not listed
Artifact Size: 15,991,624 B

Training Techniques

Architecture
XSA
Gated XSA transformer stack with zero-init per-head gates, looping layers, and a parallel final lane.
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
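The entry above does not say how the per-head gates enter the computation. One plausible reading, assumed here and not confirmed by the PR, is a scalar gate per head that starts at zero and scales that head's output through tanh, so the gated branch contributes nothing at initialization. The function name and the tanh choice are hypothetical.

```python
import math

def gated_heads(head_outputs, gate_params):
    """Scale each head's output vector by tanh(gate) for that head.

    Assumption: one scalar gate per head, squashed with tanh.
    The PR summary only says "zero-init per-head gates".
    """
    return [[math.tanh(g) * v for v in head]
            for head, g in zip(head_outputs, gate_params)]

heads = 8                          # from the parameters line above
gate_params = [0.0] * heads        # zero-init: tanh(0) == 0
head_outputs = [[1.0, -2.0, 0.5] for _ in range(heads)]
gated = gated_heads(head_outputs, gate_params)
```

With this form the network starts out as if the gated path were absent, and training moves each gate away from zero only where that head helps.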
Gated Attention
Gated attention applied across all layers.
parameters: null
SmearGate
SmearGate enabled as part of the attention stack.
parameters: null
Partial RoPE
Partial RoPE used in the transformer.
parameters: null
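Partial RoPE usually means rotating only a leading slice of each head's dimensions and passing the rest through unchanged. The sketch below assumes interleaved pairing and a base of 10000; the number of rotated dimensions is a hypothetical argument, since the PR summary gives no value.

```python
import math

def partial_rope(x, position, rotary_dims, base=10000.0):
    """Rotate the first `rotary_dims` entries of x; leave the tail untouched.

    Assumptions: pairs are (x[0], x[1]), (x[2], x[3]), ... and base=10000.
    """
    if rotary_dims % 2 != 0 or rotary_dims > len(x):
        raise ValueError("rotary_dims must be even and at most len(x)")
    out = list(x)
    for i in range(0, rotary_dims, 2):
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

x = [1.0, 0.0, 0.5, -0.5, 7.0, 9.0]
rotated = partial_rope(x, position=3, rotary_dims=4)
```

Only the first four entries carry position information here; the last two behave like position-free features.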
GQA
Grouped-query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
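With 8 query heads and 4 KV heads, each key/value head serves 2 query heads. A minimal sketch of that mapping, assuming the common convention that consecutive query heads share a KV head (the PR summary does not state the grouping order):

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Return the KV head index used by a given query head."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads   # 2 query heads per KV head here
    return query_head // group_size

mapping = [kv_head_for(q) for q in range(8)]
head_dim = 512 // 8   # dim and heads from the XSA entry: 64 per head
```

Sharing KV heads halves the key/value projection width relative to full multi-head attention at the same `dim`.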
Weight Averaging
EMA
parameters: null
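An exponential moving average keeps a smoothed copy of the weights alongside the live ones. The sketch below shows a single update step; the decay value is hypothetical because the entry lists no parameters.

```python
def ema_update(ema, weights, decay):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.9)  # decay is an assumed value
```

Per the contributions list, the averaged weights are taken before quantization, so it is the smoothed copy that would be quantized.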
Quantization
GPTQ
bits: 6
scope: all
mixed int6/int7
bits: null
scope: embeddings and block weights
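To make the bit widths concrete, here is plain symmetric round-to-nearest quantization of one weight row. This is not GPTQ, which additionally compensates rounding error using second-order statistics, and it is not the PR's LQER setup; it only illustrates what a 6-bit or 7-bit integer grid looks like. Function names are hypothetical.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 63 for int7
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

row = [0.5, -0.25, 0.0, 0.125]
codes6, scale6 = quantize_row(row, bits=6)
restored = dequantize_row(codes6, scale6)
```

Each restored weight differs from the original by at most half a quantization step, which is the error that GPTQ-style methods then try to redistribute.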
Regularization
logit softcap
parameters: null
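Logit softcapping is commonly written as `cap * tanh(logit / cap)`, which leaves small logits nearly unchanged and bounds large ones. A minimal sketch; the cap value of 30.0 is an assumption, since the entry lists no parameters.

```python
import math

def softcap(logit, cap):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

capped = [softcap(v, cap=30.0) for v in (-1000.0, -1.0, 0.0, 1.0, 1000.0)]
```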
Test-Time Training
score-first TTT
parameters: {"rank":80,"learning_rate":0.0001,"chunk_size":48,"prefix_docs":1000}
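"Score-first" is read here as: each chunk is scored with the model state from before any update on that chunk, and only then used for adaptation. The sketch below shows just that ordering with a toy state; the scoring and update callables are hypothetical stand-ins, and the toy uses a chunk size of 2 where the PR uses 48.

```python
def score_first_ttt(tokens, chunk_size, score_fn, update_fn, state):
    """Score each chunk with the pre-update state, then adapt on it."""
    total = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total += score_fn(state, chunk)    # no peeking: score before update
        state = update_fn(state, chunk)    # adapt only after scoring
    return total, state

# Toy stand-ins: the "model" is the set of tokens seen so far,
# and the "loss" is the count of tokens it has not seen yet.
def count_unseen(state, chunk):
    return sum(1 for t in chunk if t not in state)

def remember(state, chunk):
    return state | set(chunk)

loss, final_state = score_first_ttt([1, 2, 1, 2, 3, 3], 2,
                                    count_unseen, remember, set())
```

If the update ran before scoring, the toy loss would be zero, which is exactly the leakage the score-first ordering avoids.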
Evaluation
token-only n-gram tilt
parameters: {"enabled":1,"token_order":16,"token_threshold":0.8,"token_boost":2.625,"within_boost":0,"word_boost":0}
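The summary does not define the tilt itself. One reading consistent with the parameter names, assumed here and not confirmed by the PR, is: when an n-gram lookup proposes a next token with confidence at or above `token_threshold`, add `token_boost` to that token's log-probability and renormalize. The order-16 context lookup is omitted, and the function name is hypothetical.

```python
import math

def tilt_log_probs(log_probs, hinted_token, confidence,
                   threshold=0.8, boost=2.625):
    """Boost one token's log-probability, then renormalize.

    Assumption: the boost is additive in log space and applies only
    when the n-gram hint's confidence reaches the threshold.
    """
    if hinted_token is None or confidence < threshold:
        return list(log_probs)
    tilted = list(log_probs)
    tilted[hinted_token] += boost
    m = max(tilted)
    log_z = m + math.log(sum(math.exp(v - m) for v in tilted))
    return [v - log_z for v in tilted]

uniform = [math.log(0.25)] * 4
boosted = tilt_log_probs(uniform, hinted_token=2, confidence=0.9)
skipped = tilt_log_probs(uniform, hinted_token=2, confidence=0.5)
```

Renormalizing keeps the result a valid distribution, so the tilted scores remain comparable in bits per byte.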
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Other
other
CaseOps SP8192 lossless-caps tokenizer and byte sidecar validation scoring.
parameters: null
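Bits per byte divides the total negative log-likelihood, converted to bits, by the number of bytes in the original text. The sketch assumes the byte sidecar supplies a byte count per scored token, so the denominator reflects the raw text and not the tokenizer's own units; that reading of "sidecar" is an assumption.

```python
import math

def bits_per_byte(token_nll_nats, byte_counts):
    """Total NLL in bits divided by total bytes of the original text."""
    total_bytes = sum(byte_counts)
    if total_bytes <= 0:
        raise ValueError("need at least one byte")
    return sum(token_nll_nats) / (math.log(2) * total_bytes)

# Two tokens, each costing 2 bits, covering 1 and 3 bytes of text.
bpb = bits_per_byte([2 * math.log(2), 2 * math.log(2)], [1, 3])
```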

Novel Contributions

  • Integrated CaseOps/Gated-XSA stack with in-timer token-only n-gram tilt
  • LQER/GPTQ retune with g32/top4 settings
  • CaseOps SP8192 tokenizer and byte sidecar scoring
  • One-phase score-first TTT with 1000 prefix docs under the 600s cap
  • EMA before quantization
  • Self-contained evaluation logic with inlined n-gram helper