## Summary

| Metric | Value |
| --- | --- |
| val_bpb | 1.1215 |
| Architecture | Transformer |
| Optimizer | SGD |
| Artifact Size | 15.56 MB |
## Training Techniques

### Quantization

- **GPTQ** (bits: 6, scope: weights)
- **QAT** (bits: 6, scope: weights)
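A minimal sketch of the export quantizer shape implied above: symmetric per-row int6 weight quantization. The Hessian-aware error compensation of full GPTQ is omitted; values and rounding here are illustrative, not the submission's actual code.

```python
# Symmetric per-row int6 quantization sketch (GPTQ's Hessian-aware
# error compensation is intentionally omitted for brevity).

def quantize_row_int6(row):
    """Quantize one weight row to signed int6 with a per-row scale."""
    qmax = 2 ** (6 - 1) - 1                  # int6 symmetric range: [-31, 31]
    amax = max(abs(w) for w in row) or 1.0   # guard against an all-zero row
    scale = amax / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.12, -0.5, 0.03, 0.31]
q, s = quantize_row_int6(row)
deq = dequantize_row(q, s)
```

Each dequantized value lands within half a quantization step of the original, which is the error budget QAT clipping is matched against.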
### Architecture

- **Partial RoPE**: applies rotary positional embeddings to only a subset of the dimensions. Parameters: {"dimensions": 16, "base_dimensions": 64}
- **XSA**: uses XSA in the last 4 layers. Parameters: {"layers": 4}
- **SmearGate**: adds SmearGate to the MLP/activation path.
- **BigramHash**: adds a bigram hashing component with 2048 buckets. Parameters: {"buckets": 2048}
- **MLP3x**: uses 3x MLP expansion with relu². Parameters: {"expansion": 3}
- **Tied embeddings**: input and output embeddings are tied.
- **KV head count**: grouped-query attention with 4 KV heads. Parameters: {"kv_heads": 4, "heads": 8}
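A sketch of the partial-RoPE idea from the list above, assuming the listed parameters (rotate 16 of 64 dims) and a half-split pair layout `(i, i + half)`; the actual pairing convention and frequency base in the submission may differ.

```python
import math

# Partial RoPE sketch: rotate only the first 16 of 64 head dimensions
# (8 rotary pairs); the remaining 48 dims pass through unchanged.

def partial_rope(x, pos, rot_dims=16, theta=10000.0):
    """Apply rotary embedding to the first `rot_dims` entries of vector x."""
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        freq = theta ** (-2 * i / rot_dims)   # per-pair rotation frequency
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]              # pair layout: (i, i + half)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

v = [1.0] * 64
rotated = partial_rope(v, pos=3)
```

The rotation is norm-preserving per pair, and dimensions 16..63 are untouched, so positional information only mixes into the rotated slice.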
### Weight Averaging

- **EMA**: parameters: {"decay": 0.995}
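The EMA update with decay 0.995 can be sketched as follows; the static "current" weights are a toy stand-in for whatever the optimizer produces each step.

```python
# EMA weight averaging sketch with the listed decay of 0.995: after each
# optimizer step, shadow weights move a small fraction toward the
# current weights.

def ema_update(shadow, current, decay=0.995):
    return [decay * s + (1.0 - decay) * c for s, c in zip(shadow, current)]

shadow = [0.0, 0.0]
for step in range(3):
    current = [1.0, 2.0]          # toy: pretend trained weights are static
    shadow = ema_update(shadow, current)
```

After k steps against fixed weights w, the shadow equals (1 - 0.995^k) * w, so the average warms up toward the live weights geometrically.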
### Optimizer

- **SGD**: weight_decay: none, momentum: 0.9. Other parameters: {"epochs_per_chunk": 3, "grad_clip": 1}
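A sketch of the listed optimizer settings: SGD with momentum 0.9, no weight decay, and gradient-norm clipping at 1. The learning rate here is illustrative (borrowed from the TTT settings below); the submission's base LR is not stated.

```python
# SGD-with-momentum sketch matching the listed settings
# (momentum 0.9, no weight decay, grad_clip 1).

def clip_grad(grad, max_norm=1.0):
    norm = sum(g * g for g in grad) ** 0.5
    if norm > max_norm:
        grad = [g * max_norm / norm for g in grad]
    return grad

def sgd_step(w, grad, buf, lr=0.002, momentum=0.9):
    grad = clip_grad(grad)
    buf = [momentum * b + g for b, g in zip(buf, grad)]   # velocity update
    w = [wi - lr * bi for wi, bi in zip(w, buf)]
    return w, buf

w, buf = [1.0], [0.0]
w, buf = sgd_step(w, [10.0], buf)   # the size-10 gradient is clipped to norm 1
```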
### Compression

- **zstd**: level: 22
### Evaluation

- **Sliding window eval**: parameters: {"stride": 32}
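One common way sliding-window evaluation with stride 32 is organized is sketched below: each window carries full left context, but only the last `stride` tokens of each window are scored, so every token is counted exactly once. The window size of 128 is an assumption; only the stride is given above.

```python
# Sliding-window evaluation layout sketch (stride 32; the window size
# of 128 is an illustrative assumption, not a listed parameter).

def eval_windows(n_tokens, window=128, stride=32):
    """Yield (ctx_start, end, score_from): score tokens in [score_from, end)."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)   # context for this window
        spans.append((ctx_start, end, pos))
        pos = end
    return spans

spans = eval_windows(100)
```

The invariant worth testing is that the scored spans tile the sequence with no gaps or overlaps.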
### Test-Time Training

- **Score-first TTT**: parameters: {"epochs": 8, "learning_rate": 0.002, "momentum": 0.9}
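The "score-first" ordering can be sketched as a loop: each test chunk is scored before the model trains on it, so no chunk's score ever benefits from having been trained on that chunk. The score and train functions below are toy stand-ins (counters), not the real model.

```python
# Score-first TTT loop sketch: score each chunk BEFORE adapting on it.
# score_fn / train_fn are hypothetical stand-ins for EMA-weight scoring
# and the SGD update, respectively.

def score_first_ttt(chunks, score_fn, train_fn, epochs=8):
    scores = []
    for chunk in chunks:
        scores.append(score_fn(chunk))   # score first: no leakage from training
        for _ in range(epochs):
            train_fn(chunk)              # then adapt on that chunk
    return scores

# Toy check: scoring sees only updates from *earlier* chunks.
state = {"trained": 0}
def score_fn(chunk): return state["trained"]
def train_fn(chunk): state["trained"] += 1
scores = score_first_ttt([1, 2, 3], score_fn, train_fn, epochs=8)
```

With 8 epochs per chunk, the toy scores come out as the number of prior updates, confirming the chunk being scored contributed none of them.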
### LR Schedule

- **Cosine decay**: parameters: {"over_actual_training_window": true, "chunks": 200}
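A sketch of cosine decay spread over the actual 200-chunk training window: the LR falls from its base value to zero along a half cosine. The base LR of 0.002 is taken from the TTT settings above and may not match the main run.

```python
import math

# Cosine LR decay sketch over the listed 200-chunk training window.

def cosine_lr(chunk, base_lr=0.002, total_chunks=200):
    t = min(chunk, total_chunks) / total_chunks   # progress in [0, 1]
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))

lrs = [cosine_lr(c) for c in (0, 100, 200)]   # start, midpoint, end
```

The schedule starts at the full base LR, passes through exactly half of it at the midpoint, and reaches zero at chunk 200.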
### Regularization

- **Embedding freeze**: parameters: {"frozen_components": ["tok_emb", "bigram", "ve_shared"]}
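Freezing during TTT amounts to skipping updates for the named components; a minimal sketch, using the listed frozen_components as parameter names:

```python
# Embedding-freeze sketch: parameters whose names are in the frozen set
# simply receive no update. Names match the listed frozen_components.

FROZEN = {"tok_emb", "bigram", "ve_shared"}

def apply_updates(params, updates, frozen=FROZEN):
    """params/updates: dicts of name -> value; frozen names are skipped."""
    return {
        name: (val if name in frozen else val - updates.get(name, 0.0))
        for name, val in params.items()
    }

params = {"tok_emb": 1.0, "attn_w": 1.0}
new = apply_updates(params, {"tok_emb": 0.5, "attn_w": 0.5})
```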
### Initialization

- **OrthoInit**: orthogonal initialization.
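Orthogonal initialization can be sketched by orthonormalizing a random Gaussian matrix with Gram-Schmidt; production code typically uses a QR decomposition (and a gain factor) instead, so treat this as illustrative.

```python
import random

# Orthogonal initialization sketch: draw a random Gaussian matrix and
# orthonormalize its rows with classical Gram-Schmidt.

def ortho_init(n, seed=0):
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
    basis = []
    for r in rows:
        for b in basis:
            dot = sum(x * y for x, y in zip(r, b))
            r = [x - dot * y for x, y in zip(r, b)]   # remove projection
        norm = sum(x * x for x in r) ** 0.5
        basis.append([x / norm for x in r])
    return basis

W = ortho_init(4)   # rows of W are orthonormal
```

The payoff is that the resulting matrix preserves vector norms, which keeps activations well-scaled at the start of training.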
## Novel Contributions

- GPTQ quantization with Hessian-aware error compensation for int6 per-row weight quantization
- Early QAT with clipping matched to the GPTQ export quantizer
- Legal score-first TTT with EMA scoring and a cosine-LR fix
- Embedding freezing during TTT
- Reduced the quantization tax from 0.0082 BPB to 0.0058 BPB