val_bpb
1.0576
Architecture
Transformer
Optimizer
Muon
Artifact Size
<=16MB
Training Techniques
Architecture
SmearGate
BOS-fixed smear gate used in the attention stack.
parameters: null
Gated Attention
Sparse attention gate with a configurable recent-token window.
parameters: {"gate_window":8,"scale":0.5}
XSA
XSA applied across all layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary position embeddings.
parameters: {"dimensions":16}
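The entry gives only the rotary dimension count. As a minimal sketch, assuming the first 16 dimensions of each head vector are rotated in adjacent pairs and the remaining dimensions pass through unchanged, partial RoPE could look like the following. The function name `partial_rope`, the base of 10000, and the adjacent-pair layout are illustrative assumptions, not details taken from this record.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched.

    Assumed layout: adjacent pairs (0, 1), (2, 3), ... each share one angle.
    """
    if rotary_dims % 2 != 0 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and no larger than len(vec)")
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # Frequency falls off geometrically with the pair index.
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the untouched dimensions carry no positional signal.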
Depth recurrence
Layers 3-5 are looped; recurrence is switched on partway through training.
parameters: {"layers":[3,4,5],"frac":0.35}
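To make the looping concrete, here is a hedged sketch of the layer execution order. It assumes an 11-layer stack (the count listed under XSA), that the looped block runs twice once recurrence is active, and that `frac` is the fraction of training steps after which recurrence turns on. The loop count, the meaning of `frac`, and both function names are assumptions for illustration.

```python
def layer_schedule(num_layers=11, loop_layers=(3, 4, 5), num_loops=2, recurrence_on=True):
    """Return the order in which layer indices are executed in one forward pass."""
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                 # layers before the looped block
    repeats = num_loops if recurrence_on else 1
    for _ in range(repeats):
        order.extend(loop_layers)              # looped block, possibly repeated
    order.extend(range(last + 1, num_layers))  # layers after the looped block
    return order

def recurrence_enabled(step, total_steps, frac=0.35):
    """Assumed gating rule: recurrence switches on after `frac` of training."""
    return step >= frac * total_steps
```

With recurrence off, the schedule is the plain 0..10 order, so the same weights serve both phases and only the execution order changes.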
Parallel decoder
The parallel lane begins at layer 8, and the final lane is averaged.
parameters: {"start_layer":8}
Quantization
GPTQ
bits: 6
scope: matrix weights
int7
bits: 7
scope: embeddings
mixed int6/int8
bits: null
scope: model weights
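For a sense of what the bit widths above imply, the sketch below shows plain symmetric round-to-nearest quantization with one scale per group of values. This is deliberately not GPTQ, which additionally compensates rounding error using second-order statistics; it only illustrates the integer grids (for example, codes in [-31, 31] at 6 bits and [-63, 63] at 7 bits). The function names and the per-group scale choice are assumptions.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits, 63 for 7 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]
```

The reconstruction error per value is bounded by half the scale, so halving the grid (one fewer bit) roughly doubles the worst-case error for the same group.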
Optimizer
Muon
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99,"matrix_lr":0.026,"min_lr":0.1}
Weight Averaging
EMA
parameters: {"decay":0.9965}
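A minimal sketch of the exponential moving average update with the listed decay follows. The function name and the flat-list representation of the weights are illustrative assumptions.

```python
def ema_update(avg, params, decay=0.9965):
    """Blend the running average toward the current parameters."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

With a decay of 0.9965, each step moves the average 0.35% of the way toward the current weights, which corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 steps.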
Compression
pergroup
level: null
Evaluation
stride-based eval
parameters: {"stride":1536}
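One common way to implement stride-based evaluation is a sliding window in which each window advances by the stride and only the tokens not yet scored contribute to the loss. The sketch below assumes that scheme, with a window equal to the listed eval length of 3072; the function name and the returned tuple layout are hypothetical.

```python
def strided_windows(num_tokens, window=3072, stride=1536):
    """Return (start, end, score_from) triples covering every token exactly once.

    The model sees tokens [start, end); only [score_from, end) is scored.
    """
    spans = []
    scored_until = 0
    start = 0
    while scored_until < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_until))
        scored_until = end
        start += stride
    return spans
```

Under these assumptions, every scored token after the first window has at least 1536 tokens of preceding context, and the scored ranges tile the token stream without gaps or overlap.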
Test-Time Training
LoRA TTT
parameters: {"rank":80,"learning_rate":0.0001,"local_lr_mult":0.75,"mask":"no_qv","short_score_first":true,"short_doc_len":2000,"short_chunk_size":24,"prefix_docs":2500,"num_phases":1}
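The record does not define `short_score_first`. Assuming it means that each chunk of a short document is scored before the adapter is updated on that chunk (so no token is scored by weights that have already trained on it), the control flow can be sketched as below. `score_fn` and `update_fn` are hypothetical callbacks standing in for the model's loss computation and the LoRA gradient step, and the chunk size of 24 is taken from `short_chunk_size`.

```python
def split_chunks(tokens, chunk_size=24):
    """Split a token list into consecutive chunks of at most `chunk_size`."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def score_first_ttt(tokens, score_fn, update_fn, chunk_size=24):
    """Score each chunk with the current adapter, then adapt on that chunk."""
    total_loss = 0.0
    for chunk in split_chunks(tokens, chunk_size):
        total_loss += score_fn(chunk)  # evaluate before any update on this chunk
        update_fn(chunk)               # only afterwards train the adapter on it
    return total_loss
```

The ordering is the whole point of the sketch: swapping the two calls inside the loop would let the adapter see tokens before they are scored.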
Sequence Length
sequence_length
train_length: 3072
eval_length: 3072
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
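A hedged sketch of a warmdown schedule consistent with the listed values: the learning rate is held at its peak for the first 15% of training, then decays linearly over the final 85%. The sketch assumes that `min_lr` of 0.1 is a floor expressed as a fraction of the peak and uses `matrix_lr` of 0.026 as the peak; the linear decay shape and the function name are also assumptions.

```python
def lr_at(step, total_steps, peak_lr=0.026, warmdown_frac=0.85, min_lr_ratio=0.1):
    """Constant learning rate, then a linear warmdown to min_lr_ratio * peak_lr."""
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step <= warmdown_start:
        return peak_lr
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    progress = min(progress, 1.0)
    return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)
```

Under these assumptions the final learning rate is 0.0026, one tenth of the peak.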
Regularization
weight decay
parameters: {"value":0.5}
Novel Contributions
- Changes only GATE_WINDOW from 12 to 8 relative to PR #2014.
- Keeps the PR #2014 stack otherwise byte-for-byte identical for a clean ablation.
- Uses full validation coverage with val_tokens matching target_tokens in all reported seeds.
- Applies score-first short-doc phased LoRA TTT on the CaseOps/SP8192 stack.