val_bpb: 1.0576
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.98 MB

Training Techniques
Sequence Length
train_length: 3072
eval_length: 3072
LR Schedule: warmdown
parameters: {"warmdown_frac":0.85}
Test-Time Training: LoRA TTT
parameters: {"rank":80,"mask":"no_qv","local_lr_mult":0.75,"short_score_first":true}
Architecture
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"dimensions":16}
depth recurrence
Layers are looped recurrently in the middle of the network.
parameters: {"layers":[3,4,5],"frac":0.35}
XSA
Applies XSA across all 11 layers.
parameters: {"layers":11}
SmearGate
BOS-fixed SmearGate gating is used.
parameters: null
Gated Attention
Gated attention implemented via SparseAttnGate.
parameters: {"gate_window":12,"scale":0.5}
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
Optimizer: Muon
weight_decay: null
momentum: null
other_params: {"adam_on_embedding_scalars":true,"beta2":0.99}
Weight Averaging: EMA
parameters: {"decay":0.9965}
Quantization
GPTQ
bits: 6
scope: matrices
mixed int7/int6
bits: 7 (embeddings) / 6 (matrices)
scope: embeddings and matrices
LQER
bits: 4
scope: correction
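
LQER-style correction keeps a low-rank reconstruction of the quantization error alongside the quantized weight; the card's 4-bit "correction" scope would apply to storing those factors. A sketch of the idea, with the rank chosen for illustration (the card does not state it):

```python
import torch

def lqer_correction(w: torch.Tensor, quantize, rank: int = 32):
    """Quantize W, then approximate the residual W - Q(W) with a rank-`rank` factor pair.

    Inference reconstructs the weight as Q(W) + a @ b; a and b are what a 4-bit
    "correction" quantization would be applied to.
    """
    w_q = quantize(w)                                # any weight quantizer (e.g. per-group int)
    u, s, vh = torch.linalg.svd((w - w_q).float(), full_matrices=False)
    a = u[:, :rank] * s[:rank]                       # (out_features, rank)
    b = vh[:rank, :]                                 # (rank, in_features)
    return w_q, a, b
```
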
Compression: pergroup
level: null
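
Per-group compression here presumably means one quantization scale per small group of weights rather than per tensor or per channel; the group size below is illustrative, since the level field is null.

```python
import torch

def quantize_pergroup(w: torch.Tensor, bits: int = 6, group_size: int = 64):
    """Symmetric per-group quantization: one scale per contiguous group of weights.

    Assumes group_size divides the input dimension. Dequantize with q * scales.
    """
    out_f, in_f = w.shape
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-12)
    q = torch.clamp(torch.round(groups / scales), min=-qmax - 1, max=qmax)
    return q.to(torch.int8), scales
```
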
Evaluation: stride-based eval
parameters: {"stride":1536,"context_length":3072}
Regularization: weight decay
parameters: {"value":0.5}
Novel Contributions
- Progressive training-context schedule from 1k to 3k context (sketched after this list)
- Short-document score-first TTT chunk schedule
- Long-context TTT mask removing Q/V adapters
- Combined recurrent-transformer RT-KV experiment on the CaseOps/SP8192 lineage
- Maintains full validation target coverage while staying under the artifact cap
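
A sketch of the progressive context schedule mentioned above, assuming a simple linear ramp from 1024 to 3072 tokens over training; the actual shape and granularity of the schedule are not stated.

```python
def train_context_length(step: int, total_steps: int,
                         start_len: int = 1024, end_len: int = 3072,
                         round_to: int = 128) -> int:
    """Linearly grow the training context from ~1k to 3k tokens, rounded to a block size."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    length = start_len + frac * (end_len - start_len)
    return int(round(length / round_to)) * round_to
```
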