val_bpb: 1.0591
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,984,508 bytes
Training Techniques
- Test-Time Training: score-first TTT
  parameters: {"local_lr_mult":0.85,"prefix_docs":2750,"num_phases":3,"mask":"no_qv","q_lora":0,"v_lora":0}
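As a rough illustration of the phased, masked TTT loop these parameters describe, here is a toy sketch in pure Python. The scalar model, the squared-error "doc loss", and the freezing of `q_`/`v_`-prefixed parameters (one reading of mask="no_qv", consistent with q_lora and v_lora being 0) are all assumptions; the score-first selection of the prefix_docs neighborhood is omitted.

```python
def phased_ttt(weights, docs, base_lr=0.1, local_lr_mult=0.85,
               num_phases=3, frozen_prefixes=("q_", "v_")):
    # Toy test-time training: adapt a copy of the weights on a small set
    # of selected documents, with the local LR scaled by local_lr_mult.
    # Each "phase" is one pass over the docs; masked parameters stay frozen.
    w = dict(weights)                      # adapt a copy, keep the original
    lr = base_lr * local_lr_mult           # locally scaled learning rate
    for _ in range(num_phases):
        for target in docs:               # one squared-error "doc loss" each
            for name in w:
                if name.startswith(frozen_prefixes):
                    continue               # mask="no_qv": skip Q/V params
                grad = 2.0 * (w[name] - target)
                w[name] -= lr * grad
    return w

base = {"q_proj": 1.0, "v_proj": 1.0, "mlp": 1.0}
adapted = phased_ttt(base, docs=[0.0, 0.0])
# q/v stay frozen while mlp moves toward the doc target
```

In a real run the "docs" would be the 2,750 selected prefix documents and the update would be a language-modeling gradient step, but the control flow (phases, masking, scaled local LR) is the same shape.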
Sequence Length
- train_length: null
- eval_length: 2560
Quantization
- GPTQ-lite
  bits: 8
  scope: model
Regularization
- weight decay
  parameters: {"weight_decay":0.5}
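With AdamW, weight_decay=0.5 is applied in decoupled form: the decay term acts directly on the weight, separately from the gradient-based update. A minimal single-step sketch (moment estimates omitted; the helper name is ours):

```python
def adamw_step(w, grad, lr=1e-3, weight_decay=0.5):
    # Decoupled (AdamW-style) weight decay: shrink the weight directly,
    # rather than folding the decay term into the gradient.
    w = w - lr * weight_decay * w      # decay step
    w = w - lr * grad                  # gradient step (moments omitted)
    return w
```

With grad=0 the weight still shrinks each step, which is the defining difference from L2 regularization folded into the loss.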
LR Schedule
- warmdown
  parameters: {"warmdown_frac":0.85}
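A common warmdown shape, sketched here as an assumption about what warmdown_frac controls, holds the learning rate constant and then decays it linearly to zero over the final fraction of training:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.85):
    # Constant LR for the first (1 - warmdown_frac) of training, then
    # linear decay to zero over the final warmdown_frac of steps.
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```

With warmdown_frac=0.85, only the first 15% of steps run at the full rate; the remaining 85% ramp down.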
Architecture
- SmearGate: gate mechanism used in the lineage model stack.
  parameters: {"window":12}
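The card gives only the name and window=12. One plausible, purely hypothetical reading is a gate that "smears" each position by blending it with the mean of up to `window` preceding positions:

```python
def smear_gate(x, gate=0.1, window=12):
    # Hypothetical "smear": blend each position with the mean of up to
    # `window` preceding positions, weighted by a scalar gate. In a real
    # model the gate would be learned per channel; here it is a constant.
    out = []
    for i, v in enumerate(x):
        prev = x[max(0, i - window):i]
        if prev:
            v = (1.0 - gate) * v + gate * (sum(prev) / len(prev))
        out.append(v)
    return out
```

The first position has no predecessors and passes through unchanged.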
- Gated Attention: sparse attention gating used in the lineage model stack.
  parameters: {"scale":0.5}
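How the scale=0.5 parameter enters is not specified here. The sketch below assumes it rescales attention scores before the softmax, with a scalar output gate standing in (crudely) for a learned sparse gate:

```python
import math

def gated_attention_weights(scores, scale=0.5, gate=1.0):
    # Hypothetical sketch: temper the raw scores by `scale`, softmax,
    # then multiply by a gate. A gate < 1 attenuates the whole head's
    # contribution; per-position learned gates would zero out entries.
    scaled = [s * scale for s in scores]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [gate * e / z for e in exps]
```

With gate=1.0 the weights still sum to one; smaller scale flattens the distribution, smaller gate shrinks it.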
- weight tying: tied input/output embeddings in the lineage stack.
  parameters: null
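Weight tying shares the token-embedding table with the output projection, so logits are dot products of the hidden state against the same vectors used to embed tokens, and the artifact stores one matrix instead of two. A minimal sketch:

```python
class TiedLM:
    # Weight tying: the input embedding table doubles as the output
    # projection, so embed() and logits() read the same matrix.
    def __init__(self, emb):
        self.emb = emb                    # vocab x dim table

    def embed(self, token_id):
        return self.emb[token_id]

    def logits(self, hidden):
        # One logit per vocab row: dot(hidden, row)
        return [sum(h * e for h, e in zip(hidden, row)) for row in self.emb]

lm = TiedLM([[1.0, 0.0], [0.0, 1.0]])
```

Under a hard artifact-size cap, tying is a straightforward way to cut parameter count.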
- RoPE: rotary positional encoding for attention in the lineage stack.
  parameters: null
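RoPE rotates each consecutive pair of query/key dimensions by an angle that grows with position and shrinks with pair index; being a rotation, it preserves vector norms, and relative position falls out of the dot product. A minimal reference sketch:

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotary position embedding: rotate each (even, odd) dimension pair
    # by theta = pos / base**(i/d). Position 0 is the identity.
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

Because the transform is applied to queries and keys (not values), attention scores depend only on relative offsets between positions.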
Other
- AWQ-lite post-training quantization used to fit the artifact budget.
  parameters: {"enabled":true,"bits":8,"group_size":64}
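The exact "AWQ-lite" procedure is not described here; the sketch below shows only generic group-wise symmetric round-to-nearest quantization with the listed bits=8 and group_size=64 (the activation-aware scale search that gives AWQ its name is omitted):

```python
def quantize_groups(weights, bits=8, group_size=64):
    # Symmetric round-to-nearest quantization per group of `group_size`
    # weights: each group stores integer codes plus one float scale.
    qmax = 2 ** (bits - 1) - 1             # 127 for 8 bits
    groups = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        scale = max(abs(w) for w in g) / qmax or 1.0  # avoid zero scale
        codes = [round(w / scale) for w in g]
        groups.append((scale, codes))
    return groups

def dequantize_groups(groups):
    return [c * s for s, codes in groups for c in codes]

groups = quantize_groups([0.5, -1.0, 0.25, 0.0])
restored = dequantize_groups(groups)
```

Per-group scales bound the round-trip error by half a quantization step within each group, at the storage cost of one extra float per 64 weights.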
- Asymmetric logit rescaling used in the lineage stack.
  parameters: {"enabled":true}
- QK gain initialization used in the lineage stack.
  parameters: {"qk_gain_init":5.25}
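One hypothetical reading of qk_gain_init, sketched below, is a learnable scalar gain on the query-key dot product, initialized to 5.25 on top of the usual 1/sqrt(d) scaling; the class and its placement in the attention score are assumptions:

```python
import math

class QKGain:
    # Hypothetical learnable scalar gain on the query-key dot product,
    # initialized from qk_gain_init and trained with the model.
    def __init__(self, qk_gain_init=5.25):
        self.gain = qk_gain_init

    def score(self, q, k):
        # Standard scaled dot-product score, multiplied by the gain.
        dot = sum(a * b for a, b in zip(q, k))
        return self.gain * dot / math.sqrt(len(q))
```

A larger initial gain sharpens the softmax over attention scores from the first step, rather than waiting for the weights to grow.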
Novel Contributions
- Final-day rule-compliant ("legal") TTT neighborhood selection with TTT_LOCAL_LR_MULT=0.85
- Prefix2750 score-first phased TTT evaluation
- Three-seed verification under the strict 600 s training/evaluation and 16 MB artifact limits
- Continuation of the PR #1953 / PR #1945 lineage with legal phased TTT