val_bpb: 1.0884
Architecture: Transformer
Optimizer: Muon (detailed under Optimizer below)
Artifact Size: 15,916,403 bytes
Training Techniques

Test-Time Training
- score-first TTT, parameters: {"rank":80,"phases":3,"prefix_docs":2500}
- LoRA TTT, parameters: {"rank":80}
Weight Averaging
- EMA, parameters: {"decay":0.02}
Regularization
- weight decay, parameters: {"value":0.5}
Architecture

SmearGate
- Gated attention mechanism inherited from the prior lineage, parameters: {"enabled":true}

Gated Attention
- Attention gating with quant-gate and smear-gate components, parameters: {"quant_gate":true,"window":12}
Quantization

GPTQ
- bits: 7, scope: embeddings
- bits: 4, scope: LQER factors
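A simplified one-shot GPTQ pass showing the core of the method: columns are quantized left to right and each column's rounding error is compensated on the not-yet-quantized columns via the inverse Hessian. Act-order, grouping, and the lazy-batched Cholesky formulation of the full algorithm are omitted, and the calibration inputs X are assumed.

```python
import torch

def gptq_quantize(W, X, bits=7, damp=0.01):
    """W: (out, in) weights; X: (in, n_samples) calibration activations."""
    W = W.clone()
    H = 2 * X @ X.T                                    # layer-wise Hessian
    H += damp * H.diagonal().mean() * torch.eye(H.shape[0])
    Hinv = torch.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().max(dim=1, keepdim=True).values / qmax  # per-row symmetric grid
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        q = torch.clamp((W[:, j:j+1] / scale).round(), -qmax, qmax) * scale
        Q[:, j:j+1] = q
        err = (W[:, j:j+1] - q) / Hinv[j, j]
        W[:, j+1:] -= err @ Hinv[j:j+1, j+1:]     # push error onto later columns
    return Q
```

Per the record, the embeddings would be quantized with bits=7 and the LQER factors with bits=4.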
Compression

lrzip
- level: unspecified (tool default)
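With no level recorded, the artifact is presumably packed with lrzip's defaults. A minimal invocation; the filename is a placeholder:

```python
import subprocess

# Pack the final artifact with lrzip's default settings;
# lrzip writes artifact.bin.lrz alongside the input file.
subprocess.run(["lrzip", "artifact.bin"], check=True)
```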
Optimizer

Muon
- weight_decay: 0.5
- momentum: 0.9
- other_params: {"global_ttt_momentum":0.9,"warmup_steps":20,"muon_backend_steps":5}
LR Schedule

warmdown
- parameters: {"warmdown_frac":0.85}
Novel Contributions
- Adaptive loss-gated weighting for rules-legal, score-first phased TTT
- Each validation chunk is scored before any adaptation takes place
- The per-document losses already recorded during scoring are reused to weight the same chunk's subsequent LoRA adaptation objective (see the sketch after this list)
- Non-record reproduction, on 4xH100, of an above-baseline TTT variant built on PR #1855
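A sketch of the loss-gated, score-first loop described above: every document in a chunk is scored before any adaptation, so the reported losses are never contaminated, and those same losses then gate the LoRA objective on that chunk. model.loss, the softmax gating, and tau are hypothetical; how phases=3 and prefix_docs=2500 partition the adaptation is not specified by the record, so a single phase is shown.

```python
import torch
import torch.nn.functional as F

def score_then_adapt(model, docs, lora_opt, tau=1.0):
    """One score-first TTT step on a validation chunk (single phase)."""
    # Phase 1: score first; these are the losses counted toward val_bpb.
    with torch.no_grad():
        doc_losses = torch.stack([model.loss(d) for d in docs])

    # Phase 2: adapt the LoRA factors on the same chunk, weighting each
    # document by its already-recorded loss (harder documents weigh more).
    weights = F.softmax(doc_losses / tau, dim=0)
    lora_opt.zero_grad()
    adapt_loss = sum(w * model.loss(d) for w, d in zip(weights, docs))
    adapt_loss.backward()
    lora_opt.step()
    return doc_losses
```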