PR #2074

open

Add non-record loss-gated TTT 4xH100 run

by sanilb19
val_bpb
1.0884
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,916,403 bytes

Training Techniques

Test-Time Training
score-first TTT
parameters: {"rank":80,"phases":3,"prefix_docs":2500}
LoRA TTT
parameters: {"rank":80}
Weight Averaging
EMA
parameters: {"decay":0.02}
Regularization
weight decay
parameters: {"value":0.5}
Architecture
SmearGate
Gated attention mechanism inherited from prior lineage
parameters: {"enabled":true}
Gated Attention
Attention gating with quant gate and smear gate components
parameters: {"quant_gate":true,"window":12}
Quantization
GPTQ
bits: 7
scope: embeddings
GPTQ
bits: 4
scope: LQER factors
Compression
lrzip
level: null
Optimizer
Muon
weight_decay: 0.5
momentum: 0.9
other_params: {"global_ttt_momentum":0.9,"warmup_steps":20,"muon_backend_steps":5}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
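For orientation, the warmdown entry above can be read as a learning-rate multiplier that stays at 1.0 and then decays over the final fraction of training. The sketch below is an assumption, not the PR's code: it treats `warmdown_frac` as the share of total steps spent decaying, and it assumes a linear decay to zero. The function name `lr_multiplier` and the `total_steps` argument are hypothetical.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Hypothetical linear warmdown: hold 1.0, then decay to 0 over the last fraction."""
    # Number of steps spent decaying (assumed meaning of warmdown_frac).
    warmdown_steps = round(total_steps * warmdown_frac)
    if warmdown_steps == 0:
        return 1.0  # no warmdown configured
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0  # constant phase before the warmdown begins
    # Linear ramp from 1.0 at warmdown_start down to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / warmdown_steps)
```

Under these assumptions, a 1000-step run with `warmdown_frac=0.85` would hold the base learning rate for the first 150 steps and decay over the remaining 850.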

Novel Contributions

  • Adaptive loss-gated weighting for legal score-first phased TTT
  • Each validation chunk is scored before any adaptation on that chunk
  • Already-scored per-document losses are used to weight the same chunk's subsequent LoRA adaptation objective
  • Non-record 4xH100 reproduction of an above-baseline TTT variant built on PR #1855
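
The ordering described in the list above (score a chunk, record its losses, then reuse those losses to weight adaptation on the same chunk) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the PR's implementation. The gate used here (a softmax over per-document losses, rescaled to mean 1, so higher-loss documents get more weight) is a hypothetical choice because the summary does not specify the weighting function. `ToyModel` is a scalar stand-in for the LoRA-adapted transformer, and all names are illustrative.

```python
import numpy as np

def loss_gate_weights(doc_losses, temperature=1.0):
    """Hypothetical gate: softmax over already-scored losses, rescaled to mean 1."""
    losses = np.asarray(doc_losses, dtype=np.float64)
    z = (losses - losses.max()) / temperature  # shift for numerical stability
    w = np.exp(z)
    return w * (len(w) / w.sum())

def score_first_ttt(chunks, score_fn, adapt_fn):
    """Score each chunk before adapting on it; return the recorded losses."""
    recorded = []
    for docs in chunks:
        # 1. Score with the current weights; these values are final.
        losses = [score_fn(d) for d in docs]
        recorded.extend(losses)
        # 2. Derive per-document weights from the losses just recorded.
        weights = loss_gate_weights(losses)
        # 3. Adapt on the same chunk; only later chunks benefit.
        adapt_fn(docs, weights)
    return recorded

class ToyModel:
    """Scalar stand-in for the adapted model: predicts a single value theta."""
    def __init__(self):
        self.theta = 0.0

    def score(self, doc):
        return (doc - self.theta) ** 2  # squared error as the per-document loss

    def adapt(self, docs, weights, lr=0.1):
        docs = np.asarray(docs, dtype=np.float64)
        # Gradient of the weighted mean squared error with respect to theta.
        grad = np.mean(weights * 2.0 * (self.theta - docs))
        self.theta -= lr * grad

model = ToyModel()
recorded = score_first_ttt([[1.0, 3.0], [2.0]], model.score, model.adapt)
```

Because every loss is recorded before the update that uses it, no document's reported score depends on adaptation to that same document, which is the property the "score-first" label refers to.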