PR #2074

open

Add non-record loss-gated TTT 4xH100 run

by sanilb19

val_bpb

1.0884

Architecture

Transformer

Optimizer

—

Artifact Size

15,916,403 bytes

Training Techniques

Test-Time Training

score-first TTT

parameters: {"rank":80,"phases":3,"prefix_docs":2500}

LoRA TTT

parameters: {"rank":80}

Weight Averaging

EMA

parameters: {"decay":0.02}

Regularization

weight decay

parameters: {"value":0.5}

Architecture

SmearGate

Gated attention mechanism inherited from prior lineage

parameters: {"enabled":true}

Gated Attention

Attention gating with quant gate and smear gate components

parameters: {"quant_gate":true,"window":12}

Quantization

GPTQ

bits: 7

scope: embeddings

GPTQ

bits: 4

scope: LQER factors

Compression

lrzip

level: null

Optimizer

Muon

weight_decay: 0.5

momentum: 0.9

other_params: {"global_ttt_momentum":0.9,"warmup_steps":20,"muon_backend_steps":5}

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
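As a rough illustration of what a warmdown fraction of 0.85 could mean, the sketch below holds the learning-rate multiplier at 1.0 and then decays it over the final 85% of training. The linear-to-zero shape, the function name `warmdown_lr_scale`, and the `total_steps` argument are assumptions made for illustration; the PR summary only states `warmdown_frac`.

```python
def warmdown_lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Return a learning-rate multiplier in [0, 1] for the given step.

    Assumption: the multiplier stays at 1.0 until the warmdown begins, then
    decays linearly to 0.0 at total_steps. The real schedule may differ.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= warmdown_frac <= 1.0:
        raise ValueError("warmdown_frac must be in [0, 1]")

    # First step at which the multiplier starts to decay.
    warmdown_start = int(round(total_steps * (1.0 - warmdown_frac)))
    span = total_steps - warmdown_start

    if step < warmdown_start or span == 0:
        return 1.0

    # Linear decay over the warmdown span, clamped at zero.
    remaining = total_steps - step
    return max(0.0, remaining / span)
```

With `total_steps=1000`, the decay would begin at step 150 and the multiplier would reach half its value at step 575.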

Novel Contributions

Adaptive loss-gated weighting for legal score-first phased TTT
Each validation chunk is scored first before adaptation
Already-scored per-document losses are used to weight the same chunk's subsequent LoRA adaptation objective
Non-record 4xH100 reproduction of an above-baseline TTT variant built on PR #1855
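The score-first ordering and loss-gated weighting described above can be sketched as follows. This is a minimal, framework-free illustration, not the PR's implementation: the softmax weighting, the choice to give higher-loss documents more weight, the `temperature` parameter, and the names `loss_gated_weights`, `process_chunk`, `score_fn`, and `adapt_fn` are all assumptions. The summary only says that already-scored per-document losses weight the same chunk's subsequent LoRA adaptation objective.

```python
import math
from typing import Callable, List, Sequence


def loss_gated_weights(doc_losses: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn already-scored per-document losses into adaptation weights.

    Assumption: a softmax over the losses, rescaled so the weights average
    to 1.0, with higher-loss documents receiving more weight.
    """
    if not doc_losses:
        return []
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(doc_losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in doc_losses]
    total = sum(exps)
    n = len(doc_losses)
    return [n * e / total for e in exps]


def process_chunk(
    chunk: Sequence[str],
    score_fn: Callable[[str], float],
    adapt_fn: Callable[[Sequence[str], Sequence[float]], None],
) -> List[float]:
    """Score a chunk, then adapt on it with loss-gated weights.

    score_fn and adapt_fn are hypothetical callbacks standing in for the
    model's evaluation pass and its LoRA update step.
    """
    # 1. Score first: these losses are the ones that count toward the metric,
    #    and they are computed before any adaptation on this chunk.
    scored = [score_fn(doc) for doc in chunk]
    # 2. Derive weights only from losses that have already been recorded.
    weights = loss_gated_weights(scored)
    # 3. Adapt afterwards; the weights act as constants in the objective.
    adapt_fn(chunk, weights)
    return scored
```

Because the weights are computed only from losses that were recorded before the update, adaptation on a chunk cannot change that chunk's reported score; it can only affect later chunks.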