PR #1372

open

V20: Cascaded 2-Phase L-BFGS Causal SLOT (1.00497 BPB, 3-seed)

by Bortlesboat
val_bpb
1.0050
Architecture
Transformer
Optimizer
L-BFGS
Artifact Size
15,854,022 bytes

Training Techniques

Architecture
BigramHash
Bigram hash embedding component in the backbone stack.
parameters: {"dimensions":3072,"window":112}
Quantization
GPTQ
bits: 6
scope: full Hessian model
Compression
lzma
level: null
brotli
level: null
Test-Time Training
L-BFGS Causal SLOT
parameters: {"history_size":20,"causal_mask":true}
Cascaded 2-Phase L-BFGS
parameters: {"phase1_iters":5,"phase1_history":10,"phase2_iters":18,"phase2_history":20,"history_reset_between_phases":true}
Discriminative per-block pre-quant TTT
parameters: {"graduated_lr":"0.3x->1.0x","layer_groups":10}
Sequence Length
sequence_length
train_length: 128
eval_length: null
Other
other
Causal optimization mask restricted to already-scored tokens.
parameters: {"opt_mask_range":"[focal_start, s)"}
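The `opt_mask_range` of `[focal_start, s)` can be read as follows: before scoring token `s`, the test-time loss is taken only over positions that have already been scored, so no not-yet-scored token leaks into the adaptation. A minimal sketch of that mask, assuming this reading (the function name and exact semantics are mine, not code from the PR):

```python
import numpy as np

def causal_opt_mask(seq_len, focal_start, s):
    """Boolean mask over positions [focal_start, s): tokens already scored
    and therefore safe to optimize on before scoring token s."""
    pos = np.arange(seq_len)
    return (pos >= focal_start) & (pos < s)

# Example: with focal_start=2 and the next token to score at s=5,
# only positions 2, 3, 4 contribute to the test-time loss.
mask = causal_opt_mask(seq_len=8, focal_start=2, s=5)
```

Advancing `s` by one after each scored token keeps the mask strictly causal with respect to the evaluation order.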

Novel Contributions

  • Cascaded 2-phase L-BFGS with a coarse phase followed by a refinement phase
  • Resetting L-BFGS history between phases while warm-starting the delta tensor
  • Causal SLOT evaluation that only optimizes on already-scored tokens
  • Combining L-BFGS Causal SLOT with discriminative per-block pre-quant TTT
  • Reported faster eval with lower L-BFGS work than the single-phase baseline
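The cascade above can be sketched in miniature: two L-BFGS phases where the curvature history is discarded between phases but the optimized delta tensor carries over. This is a self-contained toy, assuming a standard two-loop-recursion L-BFGS with unit steps and a simple quadratic standing in for the SLOT test-time loss; the phase sizes mirror the PR's parameters (5 iters / history 10, then 18 iters / history 20), but nothing else here is the PR's actual code:

```python
import numpy as np

def lbfgs_direction(g, s_hist, y_hist):
    """Two-loop recursion: approximate -H^{-1} g from stored (s, y) pairs."""
    q = g.copy()
    alphas = []  # newest pair first
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    if s_hist:  # scale by gamma from the most recent pair
        s, y = s_hist[-1], y_hist[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return -q

def run_phase(x, loss_grad, iters, history_size):
    """One L-BFGS phase; the curvature history starts empty (history reset),
    while x itself may warm-start from a previous phase."""
    s_hist, y_hist = [], []
    g_prev = None
    for _ in range(iters):
        _, g = loss_grad(x)
        if g_prev is not None:
            s, y = x - x_prev, g - g_prev
            if y @ s > 1e-10:  # keep only curvature-positive pairs
                s_hist.append(s); y_hist.append(y)
                if len(s_hist) > history_size:
                    s_hist.pop(0); y_hist.pop(0)
        x_prev, g_prev = x.copy(), g.copy()
        x = x + lbfgs_direction(g, s_hist, y_hist)
    return x

# Hypothetical stand-in objective: a well-conditioned quadratic playing the
# role of the test-time loss over a per-sequence delta tensor.
rng = np.random.default_rng(0)
n = 20
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(0.1, 1.0, n)) @ Q.T  # SPD, eigenvalues in (0, 1]
b = rng.standard_normal(n)

def loss_grad(x):
    return 0.5 * x @ A @ x - b @ x, A @ x - b

delta0 = np.zeros(n)
loss_init = loss_grad(delta0)[0]
# Phase 1: coarse, short history.
delta1 = run_phase(delta0, loss_grad, iters=5, history_size=10)
loss_p1 = loss_grad(delta1)[0]
# Phase 2: history reset inside run_phase; delta1 warm-starts the refinement.
delta2 = run_phase(delta1, loss_grad, iters=18, history_size=20)
loss_final = loss_grad(delta2)[0]
```

The point of the reset-plus-warm-start pattern is that phase 2 inherits a good starting point without inheriting stale curvature estimates from the coarse phase.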