PR #1636

open

Non-record: TWEO early-cosine outlier regularization on SP1024 baseline

by PapaFranku4647
val_bpb
1.2299
Architecture
Transformer
Optimizer
Artifact Size
15,890,375 bytes

Training Techniques

Regularization
TWEO
parameters: {"tau":5,"p":4,"lambda_start":0.0002,"lambda_final":0,"decay_steps":3000,"schedule":"cosine"}
LR Schedule
cosine decay
parameters: {"decay_steps":3000}
Quantization
int8
bits: 8
scope: artifact roundtrip
Compression
zlib
level: null
Sequence Length
sequence_length
train_length: 1024
eval_length: null
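
The TWEO parameters listed above can be read as a penalty weight that decays along a cosine from lambda_start to lambda_final over decay_steps, applied to a power-law penalty on activation magnitude. The sketch below is a minimal illustration under stated assumptions, not the PR's code: the penalty form mean((|a| / tau)^p), the exact cosine formula, and the names tweo_lambda, tweo_penalty, and regularized_loss are all hypothetical. Only the numeric values come from the parameter line.

```python
import math
from typing import Sequence

# Values copied from the TWEO parameter line above.
TAU = 5.0
P = 4
LAMBDA_START = 2e-4
LAMBDA_FINAL = 0.0
DECAY_STEPS = 3000


def tweo_lambda(step: int) -> float:
    """Cosine-decayed penalty weight (assumed schedule form).

    Starts at LAMBDA_START, reaches LAMBDA_FINAL at DECAY_STEPS,
    and stays there for all later steps.
    """
    progress = min(max(step, 0), DECAY_STEPS) / DECAY_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return LAMBDA_FINAL + (LAMBDA_START - LAMBDA_FINAL) * cosine


def tweo_penalty(activations: Sequence[float]) -> float:
    """Assumed penalty: mean of (|a| / tau)^p over the activations.

    With p = 4, values well below tau contribute almost nothing,
    while values above tau are penalized steeply.
    """
    return sum((abs(a) / TAU) ** P for a in activations) / len(activations)


def regularized_loss(task_loss: float,
                     activations: Sequence[float],
                     step: int) -> float:
    """Task loss plus the scheduled penalty (hypothetical combination)."""
    return task_loss + tweo_lambda(step) * tweo_penalty(activations)
```

Under this reading, the regularizer is active only during the first 3000 steps and contributes nothing afterwards, which matches the "early pulse" description in the contributions below.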

Novel Contributions

  • Applied a lightweight TWEO-style train-time activation outlier regularizer to the SP1024 baseline
  • Found that a small, early, cosine-decayed TWEO pulse improved final int8+zlib BPB on matched 4-hour seed pairs
  • Showed that fixed TWEO and nonzero-tail TWEO variants hurt BPB in this setup
  • Reported directional confirmation on a 1×H100 80-minute matched seed pair
  • Observed strong suppression of post-block activation outliers with only a small BPB gain
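
The "int8+zlib" figures above refer to a quantize, compress, and restore roundtrip of the model artifact. A minimal sketch of such a roundtrip follows, under assumptions that the PR summary does not confirm: symmetric per-tensor absmax scaling, a single flat list of weights, and zlib's default level standing in for the unspecified "level: null". The helper names quantize_int8, pack_artifact, and unpack_artifact are hypothetical.

```python
import struct
import zlib


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (assumed scheme).

    Returns the integer codes in [-127, 127] and the scale that maps
    them back to floats.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def pack_artifact(weights, level=-1):
    """Quantize, serialize, and zlib-compress; -1 is zlib's default level."""
    codes, scale = quantize_int8(weights)
    # 8-byte little-endian scale followed by one signed byte per weight.
    payload = struct.pack("<d", scale) + struct.pack(f"<{len(codes)}b", *codes)
    return zlib.compress(payload, level)


def unpack_artifact(blob):
    """Invert pack_artifact and return dequantized float weights."""
    payload = zlib.decompress(blob)
    (scale,) = struct.unpack_from("<d", payload, 0)
    codes = struct.unpack_from(f"<{len(payload) - 8}b", payload, 8)
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    blob = pack_artifact(weights)
    print(len(blob), unpack_artifact(blob))
```

With absmax scaling, a single large value sets the quantization step for the whole tensor, so every restored value can be off by up to half of that step. The artifact size reported in the header would correspond to len(blob) summed over a real checkpoint, and val_bpb would then be measured on the restored weights.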