PR #95

open

PROTEUS EMA — val_bpb: 1.1836 (3-seed mean, Notable Non-Record)

by MatoTeziTanka
val_bpb: 1.1836
Architecture: Transformer
Optimizer:
Artifact Size: 15.88 MB

Training Techniques

Weight Averaging: EMA
parameters: {"decay":0.999,"dtype":"fp32","every_n_steps":10}
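The EMA settings above (decay 0.999, fp32 shadow weights, update every 10 steps) can be sketched framework-agnostically. The `EMA` class and the parameter-dict interface below are illustrative, not the PR's actual code:

```python
class EMA:
    """Exponential moving average of model weights.

    Sketch matching the listed parameters: decay=0.999, shadow
    copies held in fp32, updated every 10 optimizer steps.
    """

    def __init__(self, params, decay=0.999, every_n_steps=10):
        self.decay = decay
        self.every_n_steps = every_n_steps
        # Shadow copies stay in full precision regardless of model dtype.
        self.shadow = {name: float(value) for name, value in params.items()}

    def update(self, params, step):
        # Only fold in the live weights every `every_n_steps` steps.
        if step % self.every_n_steps != 0:
            return
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * float(value)
```

At evaluation/export time, the shadow weights are copied back into the model in place of the raw (noisier) final weights.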
Sequence Length
train_length: 2048
eval_length: 2048
Architecture: tied embeddings
Uses tied token embedding and output-head weights; the embedding is kept at FP16 for precision.
parameters: null
Quantization: fp16
bits: 16, scope: embeddings
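The FP16 embedding scope pairs with the INT8 export listed under Novel Contributions: all tensors are quantized to INT8 except the tied embedding, which passes through at FP16 (and, being tied, is stored once for both the input embedding and the output head). A minimal NumPy sketch; the tensor names and the symmetric per-tensor scheme are assumptions, not the PR's exporter:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor INT8: store int8 values plus one float scale.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def export_artifact(state, embed_key="embed.weight"):
    """Quantize every tensor to INT8 except the tied embedding,
    which is exported at FP16 (names are illustrative)."""
    out = {}
    for name, w in state.items():
        if name == embed_key:
            out[name] = w.astype(np.float16)  # FP16 passthrough
        else:
            out[name] = quantize_int8(w)
    return out
```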
Evaluation: sliding window eval
parameters: {"stride":64}
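Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time and keeps the loss only for the newly added positions, so nearly every token is scored with close to the full 2048-token left context. A hypothetical sketch, assuming a `per_token_loss` callable that returns one loss per input token (not part of the PR):

```python
def sliding_window_eval(per_token_loss, tokens, window=2048, stride=64):
    """Mean per-token loss under sliding-window evaluation.

    Each call scores a window of up to `window` tokens, but only the
    losses for the final `stride` (new) positions are retained.
    """
    losses = []
    pos = 0
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        ctx_start = max(0, end - window)       # long left context
        chunk_losses = per_token_loss(tokens[ctx_start:end])
        losses.extend(chunk_losses[-(end - pos):])  # keep new tokens only
        pos = end
    return sum(losses) / len(losses)
```

The trade-off is cost: a stride of 64 re-scores overlapping context roughly `window / stride` times, which buys the improved validation score cited above.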
LR Schedule: warmdown
parameters: {"warmdown_iters":3600}
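A warmdown schedule holds the base learning rate and then decays it to zero over the final `warmdown_iters` steps. The linear decay shape below is an assumption; the PR records only `warmdown_iters=3600`:

```python
def warmdown_lr(step, total_iters, base_lr, warmdown_iters=3600):
    """Hold base_lr, then decay linearly to zero over the last
    `warmdown_iters` steps (linear shape assumed)."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```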
Other: hyperparameter tuning of matrix, scalar, and tied-embedding learning rates.
parameters: {"matrix_lr":0.06,"scalar_lr":0.06,"tied_embed_lr":0.04}

Novel Contributions

  • EMA weight averaging to reduce INT8 quantization loss
  • Longer training/evaluation sequence length (2048)
  • FP16 passthrough for tied embeddings while quantizing the rest of the model to INT8
  • Sliding-window evaluation with stride 64 for improved validation score
  • Documented negative results for INT4 post-training quantization and shared-weight depth recurrence (LoopFormer)