PR #95
openPROTEUS EMA — val_bpb: 1.1836 (3-seed mean, Notable Non-Record)
by MatoTeziTanka
val_bpb: 1.1836
Architecture: Transformer
Optimizer: —
Artifact Size: 15.88 MB
Training Techniques
- Weight Averaging: EMA
  - parameters: {"decay":0.999,"dtype":"fp32","every_n_steps":10}
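A minimal sketch of the EMA update these parameters describe: an fp32 shadow copy of the weights (dtype "fp32"), blended with decay 0.999 every 10 optimizer steps. The class name and the NumPy tensor representation are illustrative, not from the PR.

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights.

    Shadow copies are held in fp32 (matching dtype="fp32") and
    refreshed only every `every_n_steps` optimizer steps.
    """
    def __init__(self, params, decay=0.999, every_n_steps=10):
        self.decay = decay
        self.every_n_steps = every_n_steps
        self.step = 0
        # fp32 shadow copy of each parameter tensor
        self.shadow = [p.astype(np.float32).copy() for p in params]

    def update(self, params):
        self.step += 1
        if self.step % self.every_n_steps != 0:
            return  # skip: only average every n-th step
        for s, p in zip(self.shadow, params):
            # s <- decay * s + (1 - decay) * p
            s *= self.decay
            s += (1.0 - self.decay) * p.astype(np.float32)
```

At evaluation time the shadow weights are swapped in place of the live weights; keeping the shadow in fp32 avoids compounding rounding error across many small updates.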
Sequence Length
- train_length: 2048
- eval_length: 2048
Architecture
- tied embeddings: token embedding and output head share weights; the embedding is kept in FP16 for precision.
Quantization
- fp16 (bits: 16, scope: embeddings)
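The quantization scheme pairs this FP16 embedding passthrough with INT8 for the rest of the model (per the contributions list below). A minimal sketch, assuming symmetric per-tensor INT8 quantization and a name-based test for the embedding exception; both assumptions are illustrative, not confirmed by the PR.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: returns (q, scale)."""
    absmax = np.abs(w).max()
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_model(named_weights, passthrough=("embed",)):
    """Quantize every tensor to INT8 except those whose name matches
    `passthrough`, which are stored in FP16 (scope: embeddings)."""
    out = {}
    for name, w in named_weights.items():
        if any(tag in name for tag in passthrough):
            out[name] = ("fp16", w.astype(np.float16))
        else:
            out[name] = ("int8", quantize_int8(w))
    return out
```

Keeping the tied embedding in FP16 matters more than for other tensors because the same matrix also serves as the output head, so its quantization error hits the logits directly.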
Evaluation
- sliding window eval
  - parameters: {"stride":64}
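A sketch of strided sliding-window evaluation with this stride: the fixed-size window advances 64 tokens at a time and only the tokens not covered by an earlier window are scored, so almost every token is predicted with near-full left context (which is why it improves val_bpb over disjoint chunks). `logprob_fn` is a hypothetical stand-in for the model, returning per-token log-probabilities for `ctx[first_new:]`; stride ≤ window is assumed.

```python
def sliding_window_nll(logprob_fn, tokens, window=2048, stride=64):
    """Average negative log-likelihood per scored token under a
    strided sliding-window evaluation."""
    total_nll, n_scored = 0.0, 0
    prev_end = 0  # tokens[:prev_end] have already been scored
    start = 0
    while prev_end < len(tokens):
        end = min(start + window, len(tokens))
        ctx = tokens[start:end]
        # Score only tokens not covered by an earlier window.
        # Token 0 has no left context, so it is never scored.
        first_new = max(prev_end - start, 1)
        lps = logprob_fn(ctx, first_new)
        total_nll += -sum(lps)
        n_scored += len(lps)
        prev_end = end
        start += stride
    return total_nll / max(n_scored, 1)
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes; stride 64 against a 2048 window is a fairly aggressive (accurate but slow) setting.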
LR Schedule
- warmdown
  - parameters: {"warmdown_iters":3600}
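A sketch of the warmdown schedule: hold the base learning rate, then decay to zero over the final 3600 iterations. The linear decay shape and the `total_iters` parameter are assumptions (the PR only specifies `warmdown_iters`); the base rate shown is the tuned matrix LR from the section below.

```python
def warmdown_lr(step, total_iters, warmdown_iters=3600, base_lr=0.06):
    """Constant LR, then a linear 'warmdown' to 0 over the
    final `warmdown_iters` steps. Linear shape is an assumption."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr  # constant phase
    # fraction of the warmdown remaining, 1.0 -> 0.0
    frac = (total_iters - step) / warmdown_iters
    return base_lr * frac
```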
Other
- Hyperparameter tuning of matrix, scalar, and tied-embedding learning rates.
  - parameters: {"matrix_lr":0.06,"scalar_lr":0.06,"tied_embed_lr":0.04}
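Separate learning rates per parameter class imply the optimizer is built from parameter groups. A minimal sketch of how the three tuned rates could be assigned; the name-based embedding test and the shape-based matrix/scalar split are assumptions for illustration.

```python
import numpy as np

def build_param_groups(named_params, matrix_lr=0.06, scalar_lr=0.06,
                       tied_embed_lr=0.04):
    """Split parameters into three LR groups mirroring the tuned
    matrix / scalar / tied-embedding learning rates."""
    groups = {"matrix":     {"lr": matrix_lr,     "params": []},
              "scalar":     {"lr": scalar_lr,     "params": []},
              "tied_embed": {"lr": tied_embed_lr, "params": []}}
    for name, p in named_params.items():
        if "embed" in name:                # tied embedding / output head
            groups["tied_embed"]["params"].append(name)
        elif getattr(p, "ndim", 0) >= 2:   # 2-D+ weight matrices
            groups["matrix"]["params"].append(name)
        else:                              # biases, gains, other scalars
            groups["scalar"]["params"].append(name)
    return groups
```

Note the tied embedding gets a lower rate (0.04) than the matrices (0.06), consistent with it being shared between input and output and therefore receiving gradient from both ends.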
Novel Contributions
- EMA weight averaging to reduce INT8 quantization loss
- Longer training/evaluation sequence length (2048)
- FP16 passthrough for tied embeddings while quantizing the rest of the model to INT8
- Sliding-window evaluation with stride 64 for improved validation score
- Documented negative results for INT4 post-training quantization and shared-weight depth recurrence (LoopFormer)