PR #95

open

PROTEUS EMA — val_bpb: 1.1836 (3-seed mean, Notable Non-Record)

by MatoTeziTanka
val_bpb: 1.1836
Architecture: Transformer
Optimizer:
Artifact Size: 15.88 MB

Training Techniques

Weight Averaging: EMA
parameters: {"decay":0.999,"dtype":"fp32","every_n_steps":10}
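The EMA settings above (decay 0.999, fp32 shadow weights, update every 10 steps) can be sketched framework-agnostically. The `EMA` class and the parameter-dict interface below are illustrative, not the PR's actual code:

```python
class EMA:
    """Exponential moving average of model weights.

    Sketch matching the listed parameters: decay=0.999, shadow
    copies held in fp32, updated every 10 optimizer steps.
    """

    def __init__(self, params, decay=0.999, every_n_steps=10):
        self.decay = decay
        self.every_n_steps = every_n_steps
        # Shadow copies stay in full precision regardless of model dtype.
        self.shadow = {name: float(value) for name, value in params.items()}

    def update(self, params, step):
        # Only fold in the live weights every `every_n_steps` steps.
        if step % self.every_n_steps != 0:
            return
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * float(value)
```

At evaluation/export time, the shadow weights are copied back into the model in place of the raw (noisier) final weights.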
Sequence Length
train_length: 2048
eval_length: 2048
Architecture: tied embeddings
Uses tied token embedding and output-head weights; the embedding is kept at FP16 for precision.
parameters: null
Quantization: fp16
bits: 16, scope: embeddings
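The FP16 embedding scope pairs with the INT8 export listed under Novel Contributions: all tensors are quantized to INT8 except the tied embedding, which passes through at FP16 (and, being tied, is stored once for both the input embedding and the output head). A minimal NumPy sketch; the tensor names and the symmetric per-tensor scheme are assumptions, not the PR's exporter:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor INT8: store int8 values plus one float scale.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def export_artifact(state, embed_key="embed.weight"):
    """Quantize every tensor to INT8 except the tied embedding,
    which is exported at FP16 (names are illustrative)."""
    out = {}
    for name, w in state.items():
        if name == embed_key:
            out[name] = w.astype(np.float16)  # FP16 passthrough
        else:
            out[name] = quantize_int8(w)
    return out
```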
Evaluation: sliding window eval
parameters: {"stride":64}
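Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time and keeps the loss only for the newly added positions, so nearly every token is scored with close to the full 2048-token left context. A hypothetical sketch, assuming a `per_token_loss` callable that returns one loss per input token (not part of the PR):

```python
def sliding_window_eval(per_token_loss, tokens, window=2048, stride=64):
    """Mean per-token loss under sliding-window evaluation.

    Each call scores a window of up to `window` tokens, but only the
    losses for the final `stride` (new) positions are retained.
    """
    losses = []
    pos = 0
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        ctx_start = max(0, end - window)       # long left context
        chunk_losses = per_token_loss(tokens[ctx_start:end])
        losses.extend(chunk_losses[-(end - pos):])  # keep new tokens only
        pos = end
    return sum(losses) / len(losses)
```

The trade-off is cost: a stride of 64 re-scores overlapping context roughly `window / stride` times, which buys the improved validation score cited above.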
LR Schedule: warmdown
parameters: {"warmdown_iters":3600}
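A warmdown schedule holds the base learning rate and then decays it to zero over the final `warmdown_iters` steps. The linear decay shape below is an assumption; the PR records only `warmdown_iters=3600`:

```python
def warmdown_lr(step, total_iters, base_lr, warmdown_iters=3600):
    """Hold base_lr, then decay linearly to zero over the last
    `warmdown_iters` steps (linear shape assumed)."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```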
Other: hyperparameter tuning of matrix, scalar, and tied-embedding learning rates.
parameters: {"matrix_lr":0.06,"scalar_lr":0.06,"tied_embed_lr":0.04}

Novel Contributions

  • EMA weight averaging to reduce INT8 quantization loss
  • Longer training/evaluation sequence length (2048)
  • FP16 passthrough for tied embeddings while quantizing the rest of the model to INT8
  • Sliding-window evaluation with stride 64 for improved validation score
  • Documented negative results for INT4 post-training quantization and shared-weight depth recurrence (LoopFormer)