PR #113

closed

Record: FP16 Embed + Sliding Window Eval + Warmdown Tuning (pending eval)

by JoeProAI on GitHub
val_bpb: 1.1870
Architecture: Transformer
Optimizer:
Artifact Size:

Training Techniques

Quantization
int8
bits: 8
scope: all weights, except tok_emb.weight, which passes through in fp16
Architecture
tied embeddings
Keeps the embedding and output head tied; the embedding tensor is preserved in fp16 during quantization because it is especially sensitive to precision loss.
parameters: null
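A minimal sketch of the passthrough idea: quantize every tensor to int8 with a per-tensor symmetric scale, but store any tensor on a passthrough list (here `tok_emb.weight`) in fp16 untouched. The function name, scheme, and dict layout are assumptions for illustration, not the PR's actual implementation.

```python
import numpy as np

def quantize_state_dict(state_dict, passthrough=("tok_emb.weight",)):
    """Symmetric per-tensor int8 quantization with fp16 passthrough.

    Hypothetical sketch: tensors named in `passthrough` skip quantization
    and are stored as fp16; everything else becomes (int8 values, scale).
    """
    out = {}
    for name, w in state_dict.items():
        if name in passthrough:
            # The embedding is especially sensitive, so keep it in fp16.
            out[name] = w.astype(np.float16)
            continue
        scale = np.abs(w).max() / 127.0 if w.size else 1.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        out[name] = (q, np.float32(scale))  # int8 payload plus dequant scale
    return out
```

Dequantization is then just `q.astype(np.float32) * scale` for quantized tensors, while the embedding is read back directly from fp16.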
Evaluation
sliding window eval
parameters: {"stride":64}
long context eval
parameters: {"context_length":960}
LR Schedule
warmdown
parameters: {"warmdown_steps":3600}
Other
other
Learning-rate tuning with MATRIX_LR=0.06 to improve convergence under the wallclock cap.
parameters: {"matrix_lr":0.06}
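One common form of warmdown, sketched below under assumptions: hold the learning rate at its base value (mirroring `MATRIX_LR=0.06`), then decay linearly to zero over the final `warmdown_steps=3600` steps. The exact schedule shape the PR uses is not stated, so this is illustrative only.

```python
def lr_at(step, total_steps, base_lr=0.06, warmdown_steps=3600):
    """Constant learning rate followed by a linear warmdown to zero.

    Sketch: base_lr echoes the MATRIX_LR=0.06 setting; the constant-then-
    linear shape is an assumption, not the PR's confirmed schedule.
    """
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    # Linear decay over the last warmdown_steps: frac goes 1 -> 0.
    frac = (total_steps - step) / warmdown_steps
    return base_lr * frac
```

Ending training at exactly zero learning rate tends to matter under a hard wallclock cap, since there is no budget for extra fine-tuning steps after the schedule ends.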

Novel Contributions

  • FP16 embedding passthrough during int8 quantization to reduce post-quantization BPB degradation
  • Sliding window evaluation with stride 64, so each validation token is scored with much longer left context
  • Warmdown and learning-rate tuning for better convergence within the 10-minute wallclock limit
  • Combined submission integrating multiple previously proven improvements