PR #113
closed
Record: FP16 Embed + Sliding Window Eval + Warmdown Tuning (pending eval)
by JoeProAI
val_bpb
1.1870
Architecture
Transformer
Optimizer
—
Artifact Size
—
Training Techniques
Quantization
int8
bits: 8
scope: all weights except tok_emb.weight, which passes through in fp16
Architecture
tied embeddings
Keeps the embedding and output head tied; the embedding tensor is preserved in fp16 during quantization because it is especially sensitive to precision loss (see the sketch below).
parameters: null
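A minimal PyTorch sketch of the two ideas together, assuming a tied embedding parameter reachable as `tok_emb.weight` and a symmetric per-tensor int8 scheme; the helper names `quantize_int8` and `quantize_state_dict` are illustrative, not taken from the submission.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization; returns (int8 tensor, fp scale)."""
    scale = w.abs().max().clamp_min(1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def quantize_state_dict(model: torch.nn.Module) -> dict:
    """int8-quantize every tensor except the token embedding, which passes
    through in fp16 because it is especially sensitive to quantization."""
    out = {}
    for name, tensor in model.state_dict().items():
        if name.endswith("tok_emb.weight"):
            # fp16 passthrough; with tied embeddings the output head shares
            # this tensor, so it is protected as well.
            out[name] = tensor.to(torch.float16)
        else:
            out[name] = quantize_int8(tensor)
    return out
```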
Evaluation
sliding window eval
parameters: {"stride":64}
long context eval
parameters: {"context_length":960}
LR Schedule
warmdown
parameters: {"warmdown_steps":3600}
Other
other
Learning-rate tuning with MATRIX_LR=0.06 to improve convergence under the wallclock cap.
parameters: {"matrix_lr":0.06}
Novel Contributions
- FP16 embedding passthrough during int8 quantization to reduce post-quantization BPB degradation
- Sliding window evaluation with stride 64 so validation tokens are scored with much longer effective context than non-overlapping chunks provide
- Warmdown and learning-rate tuning for better convergence within the 10-minute wallclock limit
- Combined submission integrating multiple previously proven improvements