PR #113

closed

Record: FP16 Embed + Sliding Window Eval + Warmdown Tuning (pending eval)

by JoeProAI on GitHub
val_bpb: 1.1870
Architecture: Transformer
Optimizer:
Artifact Size:

Training Techniques

Quantization
int8
bits: 8
scope: all weights, except tok_emb.weight, which passes through in fp16
Architecture
tied embeddings
Keeps the embedding and output head tied; the embedding tensor is preserved in fp16 during quantization because it is especially sensitive to precision loss.
parameters: null
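A minimal sketch of the passthrough idea: quantize every tensor to int8 with a per-tensor symmetric scale, but store any tensor on a passthrough list (here `tok_emb.weight`) in fp16 untouched. The function name, scheme, and dict layout are assumptions for illustration, not the PR's actual implementation.

```python
import numpy as np

def quantize_state_dict(state_dict, passthrough=("tok_emb.weight",)):
    """Symmetric per-tensor int8 quantization with fp16 passthrough.

    Hypothetical sketch: tensors named in `passthrough` skip quantization
    and are stored as fp16; everything else becomes (int8 values, scale).
    """
    out = {}
    for name, w in state_dict.items():
        if name in passthrough:
            # The embedding is especially sensitive, so keep it in fp16.
            out[name] = w.astype(np.float16)
            continue
        scale = np.abs(w).max() / 127.0 if w.size else 1.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        out[name] = (q, np.float32(scale))  # int8 payload plus dequant scale
    return out
```

Dequantization is then just `q.astype(np.float32) * scale` for quantized tensors, while the embedding is read back directly from fp16.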
Evaluation
sliding window eval
parameters: {"stride":64}
long context eval
parameters: {"context_length":960}
LR Schedule
warmdown
parameters: {"warmdown_steps":3600}
Other
other
Learning-rate tuning with MATRIX_LR=0.06 to improve convergence under the wallclock cap.
parameters: {"matrix_lr":0.06}
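One common form of warmdown, sketched below under assumptions: hold the learning rate at its base value (mirroring `MATRIX_LR=0.06`), then decay linearly to zero over the final `warmdown_steps=3600` steps. The exact schedule shape the PR uses is not stated, so this is illustrative only.

```python
def lr_at(step, total_steps, base_lr=0.06, warmdown_steps=3600):
    """Constant learning rate followed by a linear warmdown to zero.

    Sketch: base_lr echoes the MATRIX_LR=0.06 setting; the constant-then-
    linear shape is an assumption, not the PR's confirmed schedule.
    """
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    # Linear decay over the last warmdown_steps: frac goes 1 -> 0.
    frac = (total_steps - step) / warmdown_steps
    return base_lr * frac
```

Ending training at exactly zero learning rate tends to matter under a hard wallclock cap, since there is no budget for extra fine-tuning steps after the schedule ends.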

Novel Contributions

  • FP16 embedding passthrough during int8 quantization to reduce post-quantization BPB degradation
  • Sliding window evaluation with stride 64, so each validation token is scored with much longer left context
  • Warmdown and learning-rate tuning for better convergence within the 10-minute wallclock limit
  • Combined submission integrating multiple previously proven improvements