PR #151

closed

Non-record: FP16 embed + WD20k + seq2048 + doc-isolated sliding window (val_bpb=1.2045)

by mrdavtan
val_bpb
1.2045
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,912,648 bytes
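For reference, a bits-per-byte score like val_bpb is typically derived from the mean cross-entropy loss; a minimal sketch, assuming the loss is reported in nats per token and the tokenizer's mean bytes-per-token ratio is known (both assumptions, neither is stated in this summary):

```python
import math

def bits_per_byte(nats_per_token: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte:
    divide by ln(2) to get bits per token, then by bytes per token."""
    return nats_per_token / math.log(2) / bytes_per_token
```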

Training Techniques

Quantization
fp16
bits: 16
scope: embeddings
Architecture
tied embeddings
Uses tied input/output embeddings with FP16 export for the embedding path.
parameters: null
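What FP16 export of the tied embedding matrix implies numerically can be sketched with Python's half-precision struct format; the weight values below are illustrative, not taken from the run:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 binary16, mimicking the
    precision loss of storing embedding weights in FP16."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Illustrative values, not actual embedding weights from the run.
weights = [1.0, 0.1, 3.14159265]
exported = [to_fp16(w) for w in weights]
```

Powers of two survive exactly, while most other values pick up a small rounding error (roughly 3 decimal digits of precision), which is why embedding-only FP16 export is a cheap way to shrink the artifact.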
LR Schedule
warmdown
parameters: {"warmdown_steps":20000}
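A plausible reading of warmdown_steps=20000 is a constant LR followed by a linear decay to zero over the final 20,000 iterations; a sketch under that assumption (the exact schedule shape is not specified in this summary):

```python
def warmdown_scale(step: int, total_steps: int,
                   warmdown_steps: int = 20000) -> float:
    """LR multiplier: 1.0 until the final `warmdown_steps` iterations,
    then linear decay reaching 0.0 at `total_steps`."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```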
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"backend_steps":5}
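The Muon settings above (momentum 0.99; backend_steps 5, presumably Newton-Schulz iterations) suggest a momentum buffer followed by an orthogonalized update. A sketch of just the momentum/Nesterov bookkeeping, with the orthogonalization step omitted; this is a generic reconstruction, not the PR's code:

```python
def muon_momentum_step(buf, grad, momentum=0.99):
    """Momentum accumulation with a Nesterov-style lookahead, as
    commonly used in Muon-style optimizers. The subsequent
    orthogonalization of `update` (e.g. 5 Newton-Schulz backend
    steps) is omitted from this sketch."""
    buf = [momentum * b + g for b, g in zip(buf, grad)]
    update = [g + momentum * b for g, b in zip(grad, buf)]
    return buf, update
```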
Evaluation
sliding window eval
parameters: {"stride":64}
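Sliding-window evaluation with stride 64 typically means each 2048-token window advances by 64 tokens and only the newly exposed tokens are scored, with the rest serving as context. A sketch of the window bookkeeping (assumed convention, not taken from the PR's code):

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Plan eval windows as (start, end, score_from) triples.
    Tokens in [score_from, end) are newly scored; tokens in
    [start, score_from) serve only as context. Every token is
    scored exactly once."""
    if n_tokens <= window:
        return [(0, n_tokens, 0)]
    spans = [(0, window, 0)]
    scored = window
    while scored < n_tokens:
        end = min(scored + stride, n_tokens)
        spans.append((end - window, end, scored))
        scored = end
    return spans
```

A small stride (64 vs. the 2048 window) means nearly full left context for every scored token, at the cost of many more forward passes.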
Test-Time Training
doc-isolated eval
parameters: null
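Doc-isolated evaluation presumably prevents tokens from attending across document boundaries. One common realization is a block-diagonal causal mask keyed on per-token document ids; a sketch of that idea (not the PR's implementation):

```python
def doc_isolated_mask(doc_ids):
    """mask[q][k] is True iff query position q may attend to key k:
    causal (k <= q) AND same document id, so context never bleeds
    across document boundaries within a packed sequence."""
    n = len(doc_ids)
    return [[k <= q and doc_ids[k] == doc_ids[q] for k in range(n)]
            for q in range(n)]
```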
Regularization
gradient clipping
parameters: {"norm":1}
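Clipping to global norm 1 rescales the whole gradient vector whenever its L2 norm exceeds 1, and is a no-op otherwise; in pure Python (deep-learning frameworks provide this as a clip-by-global-norm utility):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """If the global L2 norm of `grads` exceeds `max_norm`, rescale
    every component so the norm equals `max_norm`; otherwise return
    the gradients unchanged."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]
```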
Other
other
Uses a longer training context and doc-isolated scoring to reduce cross-document context bleed.
parameters: {"train_batch_tokens":524288,"eval_batch_seqs":32}
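The batch parameters above imply 524,288 / 2,048 = 256 sequences per training batch and, assuming eval sequences also use the 2,048-token length, 32 × 2,048 = 65,536 tokens per eval batch:

```python
train_batch_tokens = 524288   # from the "Other" parameters above
train_length = 2048           # training sequence length
eval_batch_seqs = 32          # eval batch size in sequences

seqs_per_train_batch = train_batch_tokens // train_length
# Assumes eval sequences share the 2048-token training length.
eval_batch_tokens = eval_batch_seqs * train_length
```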

Novel Contributions

  • FP16 tied embedding export
  • Aggressive warmdown with WARMDOWN_ITERS=20000
  • Training with sequence length 2048
  • Tuned learning rates and Muon optimizer settings (momentum 0.99, 5 backend steps)
  • Sliding window evaluation with stride 64
  • Doc-isolated scoring