PR #151
closed · Non-record: FP16 embed + WD20k + seq2048 + doc-isolated sliding window (val_bpb=1.2045)
by mrdavtan
val_bpb
1.2045
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,912,648 bytes
Training Techniques
Quantization
fp16
bits: 16
scope: embeddings
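The FP16 export rounds each embedding weight to the nearest IEEE 754 half-precision (binary16) value. A minimal stdlib sketch of that round-trip, using Python's `struct` `'e'` format; the weight values are illustrative:

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value and back to float."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Quantize a toy embedding row; in-range values keep ~3 decimal digits.
row = [0.1, -0.25, 1.0]
row_fp16 = [to_fp16(w) for w in row]
```

Values like 1.0 and -0.25 are exactly representable in binary16, so only non-dyadic weights actually change under the export.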
Architecture
tied embeddings
Uses tied input/output embeddings with FP16 export for the embedding path.
parameters: null
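Weight tying reuses one matrix for both the input lookup and the output projection, so a single table covers the whole embedding path the FP16 export touches. A toy sketch (matrix values and helper names illustrative):

```python
# One shared table: vocab_size=3, d_model=2 (toy values).
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

def embed(token_id):
    """Input path: row lookup in the shared table."""
    return W[token_id]

def logits(hidden):
    """Output path: hidden @ W^T reuses the same rows."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]
```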
LR Schedule
warmdown
parameters: {"warmdown_steps":20000}
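A warmdown with warmdown_steps=20000 typically means the learning rate holds at its base value and then decays linearly to zero over the final 20k steps. A sketch of that multiplier (the linear shape is an assumption; the PR only records the step count):

```python
def lr_multiplier(step: int, total_steps: int, warmdown_steps: int = 20000) -> float:
    """1.0 until the warmdown begins, then linear decay to 0 at total_steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```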
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"backend_steps":5}
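In Muon, backend_steps=5 plausibly refers to the number of Newton-Schulz iterations used to approximately orthogonalize each momentum update. A pure-Python sketch of that step (quintic coefficients from the public Muon implementation; the helper names and 2x2 test matrix are illustrative):

```python
def _matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton-Schulz iteration."""
    # Normalize by the Frobenius norm so the iteration converges.
    fro = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / (fro + 1e-7) for x in row] for row in G]
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code
    for _ in range(steps):
        XT = [list(col) for col in zip(*X)]
        A = _matmul(X, XT)                     # A = X X^T
        A2 = _matmul(A, A)
        B = [[b * x + c * y for x, y in zip(ra, r2)] for ra, r2 in zip(A, A2)]
        X = [[a * x + y for x, y in zip(rx, rbx)]
             for rx, rbx in zip(X, _matmul(B, X))]
    return X
```

The iteration only pulls singular values toward 1 approximately; Muon accepts that looseness in exchange for running in low precision on the GPU.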
Evaluation
sliding window eval
parameters: {"stride":64}
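Sliding-window evaluation with stride 64 scores each token exactly once while giving it up to a full window of left context: the window advances 64 tokens at a time and only the newly exposed tokens contribute to val_bpb. A sketch of the span bookkeeping (window size taken from the seq-2048 setting; helper name illustrative):

```python
def sliding_window_spans(n_tokens: int, window: int = 2048, stride: int = 64):
    """Return (ctx_start, ctx_end, score_start) triples: the model sees
    tokens [ctx_start, ctx_end) but only tokens [score_start, ctx_end)
    are scored, so every token is scored exactly once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With window=2048 and stride=64, every scored token after the first window has at least 1984 tokens of left context, at the cost of roughly window/stride forward passes per position block.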
Test-Time Training
doc-isolated eval
parameters: null
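Doc-isolated evaluation keeps a document's tokens from attending to context from a different document packed into the same sequence. One common mechanism is a block-diagonal causal mask keyed off the BOS token; a sketch (bos_id=0 and the token stream are illustrative, not from the PR):

```python
def doc_isolated_mask(tokens, bos_id=0):
    """allowed[q][k] is True iff query position q may attend to key position k:
    causal (k <= q) and both positions belong to the same document."""
    doc_ids, d = [], 0
    for t in tokens:
        if t == bos_id:  # each BOS starts a new document
            d += 1
        doc_ids.append(d)
    n = len(tokens)
    return [[k <= q and doc_ids[k] == doc_ids[q] for k in range(n)]
            for q in range(n)]
```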
Regularization
gradient clipping
parameters: {"norm":1}
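Clipping with norm 1 rescales the whole gradient vector whenever its global L2 norm exceeds 1.0, the same behavior as `torch.nn.utils.clip_grad_norm_`. A stdlib sketch over a flat gradient list:

```python
def clip_global_norm(grads, max_norm=1.0):
    """If the global L2 norm of grads exceeds max_norm, scale every entry
    down by the same factor; otherwise return the gradients unchanged."""
    total = sum(g * g for g in grads) ** 0.5
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads], total
    return list(grads), total
```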
Other
other
Uses a longer training context and doc-isolated scoring to reduce cross-document context bleed.
parameters: {"train_batch_tokens":524288,"eval_batch_seqs":32}
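For reference, the recorded parameters imply 524288 / 2048 = 256 full-length sequences per training step (assuming the batch is packed entirely with length-2048 sequences):

```python
train_batch_tokens = 524288  # from the PR parameters
seq_len = 2048               # from the sequence_length setting
train_batch_seqs = train_batch_tokens // seq_len  # 256 sequences per step
```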
Novel Contributions
- FP16 tied embedding export
- Aggressive warmdown with WARMDOWN_ITERS=20000
- Training with sequence length 2048
- Tuned learning rates and Muon optimizer settings
- Sliding window evaluation with stride 64
- Doc-isolated scoring