PR #151

closed

Non-record: FP16 embed + WD20k + seq2048 + doc-isolated sliding window (val_bpb=1.2045)

by mrdavtan
val_bpb
1.2045
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,912,648 bytes
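For reference, a bits-per-byte score like val_bpb is typically derived from the mean cross-entropy loss; a minimal sketch, assuming the loss is reported in nats per token and the tokenizer's mean bytes-per-token ratio is known (both assumptions, neither is stated in this summary):

```python
import math

def bits_per_byte(nats_per_token: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte:
    divide by ln(2) to get bits per token, then by bytes per token."""
    return nats_per_token / math.log(2) / bytes_per_token
```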

Training Techniques

Quantization
fp16
bits: 16
scope: embeddings
Architecture
tied embeddings
Uses tied input/output embeddings with FP16 export for the embedding path.
parameters: null
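What FP16 export of the tied embedding matrix implies numerically can be sketched with Python's half-precision struct format; the weight values below are illustrative, not taken from the run:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 binary16, mimicking the
    precision loss of storing embedding weights in FP16."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Illustrative values, not actual embedding weights from the run.
weights = [1.0, 0.1, 3.14159265]
exported = [to_fp16(w) for w in weights]
```

Powers of two survive exactly, while most other values pick up a small rounding error (roughly 3 decimal digits of precision), which is why embedding-only FP16 export is a cheap way to shrink the artifact.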
LR Schedule
warmdown
parameters: {"warmdown_steps":20000}
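A plausible reading of warmdown_steps=20000 is a constant LR followed by a linear decay to zero over the final 20,000 iterations; a sketch under that assumption (the exact schedule shape is not specified in this summary):

```python
def warmdown_scale(step: int, total_steps: int,
                   warmdown_steps: int = 20000) -> float:
    """LR multiplier: 1.0 until the final `warmdown_steps` iterations,
    then linear decay reaching 0.0 at `total_steps`."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```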
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"backend_steps":5}
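The Muon settings above (momentum 0.99; backend_steps 5, presumably Newton-Schulz iterations) suggest a momentum buffer followed by an orthogonalized update. A sketch of just the momentum/Nesterov bookkeeping, with the orthogonalization step omitted; this is a generic reconstruction, not the PR's code:

```python
def muon_momentum_step(buf, grad, momentum=0.99):
    """Momentum accumulation with a Nesterov-style lookahead, as
    commonly used in Muon-style optimizers. The subsequent
    orthogonalization of `update` (e.g. 5 Newton-Schulz backend
    steps) is omitted from this sketch."""
    buf = [momentum * b + g for b, g in zip(buf, grad)]
    update = [g + momentum * b for g, b in zip(grad, buf)]
    return buf, update
```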
Evaluation
sliding window eval
parameters: {"stride":64}
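Sliding-window evaluation with stride 64 typically means each 2048-token window advances by 64 tokens and only the newly exposed tokens are scored, with the rest serving as context. A sketch of the window bookkeeping (assumed convention, not taken from the PR's code):

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Plan eval windows as (start, end, score_from) triples.
    Tokens in [score_from, end) are newly scored; tokens in
    [start, score_from) serve only as context. Every token is
    scored exactly once."""
    if n_tokens <= window:
        return [(0, n_tokens, 0)]
    spans = [(0, window, 0)]
    scored = window
    while scored < n_tokens:
        end = min(scored + stride, n_tokens)
        spans.append((end - window, end, scored))
        scored = end
    return spans
```

A small stride (64 vs. the 2048 window) means nearly full left context for every scored token, at the cost of many more forward passes.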
Test-Time Training
doc-isolated eval
parameters: null
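Doc-isolated evaluation presumably prevents tokens from attending across document boundaries. One common realization is a block-diagonal causal mask keyed on per-token document ids; a sketch of that idea (not the PR's implementation):

```python
def doc_isolated_mask(doc_ids):
    """mask[q][k] is True iff query position q may attend to key k:
    causal (k <= q) AND same document id, so context never bleeds
    across document boundaries within a packed sequence."""
    n = len(doc_ids)
    return [[k <= q and doc_ids[k] == doc_ids[q] for k in range(n)]
            for q in range(n)]
```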
Regularization
gradient clipping
parameters: {"norm":1}
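Clipping to global norm 1 rescales the whole gradient vector whenever its L2 norm exceeds 1, and is a no-op otherwise; in pure Python (deep-learning frameworks provide this as a clip-by-global-norm utility):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """If the global L2 norm of `grads` exceeds `max_norm`, rescale
    every component so the norm equals `max_norm`; otherwise return
    the gradients unchanged."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]
```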
Other
other
Uses a longer training context and doc-isolated scoring to reduce cross-document context bleed.
parameters: {"train_batch_tokens":524288,"eval_batch_seqs":32}
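The batch parameters above imply 524,288 / 2,048 = 256 sequences per training batch and, assuming eval sequences also use the 2,048-token length, 32 × 2,048 = 65,536 tokens per eval batch:

```python
train_batch_tokens = 524288   # from the "Other" parameters above
train_length = 2048           # training sequence length
eval_batch_seqs = 32          # eval batch size in sequences

seqs_per_train_batch = train_batch_tokens // train_length
# Assumes eval sequences share the 2048-token training length.
eval_batch_tokens = eval_batch_seqs * train_length
```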

Novel Contributions

  • FP16 tied embedding export
  • Aggressive warmdown with WARMDOWN_ITERS=20000
  • Training with sequence length 2048
  • Tuned learning rates and Muon optimizer settings (momentum 0.99, 5 backend steps)
  • Sliding window evaluation with stride 64
  • Doc-isolated scoring