val_bpb: 1.1779
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,929,105 bytes
Training Techniques
Architecture
tied embeddings
Uses tied input/output embeddings and preserves them in fp16 for better post-quantization fidelity.
parameters: {"tie_embeddings":1}
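A minimal plain-Python sketch of what tying buys here (illustrative only; the real model is a Transformer and its tensors are fp16/int8 arrays): one matrix serves both as the input embedding table and, transposed, as the output projection, so the model's largest tensor exists once — which is why preserving it in fp16 at export pays off.

```python
# One shared matrix W (vocab x d_model) backs both the input embedding
# lookup and the output (unembedding) projection.
VOCAB, D = 4, 3
W = [[0.1 * (i + j) for j in range(D)] for i in range(VOCAB)]  # shared weights

def embed(token_id):
    # input embedding: row lookup in the shared matrix
    return W[token_id]

def unembed(hidden):
    # output projection: logits = hidden @ W^T (same matrix, transposed)
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]

logits = unembed(embed(2))  # one logit per vocabulary entry
```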
KV head count
Uses fewer KV heads than attention heads (grouped-query attention), shrinking the KV cache.
parameters: {"num_heads":8,"num_kv_heads":4}
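With the recorded head counts, the query-to-KV-head mapping of grouped-query attention can be sketched as:

```python
# Consecutive query heads share one cached K/V projection.
NUM_HEADS, NUM_KV_HEADS = 8, 4
GROUP_SIZE = NUM_HEADS // NUM_KV_HEADS  # query heads per shared KV head

def kv_head_for(q_head: int) -> int:
    return q_head // GROUP_SIZE

mapping = [kv_head_for(h) for h in range(NUM_HEADS)]
# mapping == [0, 0, 1, 1, 2, 2, 3, 3]: the KV cache holds half as many heads
```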
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Evaluation
sliding window eval
Final evaluation slides overlapping windows over the text (stride 64) to improve context coverage during scoring.
parameters: {"stride":64,"batch_seqs":256}
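A hedged sketch of a stride-based window schedule. The window length and the convention that each window after the first scores only its trailing tokens are assumptions; the record fixes only stride=64 and batch_seqs=256.

```python
def window_starts(seq_len, window, stride):
    # left edges of the overlapping evaluation windows
    starts = list(range(0, max(seq_len - window, 0) + 1, stride))
    if seq_len > window and starts[-1] != seq_len - window:
        starts.append(seq_len - window)  # ensure the tail is covered
    return starts

# e.g. a 4096-token text evaluated with an assumed 2048-token window
starts = window_starts(seq_len=4096, window=2048, stride=64)
```

A smaller stride means more windows and therefore more compute, traded for each scored token seeing more left context.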
Quantization
int8
bits: 8
scope: model weights with fp16 tied embeddings passthrough
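A hedged sketch of this export path. The symmetric per-tensor scheme is an assumption (the record fixes only bits=8 and the fp16 embedding passthrough), and the tensor name used to route the embedding is hypothetical.

```python
def quantize_int8(weights):
    # symmetric per-tensor quantization: one scale, values in [-127, 127]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

def export_tensor(name, weights):
    # the tied embedding keeps fp16 precision; everything else goes int8
    if name == "tied_embedding":  # hypothetical tensor name
        return ("fp16", weights)
    return ("int8", quantize_int8(weights))
```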
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"muon_backend_steps":5,"tied_embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02}
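The momentum warmup implied by other_params can be sketched as below; the linear ramp shape is an assumption, since the record fixes only the start value (0.92), the final value (0.99), and the step count (1500).

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    # ramp momentum from `start` to `final` over `warmup_steps`, then hold
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```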
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"max_wallclock_seconds":599}
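A hedged sketch of a warmdown schedule: hold the base learning rate, then decay linearly to zero over the final warmdown_iters steps. The linear-to-zero shape and the total step count are assumptions; the record fixes only warmdown_iters=3000 (plus a 599-second wall-clock budget enforced separately).

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=3000):
    # constant LR until decay_start, then linear decay to 0 at total_steps
    decay_start = total_steps - warmdown_iters
    if step <= decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```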
Other
other
Byte-safe export revision: disables fp16 passthrough for late-K layers while keeping the fp16 embedding passthrough.
parameters: {"fp16_embed_passthrough":1,"fp16_late_k_layers":0}
Novel Contributions
- Long-context training at 2048 tokens instead of the 1024-token baseline
- Sliding-window final evaluation with stride 64 to improve context coverage during scoring
- FP16 tied-embedding export to preserve the highest-value tensor under quantization
- Byte-safe architecture adjustment using MLP hidden size 992 to offset the byte cost of the fp16 embeddings
- Muon-smoothed optimization with lower learning rates and warmdown tuned for the 2048-context regime
- Standalone record-folder submission artifact with Modal orchestration removed