val_bpb: 1.1779
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,929,105 bytes
Training Techniques
Architecture
tied embeddings
Uses tied input/output embeddings and preserves them in fp16 for better post-quantization fidelity.
parameters: {"tie_embeddings":1}
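A minimal plain-Python sketch of what tying buys here (illustrative only; the real model is a Transformer and its tensors are fp16/int8 arrays): one matrix serves both as the input embedding table and, transposed, as the output projection, so the model's largest tensor exists once — which is why preserving it in fp16 at export pays off.

```python
# One shared matrix W (vocab x d_model) backs both the input embedding
# lookup and the output (unembedding) projection.
VOCAB, D = 4, 3
W = [[0.1 * (i + j) for j in range(D)] for i in range(VOCAB)]  # shared weights

def embed(token_id):
    # input embedding: row lookup in the shared matrix
    return W[token_id]

def unembed(hidden):
    # output projection: logits = hidden @ W^T (same matrix, transposed)
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]

logits = unembed(embed(2))  # one logit per vocabulary entry
```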
KV head count
Uses fewer KV heads than attention heads (grouped-query attention), shrinking the KV cache.
parameters: {"num_heads":8,"num_kv_heads":4}
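With the recorded head counts, the query-to-KV-head mapping of grouped-query attention can be sketched as:

```python
# Consecutive query heads share one cached K/V projection.
NUM_HEADS, NUM_KV_HEADS = 8, 4
GROUP_SIZE = NUM_HEADS // NUM_KV_HEADS  # query heads per shared KV head

def kv_head_for(q_head: int) -> int:
    return q_head // GROUP_SIZE

mapping = [kv_head_for(h) for h in range(NUM_HEADS)]
# mapping == [0, 0, 1, 1, 2, 2, 3, 3]: the KV cache holds half as many heads
```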
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Evaluation
sliding window eval
Final evaluation slides overlapping windows over the text (stride 64) to improve context coverage during scoring.
parameters: {"stride":64,"batch_seqs":256}
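A hedged sketch of a stride-based window schedule. The window length and the convention that each window after the first scores only its trailing tokens are assumptions; the record fixes only stride=64 and batch_seqs=256.

```python
def window_starts(seq_len, window, stride):
    # left edges of the overlapping evaluation windows
    starts = list(range(0, max(seq_len - window, 0) + 1, stride))
    if seq_len > window and starts[-1] != seq_len - window:
        starts.append(seq_len - window)  # ensure the tail is covered
    return starts

# e.g. a 4096-token text evaluated with an assumed 2048-token window
starts = window_starts(seq_len=4096, window=2048, stride=64)
```

A smaller stride means more windows and therefore more compute, traded for each scored token seeing more left context.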
Quantization
int8
bits: 8
scope: model weights with fp16 tied embeddings passthrough
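A hedged sketch of this export path. The symmetric per-tensor scheme is an assumption (the record fixes only bits=8 and the fp16 embedding passthrough), and the tensor name used to route the embedding is hypothetical.

```python
def quantize_int8(weights):
    # symmetric per-tensor quantization: one scale, values in [-127, 127]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

def export_tensor(name, weights):
    # the tied embedding keeps fp16 precision; everything else goes int8
    if name == "tied_embedding":  # hypothetical tensor name
        return ("fp16", weights)
    return ("int8", quantize_int8(weights))
```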
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"muon_backend_steps":5,"tied_embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02}
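The momentum warmup implied by other_params can be sketched as below; the linear ramp shape is an assumption, since the record fixes only the start value (0.92), the final value (0.99), and the step count (1500).

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    # ramp momentum from `start` to `final` over `warmup_steps`, then hold
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```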
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"max_wallclock_seconds":599}
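A hedged sketch of a warmdown schedule: hold the base learning rate, then decay linearly to zero over the final warmdown_iters steps. The linear-to-zero shape and the total step count are assumptions; the record fixes only warmdown_iters=3000 (plus a 599-second wall-clock budget enforced separately).

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=3000):
    # constant LR until decay_start, then linear decay to 0 at total_steps
    decay_start = total_steps - warmdown_iters
    if step <= decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```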
Other
other
Byte-safe export revision: disables fp16 passthrough for late-K layers while keeping the fp16 embedding passthrough.
parameters: {"fp16_embed_passthrough":1,"fp16_late_k_layers":0}
Novel Contributions
- Long-context training at 2048 tokens instead of the 1024-token baseline
- Sliding-window final evaluation with stride 64 to improve context coverage during scoring
- FP16 tied-embedding export to preserve the highest-value tensor under quantization
- Byte-safe architecture adjustment using MLP hidden size 992 to offset the byte cost of the fp16 embeddings
- Muon-smoothed optimization with lower learning rates and warmdown tuned for the 2048-context regime
- Standalone record-folder submission artifact with Modal orchestration removed