val_bpb: 1.1537
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15331125 bytes
Training Techniques
Architecture
BigramHash
Adds bigram token-pair features on the input path.
parameters: {"bigram_vocab_size":4096,"bigram_dim":128}
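A minimal sketch of how such pair features can be computed. The multiplicative hash constant below is illustrative; the source specifies only the bucket count (4096) and embedding dim (128).

```python
import numpy as np

def bigram_hash_ids(token_ids, bigram_vocab_size=4096):
    """Hash each (previous, current) token pair to a bigram bucket.

    The multiplier 1000003 is an assumed mixing constant, not the
    run's actual hash.
    """
    prev = np.concatenate(([0], token_ids[:-1]))   # pad the first position
    return (prev * 1000003 + token_ids) % bigram_vocab_size

# The resulting ids index a (4096, 128) embedding table whose rows are
# added to the token embeddings on the input path.
tokens = np.array([17, 42, 42, 7])
bigram_ids = bigram_hash_ids(tokens)
```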
SmearGate
Blends each token representation with the previous token to smooth inputs.
parameters: null
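One plausible form of this smoothing, assuming a scalar sigmoid gate and a convex blend (the source records no parameters, so both are assumptions):

```python
import numpy as np

def smear_gate(x, gate_logit=0.0):
    """Blend each position with the previous one.

    Assumed form: y_t = (1 - g) * x_t + g * x_{t-1}, with a learned
    scalar gate g = sigmoid(gate_logit). Position 0 blends with zeros.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))          # scalar gate in (0, 1)
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)
    return (1.0 - g) * x + g * prev
```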
Quantization
mixed int6
bits: 6
scope: large MLP and attention matrices
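A sketch of one plausible reading of this scheme: symmetric per-tensor int6 quantization. Signed int6 spans [-32, 31]; a symmetric variant uses [-31, 31] so zero maps exactly to zero. Per-tensor (rather than per-channel) scaling is an assumption here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization to the range [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes and a scale."""
    return q.astype(np.float32) * scale
```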
Weight Averaging
SWA
parameters: {"start_frac":0.5,"every":200}
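With start_frac=0.5 and every=200, averaging begins halfway through training and the running mean is refreshed every 200 steps. A sketch:

```python
import numpy as np

class SWAAverager:
    """Running average of weights over the late, low-LR phase."""

    def __init__(self, total_steps, start_frac=0.5, every=200):
        self.start = int(start_frac * total_steps)
        self.every = every
        self.n = 0          # number of snapshots averaged so far
        self.avg = None

    def update(self, step, params):
        # Only average once past start_frac, on every-th steps.
        if step < self.start or step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [np.array(p, dtype=np.float64) for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (np.asarray(p, dtype=np.float64) - a) / self.n
```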
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"backend_steps":5}
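The momentum_warmup_* entries suggest a ramp from 0.92 to the final 0.99 over the first 1500 steps. A linear ramp shape is assumed; the source records only the endpoints and warmup length:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup: linear ramp from start to final, then constant."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)
```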
Evaluation
sliding-window eval
parameters: {"stride":64}
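With stride=64 and a 2048-token context, each window after the first scores only its new tokens; a sketch of the span planning (the helper name is hypothetical):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Plan sliding-window evaluation spans.

    Each span is (ctx_start, ctx_end, first_scored): the model sees
    tokens [ctx_start, ctx_end) but only tokens from first_scored on
    are scored, so strided and truncated tail windows never rescore
    tokens already counted.
    """
    spans = []
    scored = 0
    while scored < n_tokens:
        step = window if scored == 0 else stride
        end = min(scored + step, n_tokens)
        spans.append((max(0, end - window), end, scored))
        scored = end
    return spans
```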
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
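A common warmdown shape holds the base LR constant, then decays linearly to zero over the final warmdown_iters steps. The constant-then-linear shape is an assumption; the source records only warmdown_iters=3000:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    """Hold base_lr, then decay linearly to zero over the last
    warmdown_iters steps."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```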
Regularization
weight decay
parameters: {"adam_weight_decay":0.01,"muon_weight_decay":0.02}
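The two coefficients indicate decoupled (AdamW-style) weight decay applied per optimizer group, 0.01 for Adam-managed parameters and 0.02 for Muon-managed ones. A minimal sketch:

```python
import numpy as np

def apply_decoupled_weight_decay(params, lr, weight_decay):
    """Decoupled weight decay: shrink weights directly, separately
    from the gradient step, using the group's coefficient."""
    for p in params:
        p *= 1.0 - lr * weight_decay
```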
Initialization
OrthoInit
Orthogonal initialization is used for the model's weight matrices.
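Orthogonal initialization is commonly implemented via QR decomposition of a Gaussian matrix; a sketch of that standard recipe (the exact variant and gain used in this run are not specified):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=None):
    """Orthogonal init via QR of a random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))        # fix column signs for a uniform Q
    if rows < cols:
        q = q.T                     # orthonormal rows for wide matrices
    return gain * q[:rows, :cols]
```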
Other
other
Fixed the sliding-window evaluator to avoid rescoring overlapping tail tokens in truncated windows.
parameters: null
Novel Contributions
- Adds BigramHash token-pair features to the input path
- Introduces SmearGate input smoothing
- Uses mixed int6 export for large attention and MLP matrices
- Applies SWA over the late low-learning-rate phase
- Uses Muon with tuned momentum and weight decay
- Fixes the sliding-window evaluation bug that previously double-counted tail tokens
- Updates the canonical metric using an exact re-evaluation of the saved seed=1337 checkpoint