PR #443
closed
Bigram-Aware Context Modeling with Mixed-Precision Quantization (val_bpb: 1.1431)
by CREVIOS
val_bpb
1.1431
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.97 MB
Training Techniques
Architecture
BigramHash
Learned hashed embedding for consecutive token pairs to inject explicit bigram context.
parameters: {"bucket_count":10240,"dimension":128}
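A minimal sketch of what a hashed bigram embedding like this might look like. The multiplicative hash constant and the padding of position 0 with token id 0 are assumptions, not details from the PR:

```python
import numpy as np

class BigramHash:
    """Hashed embedding table for consecutive token pairs (illustrative
    sketch; hash mixing and init scale are assumptions)."""

    def __init__(self, bucket_count=10240, dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.bucket_count = bucket_count
        self.table = rng.normal(0.0, 0.02, size=(bucket_count, dim))

    def __call__(self, tokens):
        # tokens: 1-D array of token ids. Position 0 has no previous token,
        # so we pad with id 0 (a dedicated BOS id would also work).
        prev = np.concatenate([[0], tokens[:-1]])
        # Mix each (prev, cur) pair into one bucket with a multiplicative hash.
        idx = (prev * 1000003 + tokens) % self.bucket_count
        return self.table[idx]          # (seq_len, dim) bigram embeddings
```

The bucket table is tiny (10240 × 128 ≈ 1.3M parameters), so collisions are accepted in exchange for explicit token-pair context at low artifact cost.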
SmearGate
Per-dimension sigmoid gate blending current token embeddings with previous token embeddings.
parameters: null
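The gate described above can be sketched as follows; the exact parameterization (a single logit vector of size dim) is an assumption:

```python
import numpy as np

def smear_gate(x, gate_logits):
    """SmearGate sketch: blend each token's embedding with the previous
    token's embedding via a learned per-dimension sigmoid gate.

    x:           (seq_len, dim) token embeddings
    gate_logits: (dim,) learned logits; sigmoid maps them into (0, 1)
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # per-dimension blend weight
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                            # no previous token at position 0
    return (1.0 - g) * x + g * prev
```

At gate_logits = 0 every dimension blends 50/50; training can push individual dimensions toward pure current-token or pure previous-token content.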
MLP3x
Uses 3x MLP expansion to increase capacity within the artifact budget.
parameters: {"multiplier":3}
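For concreteness, a 3x-expansion MLP block might look like this; the choice of GELU as the nonlinearity is an assumption (the card does not name one):

```python
import numpy as np

def mlp_3x(x, w_in, w_out):
    """3x-expansion MLP sketch: project dim -> 3*dim, apply a nonlinearity
    (tanh-approximated GELU here, an assumption), project back to dim.

    w_in:  (dim, 3*dim), w_out: (3*dim, dim)
    """
    h = x @ w_in                                   # expand to 3*dim
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w_out                               # project back to dim
```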
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
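With 8 query heads over 4 KV heads, each KV head serves 8 // 4 = 2 query heads. A minimal sketch of the score computation (shapes only; no masking or softmax):

```python
import numpy as np

def gqa_scores(q, k, heads=8, kv_heads=4):
    """Grouped-query attention score sketch: expand the KV heads so each
    serves heads // kv_heads consecutive query heads.

    q: (heads, seq, head_dim), k: (kv_heads, seq, head_dim)
    """
    group = heads // kv_heads
    k_expanded = np.repeat(k, group, axis=0)       # (heads, seq, head_dim)
    d = q.shape[-1]
    return q @ k_expanded.transpose(0, 2, 1) / np.sqrt(d)
```

Halving the KV heads halves the K/V projection weights, which matters under a hard artifact-size cap.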
U-Net skip connections
Encoder-decoder style skip connections between matching depths.
parameters: {"encoder_layers":5,"decoder_layers":5}
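The skip-connection wiring can be sketched as below; pairing the deepest encoder activation with the first decoder layer, and plain addition as the combine rule, are assumptions:

```python
def unet_transformer_pass(x, enc_layers, dec_layers):
    """U-Net-style skip sketch: activations saved from each encoder layer
    are added back into the decoder layer at the matching depth.

    enc_layers, dec_layers: equal-length lists of callables x -> x.
    """
    skips = []
    for layer in enc_layers:
        x = layer(x)
        skips.append(x)                 # remember activation at this depth
    for layer in dec_layers:
        # pop() pairs the deepest encoder output with the first decoder
        # layer, then walks back out toward the shallowest skip.
        x = layer(x + skips.pop())
    return x
```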
residual mixing
Learned mixing between running hidden state and original post-embedding representation.
parameters: null
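The residual mixing above might reduce to something like this; using a single learned scalar per layer (rather than per-dimension weights) is an assumption:

```python
import numpy as np

def residual_mix(h, x0, alpha_logit):
    """Residual-mixing sketch: blend the running hidden state h with the
    original post-embedding representation x0 via a learned weight.
    """
    a = 1.0 / (1.0 + np.exp(-alpha_logit))   # mixing weight in (0, 1)
    return a * h + (1.0 - a) * x0
```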
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6, embeddings FP16, control tensors FP32
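A sketch of symmetric quantization at the two bit widths used here; per-tensor scaling is an assumption (the PR may scale per-channel). int5 gives levels in [-15, 15], int6 in [-31, 31]:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization sketch: map weights onto signed integer
    levels with a single absmax-derived scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Mixed-precision policy from the card: MLP weights int5, attention int6;
# embeddings stay FP16 and control tensors FP32 (not quantized here).
BITS = {"mlp": 5, "attn": 6}
```

The point of the split is that MLP weights tolerate the coarser int5 grid, and the bits saved there fund extra capacity elsewhere under the 16 MB cap.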
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
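The momentum warmup parameters suggest a schedule like the following; linear interpolation between the endpoints is an assumption:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup sketch for Muon: ramp momentum from 0.92 to 0.99
    over the first 1500 steps, then hold it at 0.99."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```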
Weight Averaging
SWA
parameters: {"checkpoints_averaged":24,"start_fraction":0.4,"every_steps":50}
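These parameters describe averaging checkpoints taken every 50 steps once 40% of training has elapsed. A minimal running-mean sketch:

```python
import numpy as np

class SWA:
    """Stochastic weight averaging sketch: keep an incremental running mean
    of parameter snapshots taken on a fixed step cadence."""

    def __init__(self, start_fraction=0.4, every_steps=50):
        self.start_fraction = start_fraction
        self.every_steps = every_steps
        self.mean = None
        self.count = 0

    def maybe_update(self, step, total_steps, params):
        # Skip steps before the start fraction or off the cadence.
        if step < self.start_fraction * total_steps or step % self.every_steps:
            return
        self.count += 1
        if self.mean is None:
            self.mean = np.array(params, dtype=np.float64)
        else:
            # Incremental mean avoids storing all 24 checkpoints at once.
            self.mean += (params - self.mean) / self.count
```

Averaging flattens the loss landscape around the final weights, which is plausibly why the card credits it with quantization robustness.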
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
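With stride 64 and seq_len 2048, each window scores only the tokens not covered by the previous window, so almost every token is predicted with close to 2048 tokens of context. A sketch of the window schedule:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Sliding-window eval sketch: slide a seq_len context forward by
    `stride`, scoring only the tokens that are new in each window.

    Returns (begin, end, score_from) triples: the window covers tokens
    [begin, end) and contributes loss only for tokens [score_from, end).
    """
    windows = []
    begin, scored_to = 0, 0
    while scored_to < n_tokens:
        end = min(begin + seq_len, n_tokens)
        windows.append((begin, end, scored_to))
        scored_to = end          # no token is ever scored twice
        begin += stride
    return windows
```

The trade-off is cost: each scored token requires a near-full-length forward pass, roughly seq_len / stride = 32 times more compute than disjoint chunks.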
Initialization
Orthogonal init
Orthogonal initialization with gain 1.0 and muP output scaling.
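A sketch of the two pieces named here. The QR-based construction is the standard way to draw orthogonal matrices; the muP output multiplier shown (base_width / width, with a hypothetical base_width) is an assumption about how the scaling is applied:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal init sketch: QR-decompose a Gaussian matrix and keep Q,
    sign-corrected so the draw is uniform over orthogonal matrices."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))     # fix column signs
    return gain * q

def mup_output_scale(width, base_width=256):
    """muP-style output multiplier sketch (base_width is hypothetical):
    scale output logits by base_width / width so they stay O(1) as the
    model widens."""
    return base_width / width
```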
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
linear warmdown
parameters: {"warmdown_steps":3000}
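A linear warmdown with these parameters amounts to holding the base learning rate and decaying it to zero over the final 3000 steps; no warmup appears on the card, so none is modeled in this sketch:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Linear warmdown sketch: constant base_lr, then a linear ramp to
    zero over the last warmdown_steps steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```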
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"prune_frac":0.03}
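Pruning 3% of weights by magnitude can be sketched as below; applying the threshold per tensor (rather than globally across the model) is an assumption:

```python
import numpy as np

def magnitude_prune(w, prune_frac=0.03):
    """Magnitude pruning sketch: zero the prune_frac fraction of
    smallest-|w| entries. The resulting zeros compress well after
    quantization + zstd."""
    k = int(prune_frac * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest magnitude becomes the pruning threshold.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```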
Novel Contributions
- BigramHash embedding to inject explicit token-pair context
- SmearGate for learned per-dimension blending of adjacent token embeddings
- Mixed-precision quantization with int5 for MLP weights and int6 for attention weights
- Using int5 savings to fund an additional transformer layer under the 16MB cap
- U-Net style skip connections and residual mixing in a transformer
- SWA over the last portion of training to improve quantization robustness and compression
- Sliding-window evaluation with stride 64 to score tokens with much longer context