PR #447

open

Bigram-Aware Context Modeling with Mixed-Precision Quantization (val_bpb: 1.1431)

by CREVIOS
val_bpb
1.1431
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.97 MB

Training Techniques

Architecture
BigramHash
Learned hashed embedding for consecutive token pairs to inject explicit bigram context.
parameters: {"buckets":10240,"dimension":128}
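A minimal sketch of the idea, assuming a simple multiplicative hash and an additive embedding lookup (the PR specifies the bucket count and dimension but not the hash function itself):

```python
import numpy as np

def bigram_hash_ids(ids, buckets=10240):
    """Hash each (previous, current) token pair into one of `buckets` ids.
    The multiplier 1000003 is an illustrative choice, not the PR's hash."""
    prev = np.roll(ids, 1)
    prev[0] = ids[0]  # the first position pairs with itself
    return (prev * 1000003 + ids) % buckets

# The bucket id indexes a learned (buckets, 128) table; the looked-up vector
# would be combined with the token's ordinary input embedding.
rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(10240, 128))
ids = np.array([5, 17, 5, 17])
bigram_vecs = table[bigram_hash_ids(ids)]  # shape (4, 128)
```

Repeated bigrams land in the same bucket, so frequent token pairs get a dedicated learned vector at the cost of a fixed-size table.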
SmearGate
Per-dimension sigmoid gate blending current token embeddings with previous token embeddings.
parameters: null
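A sketch of the gate, assuming one learned logit per embedding dimension (the PR reports no parameters for this component, so the parameterization is an assumption):

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token's embedding with its predecessor's via a learned
    per-dimension sigmoid gate. x: (T, dim); gate_logits: (dim,)."""
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                           # first token has no predecessor
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # per-dimension gate in (0, 1)
    return g * x + (1.0 - g) * prev
```

At a logit of 0 each dimension is an even mix of current and previous token; large positive logits recover the plain embedding.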
MLP3x
Uses 3x MLP expansion to increase capacity within the artifact budget.
parameters: {"multiplier":3,"hidden_dim":1536}
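With hidden_dim 1536 and a 3x multiplier, the implied model width is 512. A sketch of the block (the activation is an assumption; the PR does not name one):

```python
import numpy as np

def mlp_3x(x, w_in, w_out):
    """Position-wise MLP with 3x expansion: 512 -> 1536 -> 512.
    ReLU is assumed here purely for illustration."""
    h = np.maximum(x @ w_in, 0.0)
    return h @ w_out

rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.02, size=(512, 1536))
w_out = rng.normal(scale=0.02, size=(1536, 512))
y = mlp_3x(rng.normal(size=(4, 512)), w_in, w_out)  # shape (4, 512)
```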
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
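The KV-sharing step of grouped-query attention can be sketched as a repeat along the head axis; with 8 query heads and 4 KV heads, each KV head serves 2 query heads, halving the KV projection weights and cache:

```python
import numpy as np

def expand_kv(kv, heads=8, kv_heads=4):
    """GQA sketch: broadcast each KV head to heads // kv_heads query heads.
    kv: (kv_heads, T, head_dim) -> (heads, T, head_dim)."""
    return np.repeat(kv, heads // kv_heads, axis=0)
```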
depth
10-layer transformer with encoder-decoder style skip connections.
parameters: {"layers":10}
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6, embeddings fp16, some control tensors fp32
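A sketch of symmetric signed quantization at the two bit-widths used here; the PR does not state the grouping granularity, so per-tensor scales are an assumption:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    int5 maps to [-15, 15]; int6 maps to [-31, 31]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

MLP weights (int5) take the larger error budget while attention weights (int6) keep an extra bit; embeddings and control tensors stay in float, matching the scope above.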
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
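The momentum warmup can be sketched as below; the endpoints (0.92 → 0.99) and the 1500-step window come from the PR, while the linear shape is an assumption:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `final` over `warmup_steps`,
    then hold it at `final`. Linear interpolation assumed."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (final - start)
```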
Weight Averaging
SWA
parameters: {"checkpoints":24,"start_fraction":0.4,"every_steps":50}
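With start_fraction 0.4 the averaging begins 40% of the way through training, sampling a snapshot every 50 steps for 24 checkpoints. A running-mean sketch (a full SWA pipeline would also re-estimate any normalization statistics afterwards):

```python
import numpy as np

class RunningSWA:
    """Incremental average of parameter snapshots (stochastic weight averaging)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64).copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n
```

Averaged weights tend to sit in flatter regions of the loss surface, which plausibly explains the claimed robustness to quantization.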
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
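The window layout can be sketched as follows: the first window scores all of its tokens, and every later window re-reads seq_len − stride tokens of context while scoring only its final `stride` tokens, so most tokens are scored with nearly `seq_len` tokens of preceding context:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Return (window_start, score_from, score_to) triples so that every
    token is scored exactly once and no window exceeds seq_len tokens."""
    wins = [(0, 0, min(seq_len, n_tokens))]
    pos = wins[0][2]
    while pos < n_tokens:
        start = pos - (seq_len - stride)
        end = min(pos + stride, n_tokens)
        wins.append((start, pos, end))
        pos = end
    return wins
```

The small stride trades a ~32x increase in forward passes for tighter bpb, which only matters at evaluation time.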
Initialization
Orthogonal init
Gain 1.0 with muP output scaling.
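A QR-based sketch of orthogonal init for a square weight (muP output scaling, which rescales the output projection with width, is left out here):

```python
import numpy as np

def orthogonal_init(n, gain=1.0, rng=None):
    """Orthogonal (n, n) weight via QR decomposition of a Gaussian matrix."""
    rng = rng or np.random.default_rng(0)
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))  # fix column signs for a uniform distribution
    return gain * q
```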
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
linear warmdown
parameters: {"warmdown_steps":3000}
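The schedule is a constant learning rate followed by a linear decay to zero over the final 3000 steps, sketched as:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Linear warmdown: hold base_lr, then decay linearly to zero
    over the last `warmdown_steps` steps of training."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return base_lr
    return base_lr * max(remaining, 0) / warmdown_steps
```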
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"fraction":0.03}
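A sketch of magnitude pruning at the stated 3% fraction; the PR does not say whether the threshold is global or per tensor, so per-tensor is an assumption:

```python
import numpy as np

def magnitude_prune(w, fraction=0.03):
    """Zero out the smallest-magnitude `fraction` of a tensor's weights."""
    k = int(w.size * fraction)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```

Beyond regularization, a 3% zeroed fraction plausibly also helps the artifact budget: runs of zeros in the quantized tensors compress well under zstd.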

Novel Contributions

  • BigramHash embedding to inject explicit token-pair context
  • SmearGate for learned blending of adjacent token embeddings
  • Mixed-precision quantization with int5 for MLP weights and int6 for attention weights
  • 3x MLP expansion and an extra transformer layer, funded by quantization savings
  • SWA over the final training phase to improve quantization robustness and compression
  • Sliding-window evaluation with stride 64 to score tokens with much longer effective context