PR #447

open

Bigram-Aware Context Modeling with Mixed-Precision Quantization (val_bpb: 1.1431)

by CREVIOS
val_bpb
1.1431
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.97 MB

Training Techniques

Architecture
BigramHash
Learned hashed embedding for consecutive token pairs to inject explicit bigram context.
parameters: {"buckets":10240,"dimension":128}
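A minimal sketch of the idea, assuming a simple multiplicative hash and an additive embedding lookup (the PR specifies the bucket count and dimension but not the hash function itself):

```python
import numpy as np

def bigram_hash_ids(ids, buckets=10240):
    """Hash each (previous, current) token pair into one of `buckets` ids.
    The multiplier 1000003 is an illustrative choice, not the PR's hash."""
    prev = np.roll(ids, 1)
    prev[0] = ids[0]  # the first position pairs with itself
    return (prev * 1000003 + ids) % buckets

# The bucket id indexes a learned (buckets, 128) table; the looked-up vector
# would be combined with the token's ordinary input embedding.
rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(10240, 128))
ids = np.array([5, 17, 5, 17])
bigram_vecs = table[bigram_hash_ids(ids)]  # shape (4, 128)
```

Repeated bigrams land in the same bucket, so frequent token pairs get a dedicated learned vector at the cost of a fixed-size table.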
SmearGate
Per-dimension sigmoid gate blending current token embeddings with previous token embeddings.
parameters: null
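A sketch of the gate, assuming one learned logit per embedding dimension (the PR reports no parameters for this component, so the parameterization is an assumption):

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token's embedding with its predecessor's via a learned
    per-dimension sigmoid gate. x: (T, dim); gate_logits: (dim,)."""
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                           # first token has no predecessor
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # per-dimension gate in (0, 1)
    return g * x + (1.0 - g) * prev
```

At a logit of 0 each dimension is an even mix of current and previous token; large positive logits recover the plain embedding.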
MLP3x
Uses 3x MLP expansion to increase capacity within the artifact budget.
parameters: {"multiplier":3,"hidden_dim":1536}
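With hidden_dim 1536 and a 3x multiplier, the implied model width is 512. A sketch of the block (the activation is an assumption; the PR does not name one):

```python
import numpy as np

def mlp_3x(x, w_in, w_out):
    """Position-wise MLP with 3x expansion: 512 -> 1536 -> 512.
    ReLU is assumed here purely for illustration."""
    h = np.maximum(x @ w_in, 0.0)
    return h @ w_out

rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.02, size=(512, 1536))
w_out = rng.normal(scale=0.02, size=(1536, 512))
y = mlp_3x(rng.normal(size=(4, 512)), w_in, w_out)  # shape (4, 512)
```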
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
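The KV-sharing step of grouped-query attention can be sketched as a repeat along the head axis; with 8 query heads and 4 KV heads, each KV head serves 2 query heads, halving the KV projection weights and cache:

```python
import numpy as np

def expand_kv(kv, heads=8, kv_heads=4):
    """GQA sketch: broadcast each KV head to heads // kv_heads query heads.
    kv: (kv_heads, T, head_dim) -> (heads, T, head_dim)."""
    return np.repeat(kv, heads // kv_heads, axis=0)
```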
depth
10-layer transformer with encoder-decoder style skip connections.
parameters: {"layers":10}
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6, embeddings fp16, some control tensors fp32
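A sketch of symmetric signed quantization at the two bit-widths used here; the PR does not state the grouping granularity, so per-tensor scales are an assumption:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    int5 maps to [-15, 15]; int6 maps to [-31, 31]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

MLP weights (int5) take the larger error budget while attention weights (int6) keep an extra bit; embeddings and control tensors stay in float, matching the scope above.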
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
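The momentum warmup can be sketched as below; the endpoints (0.92 → 0.99) and the 1500-step window come from the PR, while the linear shape is an assumption:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `final` over `warmup_steps`,
    then hold it at `final`. Linear interpolation assumed."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (final - start)
```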
Weight Averaging
SWA
parameters: {"checkpoints":24,"start_fraction":0.4,"every_steps":50}
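With start_fraction 0.4 the averaging begins 40% of the way through training, sampling a snapshot every 50 steps for 24 checkpoints. A running-mean sketch (a full SWA pipeline would also re-estimate any normalization statistics afterwards):

```python
import numpy as np

class RunningSWA:
    """Incremental average of parameter snapshots (stochastic weight averaging)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64).copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n
```

Averaged weights tend to sit in flatter regions of the loss surface, which plausibly explains the claimed robustness to quantization.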
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
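The window layout can be sketched as follows: the first window scores all of its tokens, and every later window re-reads seq_len − stride tokens of context while scoring only its final `stride` tokens, so most tokens are scored with nearly `seq_len` tokens of preceding context:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Return (window_start, score_from, score_to) triples so that every
    token is scored exactly once and no window exceeds seq_len tokens."""
    wins = [(0, 0, min(seq_len, n_tokens))]
    pos = wins[0][2]
    while pos < n_tokens:
        start = pos - (seq_len - stride)
        end = min(pos + stride, n_tokens)
        wins.append((start, pos, end))
        pos = end
    return wins
```

The small stride trades a ~32x increase in forward passes for tighter bpb, which only matters at evaluation time.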
Initialization
Orthogonal init
Gain 1.0 with muP output scaling.
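A QR-based sketch of orthogonal init for a square weight (muP output scaling, which rescales the output projection with width, is left out here):

```python
import numpy as np

def orthogonal_init(n, gain=1.0, rng=None):
    """Orthogonal (n, n) weight via QR decomposition of a Gaussian matrix."""
    rng = rng or np.random.default_rng(0)
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))  # fix column signs for a uniform distribution
    return gain * q
```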
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
linear warmdown
parameters: {"warmdown_steps":3000}
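The schedule is a constant learning rate followed by a linear decay to zero over the final 3000 steps, sketched as:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Linear warmdown: hold base_lr, then decay linearly to zero
    over the last `warmdown_steps` steps of training."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return base_lr
    return base_lr * max(remaining, 0) / warmdown_steps
```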
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"fraction":0.03}
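A sketch of magnitude pruning at the stated 3% fraction; the PR does not say whether the threshold is global or per tensor, so per-tensor is an assumption:

```python
import numpy as np

def magnitude_prune(w, fraction=0.03):
    """Zero out the smallest-magnitude `fraction` of a tensor's weights."""
    k = int(w.size * fraction)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```

Beyond regularization, a 3% zeroed fraction plausibly also helps the artifact budget: runs of zeros in the quantized tensors compress well under zstd.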

Novel Contributions

  • BigramHash embedding to inject explicit token-pair context
  • SmearGate for learned blending of adjacent token embeddings
  • Mixed-precision quantization with int5 for MLP weights and int6 for attention weights
  • 3x MLP expansion and an extra transformer layer, funded by quantization savings
  • SWA over the final training phase to improve quantization robustness and compression
  • Sliding-window evaluation with stride 64 to score tokens with much longer effective context