PR #443
closed
Bigram-Aware Context Modeling with Mixed-Precision Quantization (val_bpb: 1.1431)
by CREVIOS
val_bpb
1.1431
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.97 MB
Training Techniques
Architecture
BigramHash
Learned hashed embedding for consecutive token pairs to inject explicit bigram context.
parameters: {"bucket_count":10240,"dimension":128}
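A minimal sketch of what a hashed bigram embedding like this might look like. The multiplicative hash constant and the padding of position 0 with token id 0 are assumptions, not details from the PR:

```python
import numpy as np

class BigramHash:
    """Hashed embedding table for consecutive token pairs (illustrative
    sketch; hash mixing and init scale are assumptions)."""

    def __init__(self, bucket_count=10240, dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.bucket_count = bucket_count
        self.table = rng.normal(0.0, 0.02, size=(bucket_count, dim))

    def __call__(self, tokens):
        # tokens: 1-D array of token ids. Position 0 has no previous token,
        # so we pad with id 0 (a dedicated BOS id would also work).
        prev = np.concatenate([[0], tokens[:-1]])
        # Mix each (prev, cur) pair into one bucket with a multiplicative hash.
        idx = (prev * 1000003 + tokens) % self.bucket_count
        return self.table[idx]          # (seq_len, dim) bigram embeddings
```

The bucket table is tiny (10240 × 128 ≈ 1.3M parameters), so collisions are accepted in exchange for explicit token-pair context at low artifact cost.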
SmearGate
Per-dimension sigmoid gate blending current token embeddings with previous token embeddings.
parameters: null
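The gate described above can be sketched as follows; the exact parameterization (a single logit vector of size dim) is an assumption:

```python
import numpy as np

def smear_gate(x, gate_logits):
    """SmearGate sketch: blend each token's embedding with the previous
    token's embedding via a learned per-dimension sigmoid gate.

    x:           (seq_len, dim) token embeddings
    gate_logits: (dim,) learned logits; sigmoid maps them into (0, 1)
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # per-dimension blend weight
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                            # no previous token at position 0
    return (1.0 - g) * x + g * prev
```

At gate_logits = 0 every dimension blends 50/50; training can push individual dimensions toward pure current-token or pure previous-token content.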
MLP3x
Uses 3x MLP expansion to increase capacity within the artifact budget.
parameters: {"multiplier":3}
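For concreteness, a 3x-expansion MLP block might look like this; the choice of GELU as the nonlinearity is an assumption (the card does not name one):

```python
import numpy as np

def mlp_3x(x, w_in, w_out):
    """3x-expansion MLP sketch: project dim -> 3*dim, apply a nonlinearity
    (tanh-approximated GELU here, an assumption), project back to dim.

    w_in:  (dim, 3*dim), w_out: (3*dim, dim)
    """
    h = x @ w_in                                   # expand to 3*dim
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w_out                               # project back to dim
```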
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
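With 8 query heads over 4 KV heads, each KV head serves 8 // 4 = 2 query heads. A minimal sketch of the score computation (shapes only; no masking or softmax):

```python
import numpy as np

def gqa_scores(q, k, heads=8, kv_heads=4):
    """Grouped-query attention score sketch: expand the KV heads so each
    serves heads // kv_heads consecutive query heads.

    q: (heads, seq, head_dim), k: (kv_heads, seq, head_dim)
    """
    group = heads // kv_heads
    k_expanded = np.repeat(k, group, axis=0)       # (heads, seq, head_dim)
    d = q.shape[-1]
    return q @ k_expanded.transpose(0, 2, 1) / np.sqrt(d)
```

Halving the KV heads halves the K/V projection weights, which matters under a hard artifact-size cap.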
U-Net skip connections
Encoder-decoder style skip connections between matching depths.
parameters: {"encoder_layers":5,"decoder_layers":5}
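The skip-connection wiring can be sketched as below; pairing the deepest encoder activation with the first decoder layer, and plain addition as the combine rule, are assumptions:

```python
def unet_transformer_pass(x, enc_layers, dec_layers):
    """U-Net-style skip sketch: activations saved from each encoder layer
    are added back into the decoder layer at the matching depth.

    enc_layers, dec_layers: equal-length lists of callables x -> x.
    """
    skips = []
    for layer in enc_layers:
        x = layer(x)
        skips.append(x)                 # remember activation at this depth
    for layer in dec_layers:
        # pop() pairs the deepest encoder output with the first decoder
        # layer, then walks back out toward the shallowest skip.
        x = layer(x + skips.pop())
    return x
```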
residual mixing
Learned mixing between running hidden state and original post-embedding representation.
parameters: null
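The residual mixing above might reduce to something like this; using a single learned scalar per layer (rather than per-dimension weights) is an assumption:

```python
import numpy as np

def residual_mix(h, x0, alpha_logit):
    """Residual-mixing sketch: blend the running hidden state h with the
    original post-embedding representation x0 via a learned weight.
    """
    a = 1.0 / (1.0 + np.exp(-alpha_logit))   # mixing weight in (0, 1)
    return a * h + (1.0 - a) * x0
```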
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6, embeddings FP16, control tensors FP32
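A sketch of symmetric quantization at the two bit widths used here; per-tensor scaling is an assumption (the PR may scale per-channel). int5 gives levels in [-15, 15], int6 in [-31, 31]:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization sketch: map weights onto signed integer
    levels with a single absmax-derived scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Mixed-precision policy from the card: MLP weights int5, attention int6;
# embeddings stay FP16 and control tensors FP32 (not quantized here).
BITS = {"mlp": 5, "attn": 6}
```

The point of the split is that MLP weights tolerate the coarser int5 grid, and the bits saved there fund extra capacity elsewhere under the 16 MB cap.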
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
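The momentum warmup parameters suggest a schedule like the following; linear interpolation between the endpoints is an assumption:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup sketch for Muon: ramp momentum from 0.92 to 0.99
    over the first 1500 steps, then hold it at 0.99."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```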
Weight Averaging
SWA
parameters: {"checkpoints_averaged":24,"start_fraction":0.4,"every_steps":50}
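These parameters describe averaging checkpoints taken every 50 steps once 40% of training has elapsed. A minimal running-mean sketch:

```python
import numpy as np

class SWA:
    """Stochastic weight averaging sketch: keep an incremental running mean
    of parameter snapshots taken on a fixed step cadence."""

    def __init__(self, start_fraction=0.4, every_steps=50):
        self.start_fraction = start_fraction
        self.every_steps = every_steps
        self.mean = None
        self.count = 0

    def maybe_update(self, step, total_steps, params):
        # Skip steps before the start fraction or off the cadence.
        if step < self.start_fraction * total_steps or step % self.every_steps:
            return
        self.count += 1
        if self.mean is None:
            self.mean = np.array(params, dtype=np.float64)
        else:
            # Incremental mean avoids storing all 24 checkpoints at once.
            self.mean += (params - self.mean) / self.count
```

Averaging flattens the loss landscape around the final weights, which is plausibly why the card credits it with quantization robustness.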
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
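With stride 64 and seq_len 2048, each window scores only the tokens not covered by the previous window, so almost every token is predicted with close to 2048 tokens of context. A sketch of the window schedule:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Sliding-window eval sketch: slide a seq_len context forward by
    `stride`, scoring only the tokens that are new in each window.

    Returns (begin, end, score_from) triples: the window covers tokens
    [begin, end) and contributes loss only for tokens [score_from, end).
    """
    windows = []
    begin, scored_to = 0, 0
    while scored_to < n_tokens:
        end = min(begin + seq_len, n_tokens)
        windows.append((begin, end, scored_to))
        scored_to = end          # no token is ever scored twice
        begin += stride
    return windows
```

The trade-off is cost: each scored token requires a near-full-length forward pass, roughly seq_len / stride = 32 times more compute than disjoint chunks.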
Initialization
Orthogonal init
Orthogonal initialization with gain 1.0 and muP output scaling.
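A sketch of the two pieces named here. The QR-based construction is the standard way to draw orthogonal matrices; the muP output multiplier shown (base_width / width, with a hypothetical base_width) is an assumption about how the scaling is applied:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal init sketch: QR-decompose a Gaussian matrix and keep Q,
    sign-corrected so the draw is uniform over orthogonal matrices."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))     # fix column signs
    return gain * q

def mup_output_scale(width, base_width=256):
    """muP-style output multiplier sketch (base_width is hypothetical):
    scale output logits by base_width / width so they stay O(1) as the
    model widens."""
    return base_width / width
```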
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
linear warmdown
parameters: {"warmdown_steps":3000}
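A linear warmdown with these parameters amounts to holding the base learning rate and decaying it to zero over the final 3000 steps; no warmup appears on the card, so none is modeled in this sketch:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Linear warmdown sketch: constant base_lr, then a linear ramp to
    zero over the last warmdown_steps steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```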
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"prune_frac":0.03}
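Pruning 3% of weights by magnitude can be sketched as below; applying the threshold per tensor (rather than globally across the model) is an assumption:

```python
import numpy as np

def magnitude_prune(w, prune_frac=0.03):
    """Magnitude pruning sketch: zero the prune_frac fraction of
    smallest-|w| entries. The resulting zeros compress well after
    quantization + zstd."""
    k = int(prune_frac * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest magnitude becomes the pruning threshold.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```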
Novel Contributions
- BigramHash embedding to inject explicit token-pair context
- SmearGate for learned per-dimension blending of adjacent token embeddings
- Mixed-precision quantization with int5 for MLP weights and int6 for attention weights
- Using int5 savings to fund an additional transformer layer under the 16MB cap
- U-Net style skip connections and residual mixing in a transformer
- SWA over the last portion of training to improve quantization robustness and compression
- Sliding-window evaluation with stride 64 to score tokens with much longer context