PR #305
open · 12L Full-INT4 (MLP + Attn) + BigramHash(4096) — val_bpb 1.1672
by Naazimsnh02
val_bpb
1.1672
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.4 MB
Training Techniques
Quantization
int4
bits: 4
scope: MLP and attention weights
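A minimal NumPy sketch of what this entry describes: symmetric group INT4 quantization with two weights nibble-packed per byte and one fp16 scale per group of 64 (the contribution list gives group size 64 and fp16 scales; everything else here is an illustrative assumption, not the PR's code):

```python
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 64):
    """Symmetric INT4 quantization: per-group fp16 scales, two values per byte."""
    flat = w.reshape(-1, group_size)
    # Per-group scale maps the max magnitude onto the INT4 range [-7, 7];
    # the floor guards all-zero groups against division by zero.
    scales = np.maximum(np.abs(flat).max(axis=1, keepdims=True) / 7.0, 1e-4).astype(np.float16)
    q = np.clip(np.round(flat / scales.astype(np.float32)), -7, 7).astype(np.int8)
    # Pack two signed 4-bit values per byte (offset by 8 to make them unsigned).
    u = (q + 8).astype(np.uint8)
    packed = (u[:, 0::2] << 4) | u[:, 1::2]
    return packed, scales

def dequantize_int4_grouped(packed: np.ndarray, scales: np.ndarray, shape):
    """Unpack nibbles, undo the +8 offset, and rescale per group."""
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, 0::2], q[:, 1::2] = hi, lo
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(shape)
```

At group size 64, the fp16 scales add 16 bits per 64 weights, i.e. about 0.25 extra bits per weight on top of the 4-bit payload.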
Architecture
BigramHash
Adds a hashed bigram embedding table as an auxiliary token-interaction representation
parameters: {"vocab":4096,"dim":64}
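The PR doesn't show the hashing scheme, so the sketch below uses a simple multiplicative hash as a stand-in; the table size of 4096 and dim of 64 follow the listed parameters:

```python
import numpy as np

def bigram_hash_lookup(tokens: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Look up a hashed (prev_token, token) embedding for every position.

    `table` has shape (4096, 64) per the listed parameters; the multiplier-31
    hash is an illustrative assumption, not the PR's actual function.
    """
    prev = np.roll(tokens, 1)
    prev[0] = 0  # first position has no predecessor
    idx = (prev.astype(np.int64) * 31 + tokens) % table.shape[0]
    return table[idx]  # (seq_len, 64), added to the hidden stream elsewhere
```

At 4096 × 64 fp16 entries this table costs about 0.5 MB, which is consistent with the contribution note about shrinking it to fit the 16 MB budget.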
weight tying
Tied embeddings
parameters: {"dim":512}
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
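With 8 attention heads sharing 4 KV heads, each cached K/V head serves two query heads. A minimal sketch of the head expansion (shapes are illustrative):

```python
import numpy as np

def expand_kv_heads(kv: np.ndarray, n_heads: int = 8, n_kv_heads: int = 4) -> np.ndarray:
    """Repeat each KV head so 8 query heads can attend over 4 shared KV heads.

    kv: (n_kv_heads, seq_len, head_dim) -> (n_heads, seq_len, head_dim)
    """
    assert n_heads % n_kv_heads == 0
    return np.repeat(kv, n_heads // n_kv_heads, axis=0)
```

Halving the KV head count halves the K/V projection parameters, one place the weight budget saved by quantization can come from.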
MLP3x
3x MLP expansion
parameters: {"hidden":1536}
RoPE
Rotary positional encoding
parameters: {"base":10000}
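RoPE rotates each pair of channels by an angle proportional to position, with per-pair frequencies derived from the listed base of 10000; a minimal sketch:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate channel pairs of x (seq_len, dim) by position-dependent angles."""
    seq, dim = x.shape
    pos = np.arange(seq)[:, None]
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per channel pair
    ang = pos * freqs                               # (seq, dim // 2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[:, 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```

Because each pair is only rotated, vector norms are preserved and relative position falls out of the dot product between rotated queries and keys.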
U-Net skip connections
Symmetric skip connections between encoder and decoder halves across layers
parameters: {"layers":12}
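The exact wiring isn't shown; a common symmetric scheme over 12 blocks adds the output of each first-half block back into the input of its mirrored second-half block, sketched here with plain callables:

```python
def unet_forward(x, blocks):
    """Run blocks with symmetric skips: activations from the first half are
    saved and added back, last-saved-first, before each second-half block
    (a common scheme; the PR's exact wiring is an assumption)."""
    n = len(blocks)
    saved = []
    for blk in blocks[: n // 2]:      # encoder half
        x = blk(x)
        saved.append(x)
    for blk in blocks[n // 2:]:       # decoder half
        x = x + saved.pop()           # skip from the mirrored encoder block
        x = blk(x)
    return x
```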
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adamw_for":"scalars, embeddings"}
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50}
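A sketch of the listed SWA config as a running average: snapshots are folded in every 50 steps once 40% of training has passed (the parameter-dict interface is illustrative):

```python
import numpy as np

class SWA:
    """Stochastic weight averaging: running mean of parameter snapshots
    taken every `every` steps after `start_frac` of training."""
    def __init__(self, total_steps: int, start_frac: float = 0.4, every: int = 50):
        self.start = int(total_steps * start_frac)
        self.every = every
        self.avg, self.n = None, 0

    def update(self, step: int, params: dict) -> None:
        if step < self.start or step % self.every != 0:
            return  # before the SWA window, or off the sampling cadence
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64) for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n  # incremental mean
```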
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
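Sliding-window evaluation re-scores a long sequence in overlapping windows and keeps only the losses for the final `stride` tokens of each window, so most tokens are conditioned on near-full context. A sketch with the listed stride of 64 (the `nll_fn` interface is an assumption):

```python
def sliding_window_nll(nll_fn, tokens, window: int = 2048, stride: int = 64) -> float:
    """Mean per-token NLL over `tokens`, scored in overlapping windows.

    nll_fn(chunk) must return per-token NLLs for chunk[1:] given its prefixes.
    """
    nlls, scored = [], 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        chunk = tokens[max(0, end - window): end]
        per_tok = nll_fn(chunk)
        new = end - scored                       # positions not yet scored
        nlls.extend(per_tok if scored == 0 else per_tok[-new:])
        scored = end
        if end == len(tokens):
            break
    return sum(nlls) / len(nlls)
```

Dividing the mean NLL (in nats) by ln(2) and by the average bytes per token gives bits-per-byte, the val_bpb metric above.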
Initialization
Orthogonal
Orthogonal initialization with muP-scaled output projections
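A sketch of orthogonal init via QR decomposition; the muP-style downscaling of output projections is shown as 1/sqrt(width), which is an assumption since the exact factor isn't listed:

```python
import numpy as np

def orthogonal_init(shape, is_output_proj: bool = False, rng=None) -> np.ndarray:
    """Draw a (semi-)orthogonal matrix via QR; optionally downscale output
    projections muP-style (the 1/sqrt(width) factor is an assumption)."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs for a uniform distribution
    if is_output_proj:
        q = q / shape[1] ** 0.5
    return q
```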
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
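The listed schedule is trapezoidal: 20 linear warmup steps, a flat plateau, then a 3000-step linear warmdown. A sketch returning the LR multiplier (the total step count is an assumption, not listed in the PR):

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_iters: int = 3000) -> float:
    """Trapezoidal schedule: linear warmup -> flat -> linear warmdown to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```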
Regularization
weight decay
parameters: {"value":0.04}
Other
other
10% magnitude pruning before quantization to create zero runs that compress better
parameters: {"pruning_percentile":10}
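A sketch of the pruning step: zero the smallest-magnitude 10% of weights before quantization, so the packed INT4 stream contains long zero runs for zstd to exploit (the percentile is taken globally here; a per-tensor threshold is equally plausible):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, percentile: float = 10.0) -> np.ndarray:
    """Zero out the smallest-magnitude `percentile`% of entries, leaving the
    rest untouched."""
    thresh = np.percentile(np.abs(w), percentile)
    return np.where(np.abs(w) < thresh, 0.0, w)
```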
Novel Contributions
- Group INT4 nibble-packing applied to both MLP and attention weights, with group size 64 and fp16 scales
- Freed quantization budget to enable 12 transformer layers instead of 10
- U-Net skip connections across the 12-layer model
- 10% magnitude pruning before quantization to improve zstd compression
- BigramHash reduced to 4096 to fit within the 16 MB budget