PR #921 (open)

Record: Order-13 Full-Rescore N-gram + 11L Int6 GPTQ — val_bpb 0.0939 (3-seed mean)

by TimPietrusky
val_bpb
0.0939
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.8MB

Training Techniques

Architecture
Gated Attention
Attention mechanism modified with gating.
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
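One common form of attention gating, sketched below for the record's dim=512: a sigmoid gate computed from the layer input multiplies the attention output elementwise. The exact gate placement in this run is not specified, and `W_g` is an illustrative name.

```python
import numpy as np

def gated_attn_out(attn_out, x, W_g):
    """Sigmoid gate computed from the layer input modulates the attention
    output elementwise (one common gating variant; an assumption here)."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W_g)))   # sigmoid, in (0, 1)
    return gate * attn_out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))             # 4 tokens, dim=512 per the record
attn_out = rng.standard_normal((4, 512))
W_g = 0.02 * rng.standard_normal((512, 512))
y = gated_attn_out(attn_out, x, W_g)
```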
Value Residual
Adds value residual connections and value embeddings in later layers.
parameters: {"value_embedding_layers":[8,9,10]}
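The value-residual idea can be sketched as mixing each later layer's value projection with the first layer's values; in practice the mixing weight is typically learned per layer, and the fixed weight below is only illustrative.

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Mix the current layer's value projection with the first layer's
    values. lam is usually a learned scalar; fixed here to illustrate."""
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(0)
v_first = rng.standard_normal((16, 512))   # values from the first layer
v_layer = rng.standard_normal((16, 512))   # values from a later layer
v = value_residual(v_layer, v_first)
```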
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
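With the record's 8 query heads sharing 4 KV heads, each KV head is repeated heads // kv_heads = 2 times along the head axis before standard multi-head attention, as in this sketch:

```python
import numpy as np

def expand_kv(kv, heads=8, kv_heads=4):
    """GQA: repeat each KV head so that groups of query heads share it.
    (kv has shape (kv_heads, seq, head_dim).)"""
    return np.repeat(kv, heads // kv_heads, axis=0)

k = np.zeros((4, 16, 64))      # (kv_heads, seq, head_dim)
k_full = expand_kv(k)
assert k_full.shape == (8, 16, 64)
```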
Partial RoPE
Uses rotary position embeddings on only part of the head dimension.
parameters: {"dimensions":64}
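A minimal partial-RoPE sketch: rotary embeddings are applied to only the first 64 dimensions of each head (per the record's `dimensions: 64`), with the remaining dimensions passed through untouched. The surrounding head size here is illustrative.

```python
import numpy as np

def partial_rope(x, rot_dims=64, base=10000.0):
    """Rotate the first rot_dims of each head vector; leave the rest as-is."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)          # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.ones((16, 128))    # 16 positions, illustrative head dim 128
y = partial_rope(x)
```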
BigramHash
Hash-based bigram embedding with tied embeddings.
parameters: {"vocab":1024,"dim":256}
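The hash-bigram idea can be sketched as follows: each (previous, current) token pair is hashed into one of 1024 buckets and looked up in a small learned table of dim 256 (sizes per the record). The mixing constants below are illustrative, not the run's actual hash.

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab=1024):
    """Hash each (prev, cur) token pair into a bucket and look up its
    embedding (multiplicative hash constants are illustrative)."""
    prev = np.concatenate([[0], tokens[:-1]])          # previous token, 0-padded
    h = (prev * 2654435761 + tokens * 40503) % vocab   # cheap pair hash
    return table[h]                                    # (T, dim)

rng = np.random.default_rng(0)
table = 0.02 * rng.standard_normal((1024, 256))        # vocab=1024, dim=256
tokens = rng.integers(0, 50257, size=32)
emb = bigram_hash_embed(tokens, table)
```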
weight tying
Input and output embeddings are tied.
parameters: null
LeakyReLU squared
MLP uses a squared LeakyReLU activation.
parameters: {"negative_slope":0.5}
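One plausible reading of "LeakyReLU squared", by analogy with the ReLU² activation common in speedrun-style MLPs: apply LeakyReLU with the record's negative_slope=0.5, then square. The exact handling of the negative branch in this run is an assumption.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU (slope 0.5 per the record) followed by squaring."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

print(leaky_relu_squared(np.array([-2.0, 0.0, 2.0])))  # [1. 0. 4.]
```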
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.997}
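One EMA step with the record's decay of 0.997 is simply:

```python
def ema_update(avg, current, decay=0.997):
    """One EMA step: avg <- 0.997 * avg + 0.003 * current."""
    return decay * avg + (1.0 - decay) * current

avg = 0.0
for _ in range(1000):
    avg = ema_update(avg, 1.0)   # converges toward the (constant) weight
```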
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
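A warmdown schedule in the speedrun style holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps (3500 per the record); base_lr and total_steps below are illustrative.

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    """Constant LR, then a linear 'warmdown' to zero over the last
    warmdown_steps (schedule shape assumed from speedrun convention)."""
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```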
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.05}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: null
scope: null
Compression
lzma
level: 8
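The artifact-compression step amounts to round-tripping the packed weight buffer through `lzma` at preset 8; the buffer below is a stand-in for the actual packed int6 weights.

```python
import lzma
import numpy as np

# Stand-in for packed int6 weights; lzma is lossless, so the round trip
# recovers the bytes exactly.
packed = (np.arange(4096) % 256).astype(np.uint8).tobytes()
blob = lzma.compress(packed, preset=8)
assert lzma.decompress(blob) == packed
```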
Regularization
magnitude pruning
parameters: {"prune_rate":0.05}
logit softcap
parameters: {"value":20}
Other
other
Two-pass order-13 backward-looking n-gram evaluation cache with entropy-adaptive mixing and full-rescore pass.
parameters: {"order":13,"passes":2,"entropy_center":3,"entropy_scale":2}
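The entropy-adaptive mixing can be sketched as a sigmoid gate on the model's predictive entropy: the weight on the n-gram cache grows as the model becomes more uncertain. The record gives entropy_center=3 and entropy_scale=2; the gate's direction and exact functional form are assumptions.

```python
import numpy as np

def entropy_adaptive_mix(p_model, p_cache, center=3.0, scale=2.0):
    """Mix model and n-gram-cache distributions; mixing weight is a sigmoid
    of (entropy - center) * scale (form assumed, parameters per the record)."""
    h = -np.sum(p_model * np.log2(np.maximum(p_model, 1e-12)))  # bits
    lam = 1.0 / (1.0 + np.exp(-(h - center) * scale))           # cache weight
    return (1.0 - lam) * p_model + lam * p_cache

p_model = np.full(16, 1 / 16)      # maximally uncertain: entropy = 4 bits
p_cache = np.eye(16)[3]            # cache is confident about token 3
mixed = entropy_adaptive_mix(p_model, p_cache)
```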

Novel Contributions

  • Two-pass order-13 backward-looking n-gram eval cache
  • Full-rescore pass using the complete cache without additional forward passes
  • Entropy-adaptive mixing between model probabilities and n-gram cache
  • Int6 GPTQ with descending actorder and dead-column handling
  • Pure NumPy vectorized cache implementation with XOR-of-products hashing and np.bincount updates
  • Artifact compression with lzma to fit int6 model within the submission limit
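The XOR-of-products hashing and `np.bincount` update mentioned above can be sketched as follows: each of the `order=13` preceding tokens is multiplied by its own odd constant and the products are XOR-folded into one bucket index per position, all vectorized in NumPy. The multipliers and bucket count here are illustrative, not the run's actual values.

```python
import numpy as np

def context_hash(tokens, order=13, buckets=1 << 20):
    """Vectorized XOR-of-products hash of the `order` preceding tokens
    (illustrative constants; the run's real hash is not specified)."""
    rng = np.random.default_rng(13)
    mults = rng.integers(1, 1 << 31, size=order, dtype=np.int64) | 1  # odd
    h = np.zeros(len(tokens), dtype=np.int64)
    for k in range(1, order + 1):                       # token k positions back
        shifted = np.concatenate([np.zeros(k, dtype=np.int64), tokens[:-k]])
        h ^= shifted * mults[k - 1]
    return h % buckets

tokens = np.arange(64, dtype=np.int64)
idx = context_hash(tokens)
counts = np.bincount(idx, minlength=1 << 20)   # bincount-style cache update
```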