PR #65

RECORD (closed)

Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556

by aquariouseworkman
val_bpb: 1.1556
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.1 MB

Training Techniques

Quantization
mixed int6/int8 STE QAT
bits: 6
scope: all 2D block weights int6; token embeddings int8/fp16 passthrough
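
To make the int6 entry concrete, here is a minimal sketch of the quantize-dequantize round trip that quantization-aware training applies in the forward pass. The per-row scale, the symmetric range of -31..31, and the function name `fake_quant_row` are assumptions for illustration; the PR summary does not state how scales are chosen.

```python
def fake_quant_row(row, bits=6):
    """Quantize one weight row to signed `bits`-bit codes, then dequantize.

    Assumption: symmetric per-row scaling; the record may use another scheme.
    """
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    dequant = [c * scale for c in codes]
    return codes, scale, dequant


row = [0.50, -0.25, 0.031, 0.0]
codes, scale, dequant = fake_quant_row(row)

# Straight-through estimator (STE): the forward pass uses `dequant`, while the
# backward pass treats the rounding step as the identity, so gradients flow to
# the full-precision weights. In an autograd framework this is commonly written
# as w + stop_gradient(dequant - w).
```

Each dequantized weight differs from the original by at most half a quantization step (`scale / 2`), which is the error the model learns to tolerate during training.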
Architecture
SmearGate
Learned per-dimension gate blends current token embedding with previous token embedding before transformer layers.
parameters: {"dim":512}
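
One plausible reading of the SmearGate description is sketched below. The sigmoid parameterization, the convex blend, and the handling of the first position are assumptions; the summary only says that a learned per-dimension gate mixes the current and previous token embeddings.

```python
import math


def smear_gate(embeddings, gate_logits):
    """Blend each token embedding with the previous token's embedding.

    embeddings:  list of T vectors, each of length D
    gate_logits: list of D learned logits, one gate per dimension
    """
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]  # sigmoid, in (0, 1)
    out = []
    for t, cur in enumerate(embeddings):
        # Assumption: the first token has no predecessor and passes through unchanged.
        prev = embeddings[t - 1] if t > 0 else cur
        out.append([(1.0 - g) * c + g * p for g, c, p in zip(gates, cur, prev)])
    return out


x = [[1.0, 2.0], [3.0, 4.0]]
# Gate 0 is half open (logit 0); gate 1 is effectively closed (very negative logit).
y = smear_gate(x, gate_logits=[0.0, -100.0])
```

With a gate near zero a dimension keeps the current token's value; with a gate near one it copies the previous token's value.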
BigramHash
Hash-based bigram embedding over consecutive token pairs to inject token-pair context.
parameters: {"buckets":4096,"dim":128}
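
The bucket lookup behind a hashed bigram embedding could look roughly like this. The hash function, the constant `1_000_003`, and the use of bucket 0 for the first position are illustrative assumptions; only the bucket count (4096) and embedding width (128) come from the listed parameters.

```python
NUM_BUCKETS = 4096   # from the listed parameters
EMBED_DIM = 128      # from the listed parameters (width of each bucket's vector)


def bigram_bucket(prev_token, cur_token, num_buckets=NUM_BUCKETS):
    """Map a (previous, current) token pair to a bucket index.

    The multiply-and-xor hash is a hypothetical choice, not the PR's.
    """
    mixed = (prev_token * 1_000_003) ^ cur_token
    return mixed % num_buckets


def bigram_bucket_ids(tokens, num_buckets=NUM_BUCKETS):
    """Bucket id for every position; position 0 has no predecessor."""
    return [
        0 if t == 0 else bigram_bucket(tokens[t - 1], tokens[t], num_buckets)
        for t in range(len(tokens))
    ]


ids = bigram_bucket_ids([17, 42, 42, 17, 42])
```

Each id would index a 4096 x 128 embedding table. Distinct token pairs can collide in the same bucket, which is the price of keeping the table small.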
MLP3x
Expanded MLP hidden size to 3x model dimension for greater capacity.
parameters: {"multiplier":3,"hidden_dim":1536}
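
A back-of-the-envelope calculation shows how the 3x MLP interacts with weight precision. It assumes a model dimension of 512 (implied by hidden_dim 1536 at multiplier 3) and two bias-free projection matrices per MLP; the byte counts are for raw packed weights before any zstd compression.

```python
model_dim = 512                      # implied by hidden_dim / multiplier
multiplier = 3
hidden_dim = model_dim * multiplier  # 1536

# Assumption: one up-projection and one down-projection, no biases, no gating.
mlp_weights = 2 * model_dim * hidden_dim

bytes_int8 = mlp_weights * 8 // 8    # one byte per weight
bytes_int6 = mlp_weights * 6 // 8    # six bits per weight, tightly packed

print(f"hidden_dim={hidden_dim}, weights per MLP={mlp_weights:,}")
print(f"int8: {bytes_int8:,} bytes, int6: {bytes_int6:,} bytes")
```

Under these assumptions int6 storage takes three quarters of the int8 footprint for the same MLP.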
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
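
The head-sharing pattern implied by 8 query heads and 4 KV heads can be sketched as follows. Contiguous grouping and a model dimension of 512 are assumptions; the summary gives only the two head counts.

```python
NUM_HEADS = 8        # from the listed parameters
NUM_KV_HEADS = 4     # from the listed parameters
MODEL_DIM = 512      # assumption, inferred from the other entries


def kv_head_for(query_head, num_heads=NUM_HEADS, num_kv_heads=NUM_KV_HEADS):
    """Index of the KV head shared by a given query head (contiguous groups)."""
    group_size = num_heads // num_kv_heads
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(NUM_HEADS)]
head_dim = MODEL_DIM // NUM_HEADS          # width of one attention head
kv_proj_width = NUM_KV_HEADS * head_dim    # output width of the K (or V) projection
```

Under these assumptions each KV head serves two query heads, and the K and V projections are half as wide as they would be with one KV head per query head.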
U-Net skip connections
Encoder-decoder style skip connections between corresponding transformer layers.
parameters: {"layers":9}
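
A minimal sketch of how U-Net style skips might be wired across 9 layers is given below. The 4/5 split between encoder and decoder layers, the last-in-first-out pairing, and plain addition are assumptions; implementations often scale each skip by a learned weight.

```python
def skip_pairs(num_layers):
    """(encoder_index, decoder_index) pairs for a LIFO skip pattern."""
    num_enc = num_layers // 2
    return [(num_enc - 1 - i, num_enc + i) for i in range(num_enc)]


def unet_forward(x, layers):
    """Run `layers` in order, adding stored encoder outputs to decoder inputs."""
    num_enc = len(layers) // 2
    skips = []
    for layer in layers[:num_enc]:
        x = layer(x)
        skips.append(x)              # remember each encoder output
    for layer in layers[num_enc:]:
        if skips:
            x = x + skips.pop()      # most recent encoder output first
        x = layer(x)
    return x


pairs = skip_pairs(9)
# Toy check with scalar "layers" that each add 1 to their input.
result = unet_forward(0, [lambda v: v + 1] * 9)
```

With 9 layers this pairs layer 3 with layer 4, layer 2 with layer 5, and so on, leaving the final layer without a skip input.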
Optimizer
Muon
weight_decay: 0.01
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"backend_steps":5}
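
The momentum warmup parameters suggest a schedule like the one sketched here. Linear interpolation is an assumption; the summary lists only the start value, the final momentum, and the number of warmup steps. The `backend_steps` value is presumably the number of Newton-Schulz iterations Muon uses to orthogonalize each update, which this sketch does not model.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given step, ramping linearly from `start` to `end`."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end


schedule = [muon_momentum(s) for s in (0, 750, 1500, 5000)]
```

After step 1500 the momentum stays at 0.99 for the rest of training.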
Initialization
OrthoInit
Orthogonal initialization for non-zero-init linear weights.
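
For illustration, an orthogonal initialization can be built with Gram-Schmidt on random Gaussian vectors, as in this dependency-free sketch. The function name and the seed handling are hypothetical; frameworks usually provide an equivalent routine (for example `torch.nn.init.orthogonal_` in PyTorch), and the summary does not say which one the record uses.

```python
import math
import random


def orthogonal_rows(num_rows, num_cols, seed=0):
    """Return `num_rows` orthonormal vectors of length `num_cols`."""
    assert num_rows <= num_cols, "need at least as many columns as rows"
    rng = random.Random(seed)
    basis = []
    while len(basis) < num_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(num_cols)]
        # Remove the components along every vector already in the basis.
        for b in basis:
            proj = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-8:              # skip the (unlikely) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


W = orthogonal_rows(4, 8)
```

The rows of `W` have unit length and are mutually perpendicular, so the matrix preserves the norm of any vector lying in its row space.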
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1024}
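
The window layout for a stride-64, context-1024 evaluation can be sketched as below. The triple format and the rule that each window scores only tokens not yet scored are assumptions about how the evaluation is organized; the stride and context length come from the listed parameters.

```python
def sliding_windows(num_tokens, context_length=1024, stride=64):
    """Return (window_start, window_end, score_start) triples.

    Tokens in [score_start, window_end) are scored in that window, so every
    token is scored exactly once.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + context_length, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


windows = sliding_windows(2048)
```

After the first window, each later window scores up to 64 new tokens with up to 960 tokens of preceding context. This costs more forward passes than non-overlapping chunks but avoids scoring tokens with almost no context.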
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
linear warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
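
A step-based version of the warmup and warmdown schedule might look like this sketch. The total step count (`total_steps`) is a hypothetical parameter that the summary does not give, and linear decay to zero over the final 3000 iterations is an assumption; the record may instead tie warmdown to a time budget.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps             # linear warmup
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return max(steps_left, 0) / warmdown_iters   # linear warmdown to zero
    return 1.0                                       # constant plateau


# Hypothetical run length of 10,000 steps, chosen only for illustration.
factors = [lr_multiplier(s, total_steps=10_000) for s in (0, 19, 5000, 8500, 10_000)]
```

The multiplier rises to 1.0 within the first 20 steps, holds there, and then falls linearly once fewer than 3000 steps remain.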
Regularization
weight decay
parameters: {"weight_decay":0.01}
Compression
zstd
level: 22

Novel Contributions

  • SmearGate embedding that blends current and previous token embeddings
  • Bigram hash embedding for direct token-pair features
  • Orthogonal weight initialization combined with Muon optimization
  • Mixed int6/int8 quantization-aware training with STE
  • Wider 3x MLP expansion enabled by quantization savings
  • U-Net style skip connections in a transformer
  • Sliding window evaluation with stride 64