PR #65
RECORD
closed
Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556
by aquariouseworkman
val_bpb
1.1556
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.1MB
Training Techniques
Quantization
mixed int6/int8 STE QAT
bits: 6
scope: all 2D block weights int6; token embeddings int8/fp16 passthrough
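A minimal sketch of the int6 fake-quantization step, assuming a symmetric per-row scale and a clipping range of [-31, 31]; neither detail is stated in the PR summary. The function names are hypothetical. In the forward pass weights are rounded to 6-bit codes and scaled back; in the backward pass the straight-through estimator (STE) passes the gradient through the rounding unchanged.

```python
def fake_quant_row(row, bits=6):
    """Quantize-dequantize one weight row with a symmetric per-row scale.

    Assumption: per-row scaling and a [-qmax, qmax] range; the PR summary
    only says "int6" for 2D block weights.
    """
    qmax = 2 ** (bits - 1) - 1              # 31 for 6 bits
    amax = max(abs(w) for w in row)
    if amax == 0.0:
        return list(row), [0] * len(row), 1.0
    scale = amax / qmax
    # Round to the nearest integer code and clip to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    dequant = [c * scale for c in codes]
    return dequant, codes, scale


def ste_grad(upstream_grad):
    """Straight-through estimator: treat rounding as identity when backpropagating."""
    return list(upstream_grad)
```

With this scheme the rounding error of any weight in a row is at most half the row's scale.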
Architecture
SmearGate
Learned per-dimension gate blends current token embedding with previous token embedding before transformer layers.
parameters: {"dim":512}
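One way the gate described above could work, as a sketch: a sigmoid over a learned per-dimension logit mixes each token's embedding with its predecessor's. The blend formula, the sigmoid, and the handling of the first token are assumptions, and `smear_gate` is a hypothetical name.

```python
import math


def smear_gate(embeddings, gate_logits):
    """Blend each token embedding with the previous token's embedding.

    embeddings:  list of per-token vectors (lists of floats)
    gate_logits: one learned logit per embedding dimension
    Assumption: out_t = (1 - g) * x_t + g * x_{t-1}, with g = sigmoid(logit).
    """
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    out = []
    for t, cur in enumerate(embeddings):
        # Assumption: the first token has no predecessor, so it blends with itself.
        prev = embeddings[t - 1] if t > 0 else cur
        out.append([(1.0 - g) * c + g * p for g, c, p in zip(gates, cur, prev)])
    return out
```

A logit of 0 gives an even mix of the two embeddings in that dimension, while a strongly negative logit leaves the current token's value almost untouched.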
BigramHash
Hash-based bigram embedding over consecutive token pairs to inject token-pair context.
parameters: {"buckets":4096,"dim":128}
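A sketch of the bucket lookup, assuming a simple multiplicative hash; the actual hash function is not given in the PR summary, and both function names and the constant are illustrative. Each consecutive token pair maps to one of 4096 buckets, and the bucket index would then select a 128-dimensional embedding row.

```python
def bigram_bucket(prev_id, cur_id, buckets=4096):
    """Map a (previous, current) token pair to a bucket index.

    Assumption: the multiplier is an arbitrary illustrative prime.
    """
    return (prev_id * 1000003 + cur_id) % buckets


def bigram_indices(token_ids, buckets=4096):
    """Return one bucket index per position in the sequence."""
    if not token_ids:
        return []
    # Assumption: position 0 has no predecessor and falls back to bucket 0.
    indices = [0]
    for prev_id, cur_id in zip(token_ids, token_ids[1:]):
        indices.append(bigram_bucket(prev_id, cur_id, buckets))
    return indices
```

Distinct pairs can collide in the same bucket; that is the trade-off for keeping the table at a fixed size.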
MLP3x
Expanded MLP hidden size to 3x model dimension for greater capacity.
parameters: {"multiplier":3,"hidden_dim":1536}
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention: the key/value projections use fewer heads than the query projections, so each KV head is shared by a group of query heads.
parameters: {"heads":8,"kv_heads":4}
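With 8 query heads and 4 KV heads, each KV head serves 2 query heads. A small sketch of the mapping, assuming the common contiguous grouping (`kv_head_for` is a hypothetical helper):

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Return the KV head index shared by the given query head."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads      # 2 query heads per KV head here
    return query_head // group_size
```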
U-Net skip connections
Encoder-decoder style skip connections between corresponding transformer layers.
parameters: {"layers":9}
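The summary does not say which layers are paired. One plausible wiring, sketched here as an assumption, splits the 9 layers into 4 "encoder" and 5 "decoder" layers and connects them in mirrored order, leaving the last layer without a skip. `unet_skip_pairs` is a hypothetical helper.

```python
def unet_skip_pairs(num_layers=9):
    """List (source_layer, target_layer) skip connections, mirrored around the middle.

    Assumption: the first num_layers // 2 layers act as the encoder half.
    """
    n_enc = num_layers // 2
    n_dec = num_layers - n_enc
    pairs = []
    for i in range(min(n_enc, n_dec)):
        # The deepest encoder output feeds the first decoder layer, and so on outward.
        pairs.append((n_enc - 1 - i, n_enc + i))
    return pairs
```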
Optimizer
Muon
weight_decay: 0.01
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"backend_steps":5}
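The momentum warmup values above could be applied as follows; linear interpolation is an assumption, since the summary only lists the start value, the final momentum, and the step count. `muon_momentum` is a hypothetical name.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Ramp momentum from `start` to `end` over `warmup_steps`, then hold."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end
```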
Initialization
OrthoInit
Orthogonal initialization for linear weights that are not zero-initialized.
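To illustrate what an orthogonal initialization produces, here is a dependency-free sketch using modified Gram-Schmidt on random Gaussian rows. It is not the PR's code; in PyTorch, `torch.nn.init.orthogonal_` serves this purpose. `orthogonal_rows` is a hypothetical helper.

```python
import random


def orthogonal_rows(rows, cols, seed=0):
    """Return `rows` orthonormal vectors of length `cols` (requires rows <= cols)."""
    if rows > cols:
        raise ValueError("rows must not exceed cols")
    rng = random.Random(seed)
    basis = []
    while len(basis) < rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        # Remove the components along every vector already in the basis.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:                  # skip the rare degenerate draw
            basis.append([x / norm for x in v])
    return basis
```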
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1024}
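A sketch of how stride-64 sliding-window evaluation could lay out its windows, assuming the first window scores every position and each later window scores only its newest tokens, with up to 1024 tokens of context. The exact scoring rule is not given in the summary, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(num_tokens, context_length=1024, stride=64):
    """Return (start, end, score_from) triples covering every token exactly once.

    The model sees tokens [start, end) and the loss counts only [score_from, end).
    """
    if not 0 < stride <= context_length:
        raise ValueError("stride must be in (0, context_length]")
    windows = []
    scored_upto = 0
    while scored_upto < num_tokens:
        if scored_upto == 0:
            end = min(num_tokens, context_length)        # first window: score it all
        else:
            end = min(num_tokens, scored_upto + stride)  # later windows: newest tokens only
        start = max(0, end - context_length)
        windows.append((start, end, scored_upto))
        scored_upto = end
    return windows
```

For a 1200-token stream this yields four windows, and every scored token after the first window has at least 960 tokens of preceding context.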
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
linear warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
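One possible reading of these parameters, as a sketch: the learning-rate multiplier rises linearly over the first 20 steps, stays at 1.0, and decays linearly to 0 over the final 3000 iterations. The total step count is not given in the summary, so `total_steps` is a hypothetical argument, as is the name `lr_scale`.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Return the learning-rate multiplier for a given step (assumed schedule shape)."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear warmup
    remaining = total_steps - step
    if remaining < warmdown_iters:
        return max(remaining, 0) / warmdown_iters  # linear warmdown to zero
    return 1.0
```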
Regularization
weight decay
parameters: {"weight_decay":0.01}
Compression
zstd
level: 22
Novel Contributions
- SmearGate embedding that blends current and previous token embeddings
- Bigram hash embedding for direct token-pair features
- Orthogonal weight initialization combined with Muon optimization
- Mixed int6/int8 quantization-aware training with STE
- Wider 3x MLP expansion enabled by quantization savings
- U-Net-style skip connections in a transformer
- Sliding window evaluation with stride 64