PR #700
openRecord Submission: 1.0541 BPB - 5-expert Hedge Mixer + CROWN-Q + stride=64
by RoyiRaView on GitHub
val_bpb
1.0541
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.7 MB
Training Techniques
Quantization
GPTQ
bits: 5
scope: all
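GPTQ proper compensates rounding error column-by-column using second-order (Hessian) information; as a minimal illustration of just the 5-bit grid it targets, here is a plain round-to-nearest symmetric quantizer (a sketch, not GPTQ itself):

```python
def quantize_rtn(weights, bits=5):
    """Round-to-nearest symmetric quantization onto a `bits`-bit signed grid.

    Note: this is plain RTN, not GPTQ's Hessian-based error compensation;
    it only illustrates the 5-bit integer range [-16, 15].
    """
    qmax = 2 ** (bits - 1) - 1                 # 15 for 5-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dq = [qi * scale for qi in q]              # dequantized values
    return q, dq, scale
```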
Compression
zstd
level: 22
Architecture
XSA
Applied XSA across all 11 layers with window size 8.
parameters: {"layers":11,"ws":8}
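The PR does not expand the XSA acronym; assuming it is a sliding-window (local causal) attention variant, the ws=8 parameter would bound how far back each query can look. A minimal mask builder under that assumption:

```python
def local_causal_mask(seq_len, window=8):
    """mask[i][j] is True where query i may attend key j:
    causal (j <= i) and within the local window (i - j < window)."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]
```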
BigramHash
Added BigramHash feature with dimension 128.
parameters: {"size":6144,"dim":128}
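A bigram-hash feature typically hashes each (previous, current) token pair into a fixed table, here 6144 buckets, and looks up a learned 128-d embedding per bucket. A sketch of the bucketing step (function name and BOS id are hypothetical; a real run would use a seed-stable hash):

```python
def bigram_buckets(tokens, size=6144):
    """Map each position's (prev, cur) token pair to a bucket id.

    The model would index a learned `size` x 128 embedding table with
    these ids. Uses Python's built-in hash, which is stable for int
    tuples; string keys would need a seeded stable hash instead.
    """
    ids = []
    prev = 0  # assumed BOS token id
    for t in tokens:
        ids.append(hash((prev, t)) % size)
        prev = t
    return ids
```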
MLP3x
Used a widened MLP with LeakyReLU activations.
parameters: {"multiplier":3.5}
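With hidden size 512 and a 3.5x multiplier, the MLP inner width works out to 1792. A small sketch of the width arithmetic and the LeakyReLU nonlinearity (slope 0.01 is an assumed default, not stated in the PR):

```python
def mlp_inner_width(hidden_size=512, multiplier=3.5):
    """Inner dimension of the widened MLP: 512 * 3.5 = 1792."""
    return int(hidden_size * multiplier)

def leaky_relu(x, slope=0.01):
    """LeakyReLU: identity for x >= 0, small negative slope otherwise."""
    return x if x >= 0 else slope * x
```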
KV head count
Used 8 attention heads and 8 KV heads (equal counts, i.e. standard multi-head attention with no KV grouping; head dim 64) in an 11-layer, 512d model.
parameters: {"layers":11,"hidden_size":512,"heads":8,"kv_heads":8}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
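Polyak (exponential moving) averaging keeps a shadow copy of the weights updated as `avg <- decay * avg + (1 - decay) * params`, with decay 0.998 here; evaluation uses the averaged copy. A one-line sketch:

```python
def polyak_update(avg, params, decay=0.998):
    """One EMA step over flat weight lists:
    avg <- decay * avg + (1 - decay) * params, elementwise."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]
```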
Evaluation
stride-based sliding window eval
parameters: {"stride":64}
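In stride-based sliding-window evaluation, each forward pass advances by `stride` tokens and scores only the tokens not yet covered, so doubling the stride from 32 to 64 roughly halves the number of forward passes. A sketch of the window enumeration (context length 512 is an assumed value for illustration):

```python
def eval_windows(n_tokens, context=512, stride=64):
    """Enumerate (begin, end, n_scored) eval windows.

    Each window scores only tokens past the previous window's end,
    so every token is scored exactly once.
    """
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```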
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.0001,"freeze_blocks":2,"chunk_tokens":131072}
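"Score-first" TTT presumably means each 131072-token chunk is scored with the current weights before the model adapts on it for 4 epochs, so the evaluation is never contaminated by training on the same tokens. A control-flow sketch with hypothetical `score`/`adapt` callables standing in for the model:

```python
def score_first_ttt(chunks, score, adapt, epochs=4):
    """Score each chunk before adapting on it (assumed TTT ordering).

    `score(chunk)` returns the chunk's loss under current weights;
    `adapt(chunk)` runs one training epoch on the chunk. Both are
    hypothetical stand-ins for the real model.
    """
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # evaluate first, untouched weights
        for _ in range(epochs):
            adapt(chunk)              # then train on what was just scored
    return losses
```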
LR Schedule
warmdown
parameters: {"targets_seconds":582}
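The `targets_seconds` parameter suggests a time-based warmdown: full learning rate until the final 582 seconds of the run, then a decay to zero. Assuming the decay is linear (the PR does not state the shape), a sketch:

```python
def warmdown_lr(base_lr, seconds_left, warmdown_seconds=582):
    """Time-based warmdown (assumed linear shape): full LR until the
    final `warmdown_seconds`, then decay linearly to zero."""
    if seconds_left >= warmdown_seconds:
        return base_lr
    return base_lr * max(0.0, seconds_left / warmdown_seconds)
```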
Regularization
magnitude pruning
parameters: {"sparsity":0.03}
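Magnitude pruning at 3% sparsity zeroes the smallest-magnitude 3% of weights. A minimal global (per-tensor) sketch; note that ties at the threshold may prune slightly more than the target fraction:

```python
def magnitude_prune(weights, sparsity=0.03):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```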
Other
other
CROWN-Q training-time quantization-aware penalty during warmdown to reduce quantization sensitivity.
parameters: {"lambda":0.01}
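CROWN-Q is this PR's own technique; assuming the penalty takes the standard QAT-regularizer form — lambda times the squared distance between each weight and its quantized value, pulling weights toward the 5-bit grid before GPTQ is applied — a sketch:

```python
def crownq_penalty(weights, scale, lam=0.01, bits=5):
    """Assumed CROWN-Q form: lam * sum((w - dequant(quant(w)))^2).

    Weights already on the `bits`-bit grid contribute zero, so the
    loss term pushes the model toward quantization-robust weights.
    """
    qmax = 2 ** (bits - 1) - 1
    penalty = 0.0
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        penalty += (w - q * scale) ** 2
    return lam * penalty
```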
other
5-expert Hedge mixer combining neural, unigram, bigram, trigram, and entropy experts.
parameters: {"experts":5}
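The Hedge algorithm (multiplicative weights) keeps one weight per expert, mixes the experts' next-token distributions in proportion to those weights, and multiplicatively down-weights each expert by its loss on the observed symbol. A sketch of both steps (function names are illustrative; with eta=1 the update reduces to a Bayesian mixture):

```python
def hedge_mix(expert_probs, weights):
    """Weighted mixture of expert probability lists (one list per expert)."""
    total = sum(weights)
    n_sym = len(expert_probs[0])
    return [sum(w / total * p[i] for w, p in zip(weights, expert_probs))
            for i in range(n_sym)]

def hedge_update(weights, expert_probs, observed, eta=1.0):
    """Multiplicative Hedge update: w_k *= exp(-eta * logloss_k),
    which simplifies to w_k *= p_k(observed) ** eta."""
    return [w * max(p[observed], 1e-12) ** eta
            for w, p in zip(weights, expert_probs)]
```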
Novel Contributions
- Added a CROWN-Q quantization-aware training penalty during warmdown to improve quantization robustness.
- Increased evaluation stride from 32 to 64 to halve eval cost while preserving BPB quality.
- Used the saved evaluation time to increase test-time training from 3 to 4 epochs per chunk.
- Combined a 5-expert Hedge mixer with GPTQ int5 compression and CROWN-Q for a new record score.