PR #659 (closed)
Record: 5-gram Eval Cache + LeakyReLU² + Parallel Muon | val_bpb: 1.0920 (3-seed mean, std 0.0007) | ~15.9 MB | 8×H100 SXM
by deanbrr
val_bpb: 1.0920
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.9 MB
Training Techniques
Architecture
- MLP3x: three-layer MLP block
- LeakyReLU: LeakyReLU(0.5) squared in the MLP (slope=0.5, power=2)
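A minimal sketch of this activation, assuming the listed parameters mean f(x) = (LeakyReLU with slope 0.5)² applied elementwise; whether the sign is preserved after squaring is not stated, so this takes the plain square:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared elementwise.

    Squaring gives a smooth, ReLU^2-like curve while the leaky slope
    keeps gradient signal flowing for negative inputs.
    """
    y = np.where(x >= 0.0, x, slope * x)
    return y * y

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
out = leaky_relu_squared(x)
# elementwise values: 1.0, 0.25, 0.0, 1.0, 4.0
```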
- BigramHash: used in the architecture (size=1536)
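The record only gives the table size (1536). One plausible reading is a hashed bigram-embedding table, sketched below; the hash function, embedding dimension, and how the vector is consumed are all assumptions:

```python
import numpy as np

class BigramHash:
    """Sketch of a hashed bigram-embedding component. Table size 1536 is
    from the listed parameters; dim=16 and the mixing hash are assumptions."""

    def __init__(self, size=1536, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.table = rng.standard_normal((size, dim)).astype(np.float32) * 0.02

    def __call__(self, prev_tok, tok):
        # Mix the (previous, current) token pair into one bucket index.
        h = (prev_tok * 1_000_003 + tok) % self.size
        return self.table[h]  # e.g. added to the regular token embedding

bh = BigramHash()
# the same bigram always maps to the same vector
v1, v2 = bh(17, 42), bh(17, 42)
```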
- XSA: applied to the last 4 layers (layers=4)
- RoPE: partial rotary positional embeddings (16 of 64 dimensions)
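A sketch of partial RoPE on one head vector, rotating only the first 16 of 64 dimensions per the listed parameters; the base frequency of 10000 and the pairing layout are common defaults, assumed here:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of a 64-dim head vector (the
    listed 16/64 split); base frequency and pair layout are assumptions."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair frequencies
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:rot_dims]
    out = x.copy()
    out[:half] = x1 * cos - x2 * sin            # standard 2-D rotation
    out[half:rot_dims] = x1 * sin + x2 * cos
    return out                                  # dims 16..63 pass through

v = np.ones(64)
r = partial_rope(v, pos=5)
```

Rotations are norm-preserving on the rotated slice, and the remaining 48 dimensions carry no positional signal at all.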
- VE128: applied to layers 9-10 (layers=[9,10])
Regularization
- layerwise LN scale: 1/sqrt(layer+1)
Weight Averaging
- EMA + SWA (ema_decay=0.997, swa_interval=50)
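A sketch of tracking both averages, using the listed ema_decay=0.997 and swa_interval=50; scalar weights stand in for tensors, and how the two averages are ultimately combined or selected is not stated in the record:

```python
class WeightAverager:
    """Tracks an exponential moving average (decay 0.997) alongside a
    stochastic weight average snapshotted every 50 steps, per the listed
    parameters; the final combination rule is an assumption."""

    def __init__(self, w0, ema_decay=0.997, swa_interval=50):
        self.ema = w0
        self.ema_decay = ema_decay
        self.swa_interval = swa_interval
        self.swa_sum, self.swa_n = 0.0, 0

    def update(self, w, step):
        self.ema = self.ema_decay * self.ema + (1.0 - self.ema_decay) * w
        if step % self.swa_interval == 0:       # periodic SWA snapshot
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)

wa = WeightAverager(0.0)
for step in range(1, 201):
    wa.update(2.0, step)
# SWA snapshots at steps 50/100/150/200 average to exactly 2.0;
# the EMA lies strictly between the start (0.0) and the target (2.0).
```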
Quantization
- GPTQ-lite (bits=6, scope=all)
Compression
- lzma
Optimizer
- Parallel Muon (weight decay, momentum, and other hyperparameters not specified)
Evaluation
- sliding window eval (stride=128)
- online n-gram cache eval (ngram_max_n=5, confidence_threshold=0.5, min_count=3, ngram_lambda=0.15)
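The online n-gram cache eval can be sketched as below. The parameters (max n = 5, confidence threshold 0.5, min count 3, λ = 0.15) come from the listed config; the counting scheme and the mixing formula are one plausible reading of the description, not the PR's exact code:

```python
import math
from collections import defaultdict

class NgramEvalCache:
    """CPU-side cache of n-gram counts (n up to 5) accumulated only from
    tokens that were already scored, so lookups stay strictly backward-
    looking and cost nothing on the GPU."""

    def __init__(self, max_n=5, min_count=3, conf=0.5, lam=0.15):
        self.max_n, self.min_count, self.conf, self.lam = max_n, min_count, conf, lam
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> next -> count

    def observe(self, history):
        """Record the n-grams ending at the newest (just-scored) token."""
        for n in range(2, self.max_n + 1):
            if len(history) >= n:
                ctx, nxt = tuple(history[-n:-1]), history[-1]
                self.counts[ctx][nxt] += 1

    def adjust_logprob(self, history, next_tok, model_logprob):
        """Mix in the n-gram probability when the cache is confident; the
        max() safety gate guarantees the result never drops below the
        model's own log-probability."""
        for n in range(self.max_n, 1, -1):            # longest context first
            if len(history) < n - 1:
                continue
            ctx = tuple(history[-(n - 1):])
            nexts = self.counts.get(ctx)
            if not nexts:
                continue
            total = sum(nexts.values())
            if total < self.min_count:
                continue
            if max(nexts.values()) / total < self.conf:
                continue                              # cache not confident
            p_ngram = nexts.get(next_tok, 0) / total
            mixed = math.log((1.0 - self.lam) * math.exp(model_logprob)
                             + self.lam * p_ngram)
            return max(model_logprob, mixed)          # safety gate
        return model_logprob

cache = NgramEvalCache()
history = []
for tok in [7, 8, 9] * 4:          # strongly repetitive validation text
    history.append(tok)
    cache.observe(history)

better = cache.adjust_logprob(history, 7, math.log(0.1))  # cache helps
same = cache.adjust_logprob(history, 8, math.log(0.4))    # gate protects
```

Because the gate takes the max of the model's log-probability and the mixed one, the adjustment can only ever improve the bpb on a token, never hurt it.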
Test-Time Training
- TTT disabled
Novel Contributions
- Online 5-gram evaluation cache accumulated from already-scored tokens during sliding-window validation
- Confidence-gated log-sum-exp mixing with a safety gate that can never worsen a prediction
- Strictly backward-looking CPU-only n-gram lookup strategy with zero GPU cost
- Eval-time improvement only, with no training changes to the base model
- Stride-based evaluation configuration tuned to fit within the time budget
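The stride-based evaluation schedule can be sketched as follows; only stride=128 comes from the listed config, while the window size of 512 and the exact scheduling policy are assumptions:

```python
def sliding_windows(n_tokens, window=512, stride=128):
    """Stride-based sliding-window eval schedule: each window scores only
    the tokens not yet scored, so every token is scored exactly once while
    later windows get left context from earlier tokens. Only stride=128
    comes from the listed config; window=512 is an assumption."""
    spans, start, scored = [], 0, 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored))  # context [start,end), score [scored,end)
        scored = end
        start += stride
    return spans

spans = sliding_windows(1000)
# every token in [0, 1000) falls in exactly one scored span
```

A smaller stride gives each scored token more left context at the cost of more forward passes, which is the knob tuned to fit the time budget.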