PR #887
Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)
by anthony-maio
val_bpb
0.9642
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.95 MB
Training Techniques
Architecture
LeakyReLU²
Squared LeakyReLU activation (negative slope 0.5) used in the MLP feedforward blocks.
parameters: {"power":2,"slope":0.5}
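The PR records only `power=2` and `slope=0.5`, so the exact functional form is an assumption; a minimal NumPy sketch of a sign-preserving squared LeakyReLU:

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """Squared LeakyReLU: apply LeakyReLU (negative slope 0.5), then square.

    The sign-preserving square on the negative branch is an assumption;
    the PR only gives power=2 and slope=0.5.
    """
    y = np.where(x > 0, x, slope * x)
    return np.sign(y) * y * y  # sign-preserving square keeps the activation monotonic
```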
VRL
Value Residual Learning: each attention layer's values are mixed with a residual of the first layer's values.
parameters: null
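VRL carries no parameters in the PR; a minimal sketch of the usual formulation, blending layer l's value projections with the first layer's values (the 0.5 mixing weight is illustrative, not from the PR):

```python
import numpy as np

def vrl_values(v_layer: np.ndarray, v_first: np.ndarray, lam: float = 0.5) -> np.ndarray:
    # Value Residual Learning: blend this layer's value projections with a
    # residual of the first layer's values before attention is applied.
    # lam may be fixed or learned per layer; 0.5 here is illustrative.
    return lam * v_layer + (1.0 - lam) * v_first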
VE128
Value embedding dimension set to 128.
parameters: {"dimensions":128}
BigramHash
Bigram hash feature with 2048 buckets.
parameters: {"buckets":2048}
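The bigram hash feature maps each (previous, current) token pair into one of 2048 buckets; only the bucket count comes from the PR, and the mixing constants below are illustrative:

```python
def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = 2048) -> int:
    # Hash the token pair into a fixed bucket; each bucket would index a
    # learned feature embedding. Mixing constants are illustrative, not
    # taken from the PR.
    h = (prev_tok * 0x9E3779B1 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```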
XSA
XSA attention variant used in the architecture.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"train_length":16,"eval_length":64}
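Partial RoPE rotates only a subset of each head's dimensions and leaves the rest position-free; the recorded train_length/eval_length (16/64) suggest the rotation is tuned for length extrapolation, but the PR does not say which dimensions are rotated. A minimal sketch with a hypothetical rotary slice of 16 dims:

```python
import numpy as np

def partial_rope(x: np.ndarray, pos: int, rotary_dims: int = 16) -> np.ndarray:
    """Rotate the first `rotary_dims` dims of a head vector by position-
    dependent angles; pass the remaining dims through unchanged.
    The half-split pairing convention is an assumption."""
    half = rotary_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:half], x[half:rotary_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rotary_dims:]])
```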
SmearGate
SmearGate component included in the model.
parameters: null
U-Net skip connections
U-Net style skip connections in the network.
parameters: null
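The U-Net skips carry no parameters in the PR; a minimal sketch assuming the common first-in/last-out pairing between the encoder half and decoder half of the block stack, with additive skips:

```python
def unet_forward(x, encoder_blocks, decoder_blocks):
    # U-Net style skips: stash each encoder block's output and add it back
    # to the input of the matching decoder block (last skip pops first).
    # Additive combination is an assumption; some variants concatenate.
    skips = []
    for block in encoder_blocks:
        x = block(x)
        skips.append(x)
    for block in decoder_blocks:
        x = block(x + skips.pop())
    return x
```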
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.997}
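With decay 0.997, the EMA weights track the training weights over an effective horizon of roughly 1/(1-0.997) ≈ 333 steps; a minimal sketch of the per-step update:

```python
def ema_update(ema, params, decay=0.997):
    # ema <- decay * ema + (1 - decay) * current, applied every step;
    # the EMA copy (not the raw weights) would be used for evaluation.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}
```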
Tight SWA
Stochastic weight averaging over a narrow checkpoint window.
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: all
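GPTQ-lite's exact scheme is not described in the PR; as a baseline intuition, a plain symmetric round-to-nearest int6 quantizer (integer levels in [-31, 31]) looks like the sketch below. GPTQ-style methods improve on this by compensating rounding error column by column.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    # Symmetric per-tensor int6: 6 bits -> integer levels in [-31, 31].
    # Plain round-to-nearest baseline; GPTQ-lite's error compensation and
    # any grouping are not shown here.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```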
Compression
lzma
level: null
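The compression level is not recorded; with Python's stdlib `lzma`, the artifact round-trip is just:

```python
import lzma

def pack(blob: bytes) -> bytes:
    # LZMA-compress the serialized (quantized) weight blob; the preset used
    # by the PR is not recorded, so the stdlib default applies here.
    return lzma.compress(blob)

def unpack(packed: bytes) -> bytes:
    return lzma.decompress(packed)
```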
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Regularization
LN scale
parameters: null
weight decay
parameters: {"value":0.04}
Evaluation
sliding window eval
parameters: null
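Sliding-window eval has no recorded parameters; a sketch of the usual windowing, where each token is scored exactly once with bounded left context (the window/stride values below are illustrative):

```python
def sliding_window_spans(n_tokens: int, window: int = 64, stride: int = 32):
    """Yield (ctx_start, score_start, end) triples: the model reads
    tokens[ctx_start:end] but only tokens[score_start:end] count toward
    val_bpb, so every token is scored exactly once."""
    spans = []
    pos = 0
    while pos < n_tokens:
        if pos == 0:
            end = min(window, n_tokens)
            spans.append((0, 0, end))
        else:
            ctx_start = max(0, pos - (window - stride))
            end = min(pos + stride, n_tokens)
            spans.append((ctx_start, pos, end))
        pos = end
    return spans
```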
Test-Time Training
score-first TTT
parameters: {"ngram_backoff":true,"orders":"2-7"}
Novel Contributions
- Multi-order causal n-gram backoff cache built from already-scored tokens
- Entropy-adaptive mixing between neural predictions and n-gram predictions
- Highest-order-wins backoff across 2-7 gram contexts with min_count gating
- Score-first evaluation compliance with post-token table updates only
- Combination of VRL, LeakyReLU², and compressed GPTQ-lite int6 model artifact
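The backoff and mixing bullets above can be sketched end to end. Everything below is a hypothetical reconstruction from the listed parameters (orders 2-7, a min_count gate, score-first updates); the actual min_count value, mixing cap, and entropy normalization are not given in the PR:

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Multi-order causal n-gram cache built only from already-scored tokens."""

    def __init__(self, orders=range(2, 8), min_count=2):
        self.orders = list(orders)   # 2-gram .. 7-gram contexts
        self.min_count = min_count   # hypothetical gate value
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}
        self.history = []

    def predict(self):
        # Highest-order-wins backoff: try the 7-gram context first and back
        # off toward 2-grams; return None if no context passes min_count,
        # in which case the caller uses the neural prediction alone.
        for n in sorted(self.orders, reverse=True):
            if len(self.history) < n - 1:
                continue
            ctx = tuple(self.history[-(n - 1):])
            table = self.counts[n][ctx]
            total = sum(table.values())
            if total >= self.min_count:
                return {tok: c / total for tok, c in table.items()}
        return None

    def update(self, token):
        # Score-first compliance: called only AFTER `token` has been scored.
        for n in self.orders:
            if len(self.history) >= n - 1:
                ctx = tuple(self.history[-(n - 1):])
                self.counts[n][ctx][token] += 1
        self.history.append(token)


def entropy_mix_weight(neural_probs, max_weight=0.5):
    # Entropy-adaptive mixing (hypothetical form): lean on the n-gram table
    # more when the neural distribution is high-entropy, i.e. uncertain.
    h = -sum(p * math.log(p) for p in neural_probs if p > 0.0)
    h_max = math.log(len(neural_probs))
    return max_weight * (h / h_max if h_max > 0 else 0.0)
```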