PR #893
openRecord: Two-Pass Order-12 N-gram Backoff + Parallel Muon — val_bpb 0.1310 (3-seed)
by aryanbhosale
val_bpb
0.1310
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.85 MB
Training Techniques
Evaluation
sliding window eval
parameters: {"stride":64}
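One way to lay out the stride-64 sliding-window evaluation is sketched below; `sliding_windows` and its signature are illustrative, not the PR's code. Each window scores only its trailing `stride` tokens, so every token after the first window is predicted with (near-)full left context:

```python
# Hypothetical sketch of sliding-window eval spans. Each triple is
# (window_start, window_end, score_from): the model sees tokens
# [window_start, window_end) but only tokens [score_from, window_end)
# contribute to the bpb total, so no token is scored twice.

def sliding_windows(n_tokens, window=65536, stride=64):
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)  # keep window size fixed
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans

# small numbers for illustration; the record uses window=65536, stride=64
spans = sliding_windows(200, window=128, stride=64)
```

Summing `end - score_from` over all spans recovers exactly `n_tokens`, i.e. full coverage with no overlap in scoring.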
Test-Time Training
score-first TTT
parameters: {"passes":2,"cache_orders":"2-12","cold_cache_chunks":50}
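A minimal sketch of the score-first discipline behind this TTT setup, assuming a hash-map backoff cache over orders 2-12 (class and method names are illustrative): tokens are scored against the cache *before* they are inserted, so the cache is strictly backward-looking and no token conditions on itself.

```python
from collections import defaultdict

class NGramCache:
    """Backoff N-gram cache over orders 2..12 (illustrative sketch)."""

    def __init__(self, orders=range(2, 13)):
        self.orders = list(orders)
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def predict(self, context):
        # back off from the longest matching order to the shortest
        for n in sorted(self.orders, reverse=True):
            ctx = tuple(context[-(n - 1):])
            if len(ctx) == n - 1 and ctx in self.counts[n]:
                dist = self.counts[n][ctx]
                total = sum(dist.values())
                return {tok: c / total for tok, c in dist.items()}
        return None  # no match at any order: fall back to the model alone

    def update(self, tokens):
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
                self.counts[n][ctx][nxt] += 1

cache = NGramCache()
chunk = [1, 2, 3, 1, 2, 3]
p_cold = cache.predict(chunk)   # pass 1 scores against the cache as it stands
cache.update(chunk)             # tokens enter the cache only after scoring
p_warm = cache.predict([1, 2])  # a later context now backs off into order 3
```

On a cold cache `predict` returns nothing, which matches the `cold_cache_chunks` idea of early chunks contributing little cache signal.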
Architecture
Parallel Muon
Parallel Muon optimizer with parameter banking and batched Newton-Schulz orthogonalization.
parameters: {"layers":11,"dimensions":512}
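The core of Muon is a Newton-Schulz iteration that approximately orthogonalizes each gradient matrix; "parameter banking" here is read as stacking same-shape matrices so one batched iteration serves many layers at once. The sketch below uses the quintic coefficients from Keller Jordan's Muon reference implementation but is otherwise illustrative, not the PR's code.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize the last two dims of a banked stack."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from Muon
    X = G / (np.linalg.norm(G, axis=(-2, -1), keepdims=True) + eps)
    tall = G.shape[-2] > G.shape[-1]
    if tall:                            # iterate on the smaller Gram matrix
        X = X.swapaxes(-2, -1)
    for _ in range(steps):
        A = X @ X.swapaxes(-2, -1)
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.swapaxes(-2, -1) if tall else X

# bank all 11 layers' (64, 64) gradient blocks into one batched call
bank = np.random.default_rng(0).normal(size=(11, 64, 64))
O = newton_schulz(bank)  # each O[i] is approximately orthogonal
```

Batching over the leading axis is what makes the "parallel" part cheap: one fused matmul sequence instead of eleven small ones.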
BigramHash
Bigram hash feature module.
parameters: {"size":1024}
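A hypothetical sketch of what a size-1024 bigram hash feature looks like: each (previous, current) token pair is hashed into one of 1024 buckets indexing a small learned table, and the looked-up vector is added to the regular token embedding. The mixing constants and table shape are assumptions.

```python
import numpy as np

def bigram_bucket(prev_tok, cur_tok, size=1024):
    """Hash a token pair into one of `size` buckets (constants illustrative)."""
    h = (prev_tok * 1_000_003 + cur_tok * 8191) & 0xFFFFFFFF
    return h % size

rng = np.random.default_rng(0)
table = rng.normal(0.0, 0.02, size=(1024, 512))  # 1024 buckets, d_model=512

tokens = [5, 17, 5, 17, 99]
buckets = [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
feats = table[buckets]  # one extra feature vector per position >= 1
```

Collisions are tolerated by design: with only 1024 buckets the table stays tiny, which matters for the ~15.85 MB artifact budget.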
Gated Attention
Attention mechanism with gating.
parameters: null
Value Residual
Residual value pathway in the model.
parameters: null
XSA
XSA4 attention/sequence module.
parameters: {"variant":"XSA4"}
SmearGate
SmearGate component used in the architecture.
parameters: null
U-Net skip connections
U-Net style skip connections.
parameters: null
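One common transformer reading of U-Net skips (the exact wiring here is an assumption): activations from the first half of the layer stack are saved and added back, in mirrored order, to the second half.

```python
def unet_forward(x, layers):
    """Run a layer stack with U-Net style mirrored skip connections."""
    half = len(layers) // 2
    saved = []
    for i, layer in enumerate(layers):
        if i < half:
            saved.append(x)        # stash encoder-half input
        elif saved:
            x = x + saved.pop()    # add skip from the mirrored layer
        x = layer(x)
    return x

# toy demo: four "layers" that each double their input
out = unet_forward(1.0, [lambda v: v * 2] * 4)
```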
Partial RoPE
Partial rotary positional embeddings, applied to 16 of 64 head dimensions.
parameters: {"16/64":true}
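Reading the 16/64 parameter as "rotate the first 16 of 64 head dims, pass the rest through", a minimal sketch looks like this (the frequency schedule is an assumed standard form, not necessarily the PR's):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first rot_dims of x's last axis."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # assumed frequency schedule
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)

v = np.ones(64)
out0 = partial_rope(v, pos=0)  # rotation by zero leaves the input unchanged
out5 = partial_rope(v, pos=5)  # rotations preserve the vector's norm
```

Leaving 48 of 64 dims unrotated keeps a position-independent channel alongside the positional one.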
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: {"mlp_multiplier":"3x","power":2,"slope":0.5}
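With slope 0.5 and power 2, one natural reading of "LeakyReLU squared" is a sign-preserving square of the leaky output (the sign handling is an assumption; a ReLU²-style variant would drop it):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5, power=2):
    """LeakyReLU followed by a sign-preserving power (assumed form)."""
    y = np.where(x >= 0, x, slope * x)
    return np.sign(y) * np.abs(y) ** power

out = leaky_relu_sq(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
# -> [-1.0, -0.25, 0.0, 1.0, 4.0]
```

The 3x multiplier means the hidden width is 3 * d_model (1536 at d_model 512), versus the conventional 4x.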
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
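The EMA half of the averaging is a one-line update with the record's decay of 0.997; SWA additionally keeps a uniform running mean over checkpoints. Both helpers below are sketches over plain parameter dicts.

```python
def ema_update(avg, new, decay=0.997):
    """Exponential moving average of parameters (decay from the record)."""
    return {k: decay * avg[k] + (1 - decay) * new[k] for k in avg}

def swa_update(mean, new, n):
    """Uniform running mean after n previously averaged checkpoints (SWA)."""
    return {k: (mean[k] * n + new[k]) / (n + 1) for k in mean}

ema = {"w": 1.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 0.0})  # ema["w"] decays toward 0.997**3

swa = {"w": 0.0}
swa = swa_update(swa, {"w": 1.0}, 1)   # mean of the two checkpoints
```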
Quantization
GPTQ-lite
bits: 6
scope: model weights
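This is not the PR's GPTQ-lite, just the basic shape of 6-bit symmetric per-row quantization it builds on: levels in [-31, 31] with one scale per output row. GPTQ proper additionally propagates each quantized column's error into the remaining weights.

```python
import numpy as np

def quantize_rows(W, bits=6):
    """Symmetric per-row quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

W = np.random.default_rng(0).normal(size=(8, 16)).astype(np.float32)
q, s = quantize_rows(W)
W_hat = q * s   # dequantized weights; per-element error <= scale / 2
```

The quantized integers plus per-row scales are what would then be zstd-compressed into the ~15.85 MB artifact.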
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: null
eval_length: 65536
Novel Contributions
- Two-pass evaluation with order-12 N-gram backoff rescoring
- Entropy-adaptive alpha blending for N-gram/model interpolation
- Backward-looking N-gram cache updated only after scoring
- Parallel Muon optimization with parameter banking
- Large hash-based N-gram cache over validation tokens
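The entropy-adaptive blending above can be sketched as follows: the more uncertain the base model is, the more weight the N-gram distribution receives. The entropy-to-alpha mapping shown is an assumed form, not the PR's exact schedule.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def blend(p_model, p_ngram, alpha_max=0.5, h_scale=1.0):
    """Interpolate model and N-gram distributions with entropy-driven alpha."""
    h = entropy(p_model)
    alpha = alpha_max * (1 - math.exp(-h / h_scale))  # assumed schedule
    return [(1 - alpha) * pm + alpha * pn for pm, pn in zip(p_model, p_ngram)]

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
ngram = [0.0, 1.0, 0.0, 0.0]
p1 = blend(confident, ngram)  # low entropy: N-gram barely moves the model
p2 = blend(uncertain, ngram)  # high entropy: N-gram pulls much harder
```

Because alpha interpolates two normalized distributions, the blend stays a valid distribution without renormalization.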