PR #871
Non-record (WIP): Multi-Order N-gram Backoff — val_bpb=0.8004 (1xH100 proxy)
by greqone
val_bpb: 0.8004
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.18 MB
Training Techniques
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
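A minimal sketch of the grouped-query attention layout above (8 query heads sharing 4 KV heads, so each group of 2 query heads reads one KV head). Shapes, the head dimension, and the causal-mask convention are illustrative assumptions, not the PR's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each group of n_heads // n_kv_heads query
    heads shares one KV head. q: (T, n_heads, d); k, v: (T, n_kv_heads, d)."""
    T, _, d = q.shape
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)                 # (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)      # mask future positions
    w = softmax(scores, axis=-1)
    return np.einsum('hqk,khd->qhd', w, v)          # (T, n_heads, d)
```

Halving the KV heads halves the KV-cache size while leaving the number of query heads (and most of the compute) unchanged.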
BigramHash
Hashed bigram embedding component.
parameters: {"buckets":4096,"dim":128}
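A sketch of how a hashed bigram embedding with 4096 buckets and dim 128 could work: the (previous token, current token) pair is hashed into a bucket, and that bucket's learned vector is added alongside the ordinary token embedding. The hash multiplier and class names here are hypothetical.

```python
import numpy as np

def bigram_hash(prev_tok, tok, buckets=4096):
    # Hash the (previous, current) token pair into one of `buckets` ids.
    # 1000003 is an arbitrary odd multiplier chosen for this sketch.
    return (prev_tok * 1000003 + tok) % buckets

class BigramEmbedding:
    """Looks up a 128-dim embedding for each hashed bigram in a sequence."""
    def __init__(self, buckets=4096, dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((buckets, dim)) * 0.02

    def __call__(self, tokens):
        tokens = np.asarray(tokens)
        prev = np.concatenate([[0], tokens[:-1]])   # shift; first pair uses 0
        ids = bigram_hash(prev, tokens, self.table.shape[0])
        return self.table[ids]                      # (T, dim)
```

Hashing trades exact bigram identity for a fixed 4096 x 128 table, so the component stays small regardless of vocabulary size.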
SmearGate
SmearGate gating mechanism.
parameters: null
Value Residual
Value residual pathway in the attention stack.
parameters: null
Gated Attention
Attention mechanism with gating.
parameters: null
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"train":16,"total":64}
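One plausible reading of partial RoPE with train=16 of total=64: only the first 16 dimensions of each head are rotated, the remaining 48 pass through unchanged. The frequency convention (standard base-10000 inverse frequencies) is an assumption.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head's dimensions (here 16 of 64); the rest are left unrotated.
    x: (T, n_heads, head_dim)."""
    T, H, D = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    ang = np.arange(T)[:, None] * inv_freq[None, :]     # (T, half)
    cos = np.cos(ang)[:, None, :]                       # broadcast over heads
    sin = np.sin(ang)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```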
LN Scale
LayerNorm scale modification.
parameters: null
U-Net skip connections
U-Net style skip connections in the model.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
LeakyReLU
MLP uses a squared LeakyReLU activation (negative slope 0.5) with a 3x hidden multiplier.
parameters: {"multiplier":"3x","squared":true,"slope":0.5}
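A sketch of one reading of "squared LeakyReLU" with slope 0.5: apply the leaky nonlinearity, then square while preserving sign (a plain square would discard the sign of the leaky branch). Whether the PR squares with or without sign preservation is an assumption.

```python
import numpy as np

def squared_leaky_relu(x, slope=0.5):
    # LeakyReLU with negative slope 0.5 ...
    y = np.where(x >= 0, x, slope * x)
    # ... followed by a sign-preserving square (assumed variant).
    return y * np.abs(y)
```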
Regularization
logit softcap
parameters: {"value":30}
magnitude pruning
parameters: {"sparsity":"3%"}
Quantization
mixed int5/int6
bits: null
scope: MLP/attn
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"lr":0.03}
Weight Averaging
EMA
parameters: {"decay":0.997}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Sequence Length
sequence_length
train_length: null
eval_length: null
Evaluation
score-first n-gram backoff
parameters: {"orders":"2-7","entropy_adaptive_alpha":true,"min_count":2,"hash_buckets":4000000}
Novel Contributions
- Multi-order backward-looking n-gram backoff evaluation cache
- Entropy-adaptive alpha for mixing model and n-gram scores
- Score-first evaluation that stays legal: the cache is updated only after each token has been scored, so a token never influences its own score
- Highest-matching-order backoff from 7-gram to bigram
- Proxy-validated 1xH100 run showing 0.8004 val_bpb
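The contributions above can be sketched as a single cache class: counts for orders 2..7 are hashed into 4M buckets, scoring backs off from the highest matching order, the mixing weight alpha adapts to the model's entropy, and the cache is updated only after scoring (score-first). Class and method names and the exact alpha schedule are hypothetical, not taken from the PR.

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Backward-looking n-gram evaluation cache, orders 2..7, hashed into
    a fixed number of buckets. score() mixes model and n-gram probabilities
    at the highest order whose context has enough counts; update() runs
    only AFTER scoring, so the current token never sees itself."""
    def __init__(self, orders=range(2, 8), buckets=4_000_000, min_count=2):
        self.orders = sorted(orders, reverse=True)   # try 7-gram first
        self.buckets = buckets
        self.min_count = min_count
        self.counts = defaultdict(lambda: defaultdict(int))

    def _bucket(self, ctx):
        return hash(ctx) % self.buckets

    def score(self, history, token, model_probs):
        for n in self.orders:                        # highest matching order wins
            if len(history) < n - 1:
                continue
            dist = self.counts[self._bucket(tuple(history[-(n - 1):]))]
            total = sum(dist.values())
            if total >= self.min_count:
                ngram_p = dist.get(token, 0) / total
                # Entropy-adaptive alpha (hypothetical schedule): lean on
                # the n-gram cache more when the model is uncertain.
                ent = -sum(p * math.log(p) for p in model_probs if p > 0)
                alpha = ent / (ent + 1.0)
                return alpha * ngram_p + (1 - alpha) * model_probs[token]
        return model_probs[token]                    # no match: pure model

    def update(self, history, token):
        # Called after score(), per the score-first rule.
        for n in self.orders:
            if len(history) >= n - 1:
                self.counts[self._bucket(tuple(history[-(n - 1):]))][token] += 1
```

The evaluation loop would call `score(history, token, probs)` first, then `update(history, token)`, then append the token to the history.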