PR #802

open

10L + Multi-Order N-gram Backoff (0.9123 BPB)

by Bortlesboat
val_bpb
0.9123
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.63 MB

Training Techniques

Architecture
BigramHash
Hashed n-gram cache / bigram hash feature used in the model.
parameters: {"buckets":4096,"dim":128}
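A minimal sketch of how the hashed bigram feature could work, assuming each (previous, current) token pair is hashed into one of the 4096 buckets, each indexing a learned 128-dim embedding; the specific hash function and the random table here are illustrative stand-ins:

```python
import numpy as np

BUCKETS, DIM = 4096, 128  # from the parameters above
rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)  # learned in practice

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Illustrative multiplicative hash of the (prev, cur) pair into a bucket.
    return ((prev_tok * 1000003 + tok) * 2654435761) % BUCKETS

def bigram_features(tokens: list[int]) -> np.ndarray:
    # One hashed-bigram embedding per position (zeros for the first token).
    feats = np.zeros((len(tokens), DIM), dtype=np.float32)
    for i in range(1, len(tokens)):
        feats[i] = table[bigram_bucket(tokens[i - 1], tokens[i])]
    return feats
```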
SmearGate
Gating mechanism included in the architecture.
parameters: null
Partial RoPE
Rotary positional embeddings applied partially.
parameters: {"fraction":"16/64"}
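A sketch of partial RoPE with the 16/64 fraction above: rotate only the first 16 of 64 head dimensions and pass the remaining 48 through unchanged. The base frequency and pairing convention are assumptions, not taken from the PR:

```python
import numpy as np

HEAD_DIM, ROPE_DIM = 64, 16  # fraction 16/64 from the parameters above

def partial_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    # x: (seq, HEAD_DIM). Rotate the first ROPE_DIM dims, leave the rest as-is.
    seq = x.shape[0]
    half = ROPE_DIM // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), inv_freq)  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROPE_DIM]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROPE_DIM:]], axis=-1)
```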
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
LeakyReLU^2
Squared LeakyReLU activation: LeakyReLU with negative slope 0.5, then squared elementwise.
parameters: {"slope":0.5}
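The activation above is small enough to write out directly; a NumPy sketch:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared elementwise.
    y = np.where(x >= 0, x, slope * x)
    return y * y
```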
Regularization
LN Scale
parameters: null
Quantization
mixed int5/int6
bits: 5/6 (mixed)
scope: MLP and attention
Compression
zstd
level: 22
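A sketch of one plausible quantize/dequantize roundtrip, assuming symmetric per-tensor scaling; the actual bit packing is not given in the PR, and `zlib` stands in here for zstd level 22 (the real pipeline would use the `zstandard` package):

```python
import numpy as np
import zlib  # stand-in for zstd; illustrative only

def quantize(w: np.ndarray, bits: int):
    # Symmetric per-tensor quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1  # int5 -> [-16, 15], values land in [-qmax, qmax]
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The compressed artifact would then be the packed integer tensors plus per-tensor scales, compressed and decompressed losslessly.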
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
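EMA weight averaging with decay 0.997 reduces to one update rule; a minimal sketch over a dict of weights:

```python
def ema_update(ema: dict, params: dict, decay: float = 0.997):
    # Exponential moving average of each weight tensor, updated in place.
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema
```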
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
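A sketch of the warmdown schedule, assuming the common form (constant LR, then linear decay to zero over the final `warmdown_steps`); the base LR and total steps are illustrative:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    # Constant until the final warmdown_steps, then linear decay to zero.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```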
Evaluation
multi-order n-gram backoff
parameters: {"orders":[2,3,4,5,6,7],"highest_matching_order_wins":true,"score_first":true,"min_count":2}
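A minimal sketch of highest-matching-order backoff with the parameters above (orders 2..7, min_count 2); plain dicts stand in for the hashed cache, and the fallback-to-model behavior is represented by returning `None`:

```python
from collections import defaultdict
import numpy as np

ORDERS = [2, 3, 4, 5, 6, 7]  # from the parameters above
MIN_COUNT = 2

counts = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}

def update(tokens: list[int]):
    # Insert all n-grams of every order from a scored segment.
    for n in ORDERS:
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])
            counts[n][ctx][tokens[i]] += 1

def backoff_probs(history: list[int], vocab_size: int):
    # Highest matching order wins: use the longest context whose total
    # count reaches MIN_COUNT; otherwise back off to the next order down.
    for n in sorted(ORDERS, reverse=True):
        ctx = tuple(history[-(n - 1):])
        if len(history) >= n - 1 and ctx in counts[n]:
            bucket = counts[n][ctx]
            total = sum(bucket.values())
            if total >= MIN_COUNT:
                p = np.zeros(vocab_size)
                for tok, c in bucket.items():
                    p[tok] = c / total
                return p
    return None  # no usable order; fall back to the neural model alone
```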
entropy-adaptive alpha
parameters: {"formula":"alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
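The interpolation coefficient follows directly from the formula above: when the model's predictive entropy H is high, alpha grows and more weight shifts to the cache/backoff distribution.

```python
import math

def adaptive_alpha(H: float) -> float:
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), per the formula above.
    # Bounded in (0.05, 0.60): uncertain predictions lean more on the cache.
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))
```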
sliding window eval
parameters: {"stride":64,"batch_seqs":64}
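A sketch of how stride-64 sliding-window evaluation could partition a token stream, assuming the usual scheme where each window scores only its final `stride` tokens (so every scored token sees near-full context); the window length 2048 matches the training length below:

```python
def sliding_window_spans(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    # Returns (begin, end, score_from) triples: the window covers
    # [begin, end) and only positions [score_from, end) are scored.
    spans = [(0, min(seq_len, n_tokens), 0)]  # first window scores everything
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        begin = max(0, new_end - seq_len)
        spans.append((begin, new_end, end))
        end = new_end
    return spans
```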
Test-Time Training
LoRA TTT
parameters: {"rank":8,"targets":["lm_head","Q","V"]}
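A minimal sketch of the rank-8 LoRA adapter applied to a linear layer such as lm_head, Q, or V; the init scales and the NumPy framing are illustrative, not taken from the PR:

```python
import numpy as np

RANK = 8  # from the parameters above

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank delta B @ A.

    During test-time training only A and B are updated per document; the
    base weight never changes, so the adapter is cheap to reset.
    """

    def __init__(self, w: np.ndarray, rank: int = RANK, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w.shape
        self.w = w                                        # frozen base weight
        self.a = rng.standard_normal((rank, d_in)) * 0.01  # small random init
        self.b = np.zeros((d_out, rank))                   # zero init: delta starts at 0

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ (self.w + self.b @ self.a).T
```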
Initialization
orthogonal init
Orthogonal initialization used for the model.
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Score-first policy: the neural / hashed n-gram cache is updated only after each segment has been scored.
parameters: {"cache_orders":[2,3,4,5,6,7]}
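The score-first policy reduces to an ordering constraint in the evaluation loop: each segment is scored against the cache built from earlier segments only, then inserted, so a segment never leaks into its own score. A sketch:

```python
def evaluate_score_first(segments, score_fn, update_fn):
    # Score each segment BEFORE inserting it into the cache.
    total = 0.0
    for seg in segments:
        total += score_fn(seg)  # cache reflects earlier segments only
        update_fn(seg)          # now the segment may influence later scores
    return total
```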

Novel Contributions

  • Multi-order n-gram backoff evaluation with highest-matching-order selection
  • Entropy-adaptive interpolation coefficient for cache/backoff scoring
  • Score-first cache update policy to avoid leakage
  • Hashed n-gram cache across orders 2 through 7
  • Mixed int5/int6 quantization with zstd roundtrip
  • Neural cache evaluation using cosine similarity over cached hidden states
  • Per-document LoRA test-time training on lm_head, Q, and V projections
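The cosine-similarity neural cache in the list above can be sketched as a pointer-style cache: the current hidden state is compared to cached hidden states, and each cached state votes for the token that followed it. The temperature `theta` and the softmax-over-similarities form are assumptions:

```python
import numpy as np

def neural_cache_probs(h: np.ndarray, cached_h: np.ndarray,
                       cached_next: np.ndarray, vocab_size: int,
                       theta: float = 10.0) -> np.ndarray:
    # Cosine similarity of the current hidden state to each cached state;
    # each cached state votes (with softmax weight) for its next token.
    sims = cached_h @ h / (np.linalg.norm(cached_h, axis=1)
                           * np.linalg.norm(h) + 1e-8)
    w = np.exp(theta * sims)
    p = np.zeros(vocab_size)
    np.add.at(p, cached_next, w)  # accumulate votes per token id
    return p / p.sum()
```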