PR #802

open

10L + Multi-Order N-gram Backoff (0.9123 BPB)

by Bortlesboat
val_bpb
0.9123
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.63 MB

Training Techniques

Architecture
BigramHash
Hashed n-gram cache / bigram hash feature used in the model.
parameters: {"buckets":4096,"dim":128}
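A minimal sketch of how the hashed bigram feature could work, assuming each (previous, current) token pair is hashed into one of the 4096 buckets, each indexing a learned 128-dim embedding; the specific hash function and the random table here are illustrative stand-ins:

```python
import numpy as np

BUCKETS, DIM = 4096, 128  # from the parameters above
rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)  # learned in practice

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Illustrative multiplicative hash of the (prev, cur) pair into a bucket.
    return ((prev_tok * 1000003 + tok) * 2654435761) % BUCKETS

def bigram_features(tokens: list[int]) -> np.ndarray:
    # One hashed-bigram embedding per position (zeros for the first token).
    feats = np.zeros((len(tokens), DIM), dtype=np.float32)
    for i in range(1, len(tokens)):
        feats[i] = table[bigram_bucket(tokens[i - 1], tokens[i])]
    return feats
```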
SmearGate
Gating mechanism included in the architecture.
parameters: null
Partial RoPE
Rotary positional embeddings applied partially.
parameters: {"fraction":"16/64"}
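A sketch of partial RoPE with the 16/64 fraction above: rotate only the first 16 of 64 head dimensions and pass the remaining 48 through unchanged. The base frequency and pairing convention are assumptions, not taken from the PR:

```python
import numpy as np

HEAD_DIM, ROPE_DIM = 64, 16  # fraction 16/64 from the parameters above

def partial_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    # x: (seq, HEAD_DIM). Rotate the first ROPE_DIM dims, leave the rest as-is.
    seq = x.shape[0]
    half = ROPE_DIM // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), inv_freq)  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROPE_DIM]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROPE_DIM:]], axis=-1)
```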
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
LeakyReLU^2
Squared LeakyReLU activation: LeakyReLU with negative slope 0.5, then squared elementwise.
parameters: {"slope":0.5}
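The activation above is small enough to write out directly; a NumPy sketch:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared elementwise.
    y = np.where(x >= 0, x, slope * x)
    return y * y
```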
Regularization
LN Scale
parameters: null
Quantization
mixed int5/int6
bits: 5/6 (mixed)
scope: MLP and attention
Compression
zstd
level: 22
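A sketch of one plausible quantize/dequantize roundtrip, assuming symmetric per-tensor scaling; the actual bit packing is not given in the PR, and `zlib` stands in here for zstd level 22 (the real pipeline would use the `zstandard` package):

```python
import numpy as np
import zlib  # stand-in for zstd; illustrative only

def quantize(w: np.ndarray, bits: int):
    # Symmetric per-tensor quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1  # int5 -> [-16, 15], values land in [-qmax, qmax]
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The compressed artifact would then be the packed integer tensors plus per-tensor scales, compressed and decompressed losslessly.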
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
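EMA weight averaging with decay 0.997 reduces to one update rule; a minimal sketch over a dict of weights:

```python
def ema_update(ema: dict, params: dict, decay: float = 0.997):
    # Exponential moving average of each weight tensor, updated in place.
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema
```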
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
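A sketch of the warmdown schedule, assuming the common form (constant LR, then linear decay to zero over the final `warmdown_steps`); the base LR and total steps are illustrative:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    # Constant until the final warmdown_steps, then linear decay to zero.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```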
Evaluation
multi-order n-gram backoff
parameters: {"orders":[2,3,4,5,6,7],"highest_matching_order_wins":true,"score_first":true,"min_count":2}
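A minimal sketch of highest-matching-order backoff with the parameters above (orders 2..7, min_count 2); plain dicts stand in for the hashed cache, and the fallback-to-model behavior is represented by returning `None`:

```python
from collections import defaultdict
import numpy as np

ORDERS = [2, 3, 4, 5, 6, 7]  # from the parameters above
MIN_COUNT = 2

counts = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}

def update(tokens: list[int]):
    # Insert all n-grams of every order from a scored segment.
    for n in ORDERS:
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])
            counts[n][ctx][tokens[i]] += 1

def backoff_probs(history: list[int], vocab_size: int):
    # Highest matching order wins: use the longest context whose total
    # count reaches MIN_COUNT; otherwise back off to the next order down.
    for n in sorted(ORDERS, reverse=True):
        ctx = tuple(history[-(n - 1):])
        if len(history) >= n - 1 and ctx in counts[n]:
            bucket = counts[n][ctx]
            total = sum(bucket.values())
            if total >= MIN_COUNT:
                p = np.zeros(vocab_size)
                for tok, c in bucket.items():
                    p[tok] = c / total
                return p
    return None  # no usable order; fall back to the neural model alone
```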
entropy-adaptive alpha
parameters: {"formula":"alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
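The interpolation coefficient follows directly from the formula above: when the model's predictive entropy H is high, alpha grows and more weight shifts to the cache/backoff distribution.

```python
import math

def adaptive_alpha(H: float) -> float:
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), per the formula above.
    # Bounded in (0.05, 0.60): uncertain predictions lean more on the cache.
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))
```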
sliding window eval
parameters: {"stride":64,"batch_seqs":64}
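A sketch of how stride-64 sliding-window evaluation could partition a token stream, assuming the usual scheme where each window scores only its final `stride` tokens (so every scored token sees near-full context); the window length 2048 matches the training length below:

```python
def sliding_window_spans(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    # Returns (begin, end, score_from) triples: the window covers
    # [begin, end) and only positions [score_from, end) are scored.
    spans = [(0, min(seq_len, n_tokens), 0)]  # first window scores everything
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        begin = max(0, new_end - seq_len)
        spans.append((begin, new_end, end))
        end = new_end
    return spans
```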
Test-Time Training
LoRA TTT
parameters: {"rank":8,"targets":["lm_head","Q","V"]}
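A minimal sketch of the rank-8 LoRA adapter applied to a linear layer such as lm_head, Q, or V; the init scales and the NumPy framing are illustrative, not taken from the PR:

```python
import numpy as np

RANK = 8  # from the parameters above

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank delta B @ A.

    During test-time training only A and B are updated per document; the
    base weight never changes, so the adapter is cheap to reset.
    """

    def __init__(self, w: np.ndarray, rank: int = RANK, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w.shape
        self.w = w                                        # frozen base weight
        self.a = rng.standard_normal((rank, d_in)) * 0.01  # small random init
        self.b = np.zeros((d_out, rank))                   # zero init: delta starts at 0

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ (self.w + self.b @ self.a).T
```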
Initialization
orthogonal init
Orthogonal initialization used for the model.
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Score-first policy: the neural / hashed n-gram cache is updated only after each segment has been scored.
parameters: {"cache_orders":[2,3,4,5,6,7]}
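The score-first policy reduces to an ordering constraint in the evaluation loop: each segment is scored against the cache built from earlier segments only, then inserted, so a segment never leaks into its own score. A sketch:

```python
def evaluate_score_first(segments, score_fn, update_fn):
    # Score each segment BEFORE inserting it into the cache.
    total = 0.0
    for seg in segments:
        total += score_fn(seg)  # cache reflects earlier segments only
        update_fn(seg)          # now the segment may influence later scores
    return total
```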

Novel Contributions

  • Multi-order n-gram backoff evaluation with highest-matching-order selection
  • Entropy-adaptive interpolation coefficient for cache/backoff scoring
  • Score-first cache update policy to avoid leakage
  • Hashed n-gram cache across orders 2 through 7
  • Mixed int5/int6 quantization with zstd roundtrip
  • Neural cache evaluation using cosine similarity over cached hidden states
  • Per-document LoRA test-time training on lm_head, Q, and V projections
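The cosine-similarity neural cache in the list above can be sketched as a pointer-style cache: the current hidden state is compared to cached hidden states, and each cached state votes for the token that followed it. The temperature `theta` and the softmax-over-similarities form are assumptions:

```python
import numpy as np

def neural_cache_probs(h: np.ndarray, cached_h: np.ndarray,
                       cached_next: np.ndarray, vocab_size: int,
                       theta: float = 10.0) -> np.ndarray:
    # Cosine similarity of the current hidden state to each cached state;
    # each cached state votes (with softmax weight) for its next token.
    sims = cached_h @ h / (np.linalg.norm(cached_h, axis=1)
                           * np.linalg.norm(h) + 1e-8)
    w = np.exp(theta * sims)
    p = np.zeros(vocab_size)
    np.add.at(p, cached_next, w)  # accumulate votes per token id
    return p / p.sum()
```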