PR #962

open

Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)

by AnirudhRahul
val_bpb
0.0214
Architecture
Transformer
Optimizer
Artifact Size
15,849,498 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: model weights
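GPTQ quantizes layer weights column by column with second-order error correction; as a simpler illustration of the 6-bit setting used here, this sketch shows plain round-to-nearest quantization with a symmetric per-row scale, which is the baseline GPTQ improves on. Function names are illustrative, not from the submission.

```python
def quantize_rtn(row, bits=6):
    # symmetric per-row scale mapping weights onto signed integers
    qmax = 2 ** (bits - 1) - 1              # 31 for 6 bits
    scale = (max(abs(w) for w in row) / qmax) or 1.0  # avoid 0 scale for all-zero rows
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    # reconstruct approximate weights from integers and the shared scale
    return [x * scale for x in q]
```

At 6 bits the integer range is [-32, 31], so the worst-case reconstruction error per weight is bounded by the scale.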
Architecture
BigramHash
Packed order-2..9 n-gram cache whose per-order experts are scored through a learned gate at evaluation time.
parameters: {"orders":"2..9","buckets":32768}
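A minimal sketch of what a packed hashed n-gram cache with these parameters (orders 2..9, 32768 buckets) could look like. The class name, the rolling hash, and the flat count tables are illustrative assumptions, not the submission's implementation.

```python
class NGramCache:
    def __init__(self, min_order=2, max_order=9, buckets=32768):
        self.min_order = min_order
        self.max_order = max_order
        self.buckets = buckets
        # one fixed-size count table per order (assumption: simple counts)
        self.tables = {n: [0] * buckets for n in range(min_order, max_order + 1)}

    def _bucket(self, context):
        # polynomial rolling hash folded into the fixed bucket range
        h = 0
        for tok in context:
            h = (h * 1000003 + tok) % self.buckets
        return h

    def update(self, tokens):
        # count every n-gram of each tracked order in the token stream
        for n in range(self.min_order, self.max_order + 1):
            for i in range(len(tokens) - n + 1):
                self.tables[n][self._bucket(tuple(tokens[i:i + n]))] += 1

    def lookup(self, context):
        # per-order bucket counts for the trailing context
        out = {}
        for n in range(self.min_order, self.max_order + 1):
            if len(context) >= n:
                out[n] = self.tables[n][self._bucket(tuple(context[-n:]))]
        return out
```

Fixed-size bucket tables keep memory constant regardless of corpus size, at the cost of hash collisions, which fits the card's fixed-cache, low eval-time memory framing.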
weight tying
Not explicitly stated as tied embeddings; no evidence of weight tying in the submission.
parameters: null
Evaluation
stride-based eval
parameters: {"stride":64}
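A minimal sketch of stride-based evaluation with the stride=64 listed above: overlapping windows are re-scored, but only the last `stride` positions of each window contribute fresh losses, so every token is scored exactly once with as much left context as the window allows. `score_window` is a hypothetical stand-in for the model's per-position loss function.

```python
def strided_eval(tokens, window, stride, score_window):
    total_loss, n_scored, scored = 0.0, 0, 0
    while scored < len(tokens):
        end = min(scored + stride, len(tokens))
        start = max(0, end - window)
        losses = score_window(tokens[start:end])  # one loss per position
        fresh = end - scored                      # positions not yet scored
        total_loss += sum(losses[-fresh:])
        n_scored += fresh
        scored = end
    return total_loss / n_scored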
Test-Time Training
TTT
parameters: {"epochs":0,"freeze_blocks":2,"learning_rate":0.0001}
Sequence Length
sequence_length
train_length: 131072
eval_length: null
Regularization
weight decay
parameters: {"weight_decay":0.01}
Other
other
Learned gate over neural and n-gram experts with context-only expert availability masking.
parameters: null
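An illustrative sketch (not the submission's code) of a learned gate mixing a neural expert with n-gram experts, where "context-only availability masking" means experts whose context was never seen are excluded before the gate's softmax:

```python
import math

def gated_mix(gate_logits, expert_logprobs, available):
    # gate_logits: one learned score per expert
    # available[i] is False when expert i has no entry for this context
    masked = [g if a else float("-inf") for g, a in zip(gate_logits, available)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over available experts
    # mix expert predictions in probability space
    mixed_prob = sum(w * math.exp(lp) for w, lp in zip(weights, expert_logprobs))
    return math.log(mixed_prob)
```

Masking with -inf before the softmax guarantees an unavailable expert gets exactly zero weight, rather than a small but nonzero share.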
other
Online logit calibration during evaluation.
parameters: null
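A hedged sketch of what online logit calibration could look like: a single log-temperature updated during evaluation by gradient descent on the per-step negative log-likelihood. The numerical gradient, learning rate, and class name are illustrative assumptions, not the submission's method.

```python
import math

class OnlineCalibrator:
    def __init__(self, lr=0.01):
        self.log_t = 0.0   # log-temperature, starts at T = 1
        self.lr = lr

    def calibrated_logprobs(self, logits):
        t = math.exp(self.log_t)
        scaled = [x / t for x in logits]
        m = max(scaled)
        z = m + math.log(sum(math.exp(x - m) for x in scaled))
        return [x - z for x in scaled]        # log-softmax at temperature T

    def update(self, logits, target):
        # numerical gradient of the NLL w.r.t. log-temperature
        eps = 1e-4
        def nll(log_t):
            t = math.exp(log_t)
            scaled = [x / t for x in logits]
            m = max(scaled)
            z = m + math.log(sum(math.exp(x - m) for x in scaled))
            return -(scaled[target] - z)
        g = (nll(self.log_t + eps) - nll(self.log_t - eps)) / (2 * eps)
        self.log_t -= self.lr * g
```

Updating only a scalar keeps the calibration cheap and avoids re-training any model weights at evaluation time.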

Novel Contributions

  • Packed order-2..9 training n-gram cache persisted inside the submission artifact
  • Learned gate over neural and n-gram experts with context-only expert availability masking
  • Removal of the logistic context mixer from the final eval path
  • Removal of the long phrase cache from the final eval path
  • Single-pass causal evaluation with cache updates only after scoring each chunk
  • GPTQ calibration using cached training batches within the training budget
  • Low eval-time memory regime with a fixed 2 MiB n-gram cache