PR #962

open

Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)

by AnirudhRahul
val_bpb
0.0214
Architecture
Transformer
Optimizer
Artifact Size
15,849,498 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: model weights
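GPTQ quantizes layer weights column by column with second-order error correction; as a simpler illustration of the 6-bit setting used here, this sketch shows plain round-to-nearest quantization with a symmetric per-row scale, which is the baseline GPTQ improves on. Function names are illustrative, not from the submission.

```python
def quantize_rtn(row, bits=6):
    # symmetric per-row scale mapping weights onto signed integers
    qmax = 2 ** (bits - 1) - 1              # 31 for 6 bits
    scale = (max(abs(w) for w in row) / qmax) or 1.0  # avoid 0 scale for all-zero rows
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    # reconstruct approximate weights from integers and the shared scale
    return [x * scale for x in q]
```

At 6 bits the integer range is [-32, 31], so the worst-case reconstruction error per weight is bounded by the scale.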
Architecture
BigramHash
Packed order-2..9 n-gram cache whose per-order experts are scored through a learned gate at evaluation time.
parameters: {"orders":"2..9","buckets":32768}
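A minimal sketch of what a packed hashed n-gram cache with these parameters (orders 2..9, 32768 buckets) could look like. The class name, the rolling hash, and the flat count tables are illustrative assumptions, not the submission's implementation.

```python
class NGramCache:
    def __init__(self, min_order=2, max_order=9, buckets=32768):
        self.min_order = min_order
        self.max_order = max_order
        self.buckets = buckets
        # one fixed-size count table per order (assumption: simple counts)
        self.tables = {n: [0] * buckets for n in range(min_order, max_order + 1)}

    def _bucket(self, context):
        # polynomial rolling hash folded into the fixed bucket range
        h = 0
        for tok in context:
            h = (h * 1000003 + tok) % self.buckets
        return h

    def update(self, tokens):
        # count every n-gram of each tracked order in the token stream
        for n in range(self.min_order, self.max_order + 1):
            for i in range(len(tokens) - n + 1):
                self.tables[n][self._bucket(tuple(tokens[i:i + n]))] += 1

    def lookup(self, context):
        # per-order bucket counts for the trailing context
        out = {}
        for n in range(self.min_order, self.max_order + 1):
            if len(context) >= n:
                out[n] = self.tables[n][self._bucket(tuple(context[-n:]))]
        return out
```

Fixed-size bucket tables keep memory constant regardless of corpus size, at the cost of hash collisions, which fits the card's fixed-cache, low eval-time memory framing.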
weight tying
Not explicitly stated as tied embeddings; no evidence of weight tying in the submission.
parameters: null
Evaluation
stride-based eval
parameters: {"stride":64}
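A minimal sketch of stride-based evaluation with the stride=64 listed above: overlapping windows are re-scored, but only the last `stride` positions of each window contribute fresh losses, so every token is scored exactly once with as much left context as the window allows. `score_window` is a hypothetical stand-in for the model's per-position loss function.

```python
def strided_eval(tokens, window, stride, score_window):
    total_loss, n_scored, scored = 0.0, 0, 0
    while scored < len(tokens):
        end = min(scored + stride, len(tokens))
        start = max(0, end - window)
        losses = score_window(tokens[start:end])  # one loss per position
        fresh = end - scored                      # positions not yet scored
        total_loss += sum(losses[-fresh:])
        n_scored += fresh
        scored = end
    return total_loss / n_scored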
Test-Time Training
TTT
parameters: {"epochs":0,"freeze_blocks":2,"learning_rate":0.0001}
Sequence Length
sequence_length
train_length: 131072
eval_length: null
Regularization
weight decay
parameters: {"weight_decay":0.01}
Other
other
Learned gate over neural and n-gram experts with context-only expert availability masking.
parameters: null
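An illustrative sketch (not the submission's code) of a learned gate mixing a neural expert with n-gram experts, where "context-only availability masking" means experts whose context was never seen are excluded before the gate's softmax:

```python
import math

def gated_mix(gate_logits, expert_logprobs, available):
    # gate_logits: one learned score per expert
    # available[i] is False when expert i has no entry for this context
    masked = [g if a else float("-inf") for g, a in zip(gate_logits, available)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over available experts
    # mix expert predictions in probability space
    mixed_prob = sum(w * math.exp(lp) for w, lp in zip(weights, expert_logprobs))
    return math.log(mixed_prob)
```

Masking with -inf before the softmax guarantees an unavailable expert gets exactly zero weight, rather than a small but nonzero share.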
other
Online logit calibration during evaluation.
parameters: null
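A hedged sketch of what online logit calibration could look like: a single log-temperature updated during evaluation by gradient descent on the per-step negative log-likelihood. The numerical gradient, learning rate, and class name are illustrative assumptions, not the submission's method.

```python
import math

class OnlineCalibrator:
    def __init__(self, lr=0.01):
        self.log_t = 0.0   # log-temperature, starts at T = 1
        self.lr = lr

    def calibrated_logprobs(self, logits):
        t = math.exp(self.log_t)
        scaled = [x / t for x in logits]
        m = max(scaled)
        z = m + math.log(sum(math.exp(x - m) for x in scaled))
        return [x - z for x in scaled]        # log-softmax at temperature T

    def update(self, logits, target):
        # numerical gradient of the NLL w.r.t. log-temperature
        eps = 1e-4
        def nll(log_t):
            t = math.exp(log_t)
            scaled = [x / t for x in logits]
            m = max(scaled)
            z = m + math.log(sum(math.exp(x - m) for x in scaled))
            return -(scaled[target] - z)
        g = (nll(self.log_t + eps) - nll(self.log_t - eps)) / (2 * eps)
        self.log_t -= self.lr * g
```

Updating only a scalar keeps the calibration cheap and avoids re-training any model weights at evaluation time.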

Novel Contributions

  • Packed order-2..9 training n-gram cache persisted inside the submission artifact
  • Learned gate over neural and n-gram experts with context-only expert availability masking
  • Removal of the logistic context mixer from the final eval path
  • Removal of the long phrase cache from the final eval path
  • Single-pass causal evaluation with cache updates only after scoring each chunk
  • GPTQ calibration using cached training batches within the training budget
  • Low eval-time memory regime with a fixed 2 MiB n-gram cache