PR #962
openRecord: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)
by AnirudhRahul
val_bpb
0.0214
Architecture
Transformer
Optimizer
—
Artifact Size
15,849,498 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: model weights
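The 6-bit scope can be illustrated with a minimal sketch. Note this shows only the symmetric 6-bit uniform grid; GPTQ proper chooses roundings with Hessian-based error correction over calibration batches, which is omitted here.

```python
def quantize_6bit(weights):
    """Round-to-nearest onto a symmetric 6-bit grid.

    Illustrative only: GPTQ additionally corrects rounding error using
    second-order (Hessian) information from calibration data.
    """
    bits = 6
    qmax = 2 ** (bits - 1) - 1                 # 31 levels each side of zero
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    dequantized = [qi * scale for qi in q]     # what the model sees at eval
    return q, dequantized, scale

q, deq, scale = quantize_6bit([0.5, -1.2, 0.03, 0.9])
```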
Architecture
BigramHash
Packed order-2..9 n-gram cache whose per-order experts are combined by a learned gate for evaluation-time scoring.
parameters: {"orders":"2..9","buckets":32768}
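A hashed n-gram cache with these parameters (orders 2..9, 32768 buckets) could look like the following sketch; the hash function and count-table layout are assumptions, not the submission's packed format.

```python
import zlib

ORDERS = range(2, 10)   # orders 2..9, per the submission's parameters
BUCKETS = 32768         # hash buckets, per the submission's parameters

def bucket(tokens, order):
    """Hash the last `order` tokens into one of BUCKETS slots (crc32 is an assumption)."""
    key = ",".join(map(str, tokens[-order:])).encode()
    return zlib.crc32(key) % BUCKETS

# hypothetical count table: counts[order][bucket][next_token] -> count
counts = {o: [dict() for _ in range(BUCKETS)] for o in ORDERS}

def update(context, next_token):
    """Record next_token under every order the context is long enough for."""
    for o in ORDERS:
        if len(context) >= o:
            slot = counts[o][bucket(context, o)]
            slot[next_token] = slot.get(next_token, 0) + 1
```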
weight tying
Tied embeddings are not explicitly stated; the submission shows no evidence of weight tying.
parameters: null
Evaluation
stride-based eval
parameters: {"stride":64}
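Stride-based evaluation scores each token exactly once while re-supplying earlier tokens as context. A minimal sketch of the window arithmetic, assuming the common sliding-window scheme (only stride=64 is given by the submission):

```python
def stride_windows(n_tokens, max_len, stride=64):
    """Yield (begin, end, scored_from) windows for strided evaluation.

    Each window may cover up to max_len tokens, but only positions from
    scored_from to end are scored, so every token is scored exactly once
    with as much left context as the window allows.
    """
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```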
Test-Time Training
TTT
parameters: {"epochs":0,"freeze_blocks":2,"learning_rate":0.0001}
Sequence Length
sequence_length
train_length: 131072
eval_length: null
Regularization
weight decay
parameters: {"weight_decay":0.01}
Other
other
Learned gate over neural and n-gram experts with context-only expert availability masking.
parameters: null
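The availability masking can be sketched as follows: gate weights for unavailable experts (e.g. n-gram orders whose bucket has never been seen in the context so far) are forced to zero before normalization. The uniform gate scores here are placeholders for the learned ones.

```python
import math

def gated_mixture(expert_logps, available):
    """Mix expert log-probs with a gate masked by context-only availability.

    `available[i]` depends only on the context (never on the target token),
    so the masking keeps scoring causal.
    """
    gate_scores = [0.0] * len(expert_logps)   # learned in the real model
    masked = [g if a else float("-inf") for g, a in zip(gate_scores, available)]
    z = math.log(sum(math.exp(m) for m in masked if m != float("-inf")))
    weights = [math.exp(m - z) if m != float("-inf") else 0.0 for m in masked]
    return sum(w * lp for w, lp in zip(weights, expert_logps))
```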
other
Online logit calibration during evaluation.
parameters: null
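One common form of online logit calibration is temperature scaling adjusted as evaluation proceeds; the submission's exact calibration rule is not stated, so the update below is a hypothetical one-parameter illustration.

```python
import math

def calibrated_logprob(logits, target, temperature):
    """Score the target token under temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    z = math.log(sum(math.exp(s) for s in scaled))
    return scaled[target] - z

def update_temperature(temperature, grad, lr=0.01):
    """Hypothetical online nudge to the temperature after each scored chunk."""
    return max(0.1, temperature - lr * grad)
```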
Novel Contributions
- Packed order-2..9 training n-gram artifact persisted inside the submission artifact
- Learned gate over neural and n-gram experts with context-only expert availability
- Removal of the logistic context mixer from the final eval path
- Removal of the long phrase cache from the final eval path
- Single-pass causal evaluation with cache updates only after scoring each chunk
- GPTQ calibration using cached training batches within the training budget
- Low eval-time memory regime with a fixed 2 MiB n-gram cache
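The single-pass causal evaluation described above can be sketched as a loop in which each chunk is scored before its tokens enter the cache; the chunk size and callback signatures are assumptions for illustration.

```python
def single_pass_eval(tokens, score_chunk, update_cache, chunk=64):
    """Score each chunk, then insert it into the n-gram cache.

    Because cache updates happen only after scoring, no position is ever
    scored against counts derived from itself or from future tokens.
    """
    total_logp, n = 0.0, 0
    for start in range(0, len(tokens), chunk):
        piece = tokens[start:start + chunk]
        total_logp += score_chunk(tokens[:start], piece)  # cache reflects only the past
        update_cache(piece)                               # now the chunk becomes "past"
        n += len(piece)
    return total_logp / n   # mean log-prob; convert to bpb downstream
```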