PR #778
openRecord: 11L Full GPTQ + Multi-Order N-gram Backoff (fixed-alpha 0.9757 / entropy-adaptive 0.9605, 3-seed)
by raahilshah
val_bpb
0.9605
Architecture
11L Transformer
Optimizer
Parallel Muon
Artifact Size
15.92 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
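GPTQ quantizes each weight column in sequence and pushes the rounding error onto the not-yet-quantized columns via the inverse Hessian. A minimal per-row sketch under assumptions: one symmetric scale per row and a damped full Hessian inverse (real GPTQ works in Cholesky blocks; the record's group/scale layout is not specified).

```python
import numpy as np

def gptq_quantize_row(w, H, bits=6):
    """GPTQ-style quantization of one weight row with Hessian error
    compensation. Sketch only: one symmetric scale per row, full damped
    inverse instead of the usual Cholesky block formulation.

    w: (n,) weight row; H: (n, n) Hessian proxy from calibration data.
    Returns the dequantized row and the scale.
    """
    w = w.astype(np.float64).copy()
    n = len(w)
    Hinv = np.linalg.inv(H + 1e-6 * np.eye(n))  # damped inverse
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.zeros_like(w)
    for j in range(n):
        # round column j onto the int6 grid
        q[j] = np.clip(np.round(w[j] / scale), -qmax - 1, qmax) * scale
        # push the rounding error onto the remaining columns
        err = (w[j] - q[j]) / Hinv[j, j]
        w[j + 1:] -= err * Hinv[j, j + 1:]
    return q, scale
```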
Architecture
XSA
XSA applied across all layers as part of the custom architecture
parameters: {"layers":11}
SmearGate
Custom gating mechanism in the architecture
parameters: null
BigramHash
Hashed bigram feature component used in the model
parameters: {"buckets":2048}
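A hashed bigram feature maps each (previous token, current token) pair into one of 2048 buckets, each indexing a learned embedding. The mixing function below is illustrative; the record does not specify it.

```python
def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = 2048) -> int:
    """Hash a (prev, cur) token pair into a bucket index.

    The multiplier and xor-shift are hypothetical, not the record's
    actual mixing function.
    """
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 13
    return h % buckets
```

The bucket's embedding would typically be added to the current token's embedding at the model input.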
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions
parameters: {"train_or_eval":null,"dimensions":"16/64"}
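With the 16/64 setting, rotary embeddings touch only the first 16 of each head's 64 dimensions; the rest carry no positional signal. A sketch assuming the half-split pairing convention:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first rot_dims of a head vector; the remaining
    dimensions pass through unchanged (the card's 16/64 setting).
    The half-split pairing convention here is an assumption.
    """
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:rot_dims]
    out = x.astype(np.float64).copy()
    out[:half] = x1 * cos - x2 * sin
    out[half:rot_dims] = x1 * sin + x2 * cos
    return out
```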
MLP3x
Three-layer MLP block
parameters: {"multiplier":3}
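With multiplier 3, the block expands the model dimension threefold and applies three weight layers rather than the usual two. A sketch assuming ReLU activations (the record does not name the nonlinearity):

```python
import numpy as np

def mlp3x(x, W1, W2, W3):
    """Three-layer MLP: d -> 3d -> 3d -> d (activation assumed ReLU)."""
    h = np.maximum(x @ W1, 0.0)  # expand to 3x width
    h = np.maximum(h @ W2, 0.0)  # extra hidden layer vs a standard 2-layer MLP
    return h @ W3                # project back to model width
```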
GQA
Grouped-query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
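With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads, halving KV size for little quality cost. A single-sequence numpy sketch:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Causal grouped-query attention.

    q: (n_heads, T, d); k, v: (n_kv_heads, T, d) with n_heads a multiple
    of n_kv_heads. Each KV head serves n_heads // n_kv_heads query heads.
    """
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # share each KV head across its group
    v = np.repeat(v, group, axis=0)
    T, d = q.shape[1], q.shape[2]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), 1)  # causal: no future tokens
    scores[:, mask] = -1e30
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```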
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
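Muon orthogonalizes each 2-D weight's momentum with a Newton-Schulz iteration before applying it. A sketch using the quintic coefficients from the public Muon reference implementation; since the record lists no hyperparameters, lr and beta below are placeholders, and the square-matrix handling is simplified (the reference transposes tall matrices).

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration
    (coefficients from the public Muon implementation; a sketch, not this
    record's exact code). Assumes G is square or wide."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update on a 2-D weight matrix (placeholder lr/beta)."""
    momentum_buf = beta * momentum_buf + grad
    return w - lr * newton_schulz_orth(momentum_buf), momentum_buf
```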
Weight Averaging
EMA
parameters: {"decay":0.997}
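With decay 0.997, the averaged weights track the training weights over an effective horizon of roughly 1/(1 - 0.997) ≈ 333 steps. A minimal update:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step over a dict of parameter arrays or scalars."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}
```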
Compression
lzma
level: null
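The 15.92 MB artifact size reflects lzma compression of the serialized weights; since the level is listed as unspecified, the preset below is a placeholder.

```python
import lzma
import pickle

def compress_artifact(state, preset=6):
    """Serialize and lzma-compress a parameter dict (preset is a placeholder)."""
    return lzma.compress(pickle.dumps(state), preset=preset)

def decompress_artifact(blob):
    """Invert compress_artifact."""
    return pickle.loads(lzma.decompress(blob))
```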
Evaluation
sliding window eval
parameters: {"stride":64}
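Sliding-window evaluation with stride 64 re-scores the sequence in overlapping windows so each token is scored with near-full left context while being counted exactly once. A sketch of the span bookkeeping; the window length is an assumption (the record gives only the stride).

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    """Return (begin, end, score_from) triples: the window covers
    [begin, end) but only tokens in [score_from, end) contribute to
    the loss, so every token is scored exactly once.
    """
    spans = []
    begin = 0
    while True:
        end = min(begin + window, n_tokens)
        # first window scores everything; later ones only their last stride
        score_from = 0 if begin == 0 else begin + window - stride
        spans.append((begin, end, score_from))
        if end == n_tokens:
            break
        begin += stride
    return spans
```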
Test-Time Training
score-first TTT
parameters: {"backward_looking_cache":true,"ngram_orders":"2-7"}
Regularization
LN Scale
parameters: null
Other
other
Multi-order backward-looking n-gram backoff cache with fixed or entropy-adaptive interpolation between model and n-gram probabilities
parameters: {"orders":"2-7","min_count":2,"buckets_per_order":4194304}
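The cache above can be sketched as follows, under stated assumptions: per-order hashed count tables, backoff from order 7 down to 2 to the longest context seen at least min_count times, and a simple entropy-to-alpha schedule (the record specifies neither the alpha value nor the exact schedule, so both are placeholders). `hash` here is Python's in-process tuple hash; a reproducible run would substitute a fixed hash function.

```python
class NgramBackoffCache:
    """Backward-looking multi-order n-gram cache with backoff (orders 2-7)."""

    def __init__(self, orders=range(2, 8), min_count=2, buckets=4194304):
        self.orders = sorted(orders, reverse=True)  # try longest context first
        self.min_count = min_count
        self.buckets = buckets
        self.joint = {n: {} for n in self.orders}  # (ctx_bucket, token) -> count
        self.ctx = {n: {} for n in self.orders}    # ctx_bucket -> count

    def _bucket(self, toks):
        return hash(toks) % self.buckets

    def update(self, tokens):
        """Score-first TTT: call only after a window has been scored."""
        for n in self.orders:
            for i in range(n - 1, len(tokens)):
                c = self._bucket(tuple(tokens[i - n + 1:i]))
                self.ctx[n][c] = self.ctx[n].get(c, 0) + 1
                key = (c, tokens[i])
                self.joint[n][key] = self.joint[n].get(key, 0) + 1

    def prob(self, context, token):
        """Back off to the longest context with a reliable count; None if none."""
        for n in self.orders:
            if len(context) < n - 1:
                continue
            c = self._bucket(tuple(context[-(n - 1):]))
            total = self.ctx[n].get(c, 0)
            if total >= self.min_count:
                return self.joint[n].get((c, token), 0) / total
        return None


def interpolate(p_model, p_ngram, alpha=0.5, entropy=None, max_entropy=8.0):
    """p = alpha * p_model + (1 - alpha) * p_ngram. When entropy is given,
    alpha shrinks toward its base value as the model grows uncertain
    (an assumed schedule; alpha=0.5 is a placeholder)."""
    if p_ngram is None:
        return p_model
    if entropy is not None:
        alpha = 1.0 - min(entropy / max_entropy, 1.0) * (1.0 - alpha)
    return alpha * p_model + (1.0 - alpha) * p_ngram
```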
Novel Contributions
- Full Hessian GPTQ int6 quantization within the training budget
- Multi-order n-gram backoff cache using orders 2-7
- Fixed-alpha interpolation between neural model and n-gram probabilities
- Entropy-adaptive alpha based only on model output entropy
- Backward-looking cache updates after scoring each window
- Record-setting 3-seed validation performance