PR #778

open

Record: 11L Full GPTQ + Multi-Order N-gram Backoff (fixed-alpha 0.9757 / entropy-adaptive 0.9605, 3-seed)

by raahilshah
val_bpb
0.9605
Architecture
11L Transformer
Optimizer
Parallel Muon
Artifact Size
15.92 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
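The submission runs full-Hessian GPTQ at 6 bits over all weights. A minimal numpy sketch of the per-row update, quantizing weights left to right and pushing each rounding error onto the not-yet-quantized weights via the inverse Hessian (unblocked, plain matrix inverse instead of the Cholesky formulation; the damping and scale choices here are illustrative, not from the PR):

```python
import numpy as np

def gptq_row(w, Hinv, scale, bits=6):
    # Quantize one weight row column-by-column; after fixing each weight,
    # distribute its rounding error over the remaining weights using the
    # inverse Hessian (simplified GPTQ-style error propagation).
    w = w.astype(np.float64).copy()
    qmax = 2 ** (bits - 1) - 1            # symmetric int6 grid: -32..31
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = np.clip(np.round(w[i] / scale), -qmax - 1, qmax)
        err = (w[i] - q[i] * scale) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]
    return q

# Calibration Hessian from layer inputs X: H = 2 X^T X (+ damping)
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))
H = 2 * X.T @ X + 0.01 * np.eye(16)
Hinv = np.linalg.inv(H)
w = rng.normal(size=16)
scale = np.abs(w).max() / 31
q = gptq_row(w, Hinv, scale)
w_hat = q * scale                         # dequantized row
```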
Architecture
XSA
XSA applied across all layers as part of the custom architecture
parameters: {"layers":11}
SmearGate
Custom gating mechanism in the architecture
parameters: null
BigramHash
Hashed bigram feature component used in the model
parameters: {"buckets":2048}
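The hashed-bigram component maps each (previous, current) token pair into one of 2,048 buckets and looks up a learned feature for that bucket. A sketch; the hash mixing constants and the 32-dim feature width are assumptions, not from the PR:

```python
import numpy as np

def bigram_bucket(prev_tok, tok, buckets=2048):
    # Deterministically hash the (prev, current) token pair into a bucket;
    # collisions are simply accepted, as in any hashed feature table.
    return (prev_tok * 1000003 + tok * 8191) % buckets

# Learned per-bucket feature vectors, mixed into the embedding stream
table = np.zeros((2048, 32))          # 32-dim features: an assumption
feat = table[bigram_bucket(17, 42)]
```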
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions
parameters: {"train_or_eval":null,"dimensions":"16/64"}
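Partial RoPE rotates only 16 of the 64 head dimensions and passes the remaining 48 through unchanged. A numpy sketch (the base frequency 10000 is the usual default, assumed here):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (seq_len, head_dim). Apply rotary position embeddings to the
    # first rot_dims dimensions; leave dimensions rot_dims.. untouched.
    seq_len = x.shape[0]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(1).normal(size=(10, 64))
y = partial_rope(x)
```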
MLP3x
Three-layer MLP block
parameters: {"multiplier":3}
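The MLP block uses three linear layers with a 3x hidden width (rather than the usual two layers at 4x). A minimal sketch; the ReLU activation and exact layer shapes are assumptions:

```python
import numpy as np

def mlp3x(x, W1, W2, W3):
    # d_model -> 3*d_model -> 3*d_model -> d_model, two nonlinearities
    h = np.maximum(x @ W1, 0.0)
    h = np.maximum(h @ W2, 0.0)
    return h @ W3

d = 8
rng = np.random.default_rng(2)
W1 = rng.normal(size=(d, 3 * d))
W2 = rng.normal(size=(3 * d, 3 * d))
W3 = rng.normal(size=(3 * d, d))
out = mlp3x(rng.normal(size=(5, d)), W1, W2, W3)
```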
GQA
Grouped-query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
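With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A causal GQA sketch in numpy:

```python
import numpy as np

def gqa(q, k, v):
    # q: (seq, n_heads, d); k, v: (seq, n_kv, d); n_kv divides n_heads.
    seq, n_heads, d = q.shape
    group = n_heads // k.shape[1]        # query heads per KV head (here 2)
    k = np.repeat(k, group, axis=1)      # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    mask = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(mask[None], -1e9, scores)   # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)

rng = np.random.default_rng(3)
q = rng.normal(size=(6, 8, 16))
k = rng.normal(size=(6, 4, 16))
v = rng.normal(size=(6, 4, 16))
out = gqa(q, k, v)
```

At position 0 the causal mask leaves only the token itself, so each query head's output equals the value of its shared KV head.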
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
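Weight averaging keeps an exponential moving average of the parameters with decay 0.997; the averaged copy is what gets evaluated. A minimal sketch:

```python
def ema_update(avg, params, decay=0.997):
    # One EMA step per optimizer update: avg <- decay*avg + (1-decay)*params
    return {k: decay * v + (1 - decay) * params[k] for k, v in avg.items()}

avg = {'w': 0.0}
for step in range(3):
    avg = ema_update(avg, {'w': 1.0})
```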
Compression
lzma
level: null
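The 15.92 MB artifact is the serialized checkpoint compressed with lzma; a round-trip sketch (the preset is an assumption, since the level is unrecorded):

```python
import lzma
import numpy as np

# Stand-in for the packed int6 weight buffer
weights = (np.arange(4096) % 64 - 32).astype(np.int8).tobytes()
blob = lzma.compress(weights, preset=9 | lzma.PRESET_EXTREME)
restored = lzma.decompress(blob)
```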
Evaluation
sliding window eval
parameters: {"stride":64}
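Sliding-window evaluation with stride 64 scores each chunk of 64 tokens with up to a full window of left context, so every token is scored exactly once. A sketch of the span schedule (the 256-token window length is an assumption; only the stride is recorded):

```python
def sliding_eval_spans(n_tokens, window=256, stride=64):
    # Yield (ctx_start, score_start, score_end): tokens in
    # [score_start, score_end) are scored with context from ctx_start.
    spans = []
    begin = 0
    while begin < n_tokens:
        end = min(begin + stride, n_tokens)
        spans.append((max(0, end - window), begin, end))
        begin = end
    return spans

spans = sliding_eval_spans(300)
```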
Test-Time Training
score-first TTT
parameters: {"backward_looking_cache":true,"ngram_orders":"2-7"}
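"Score-first" means each evaluation window is scored before its tokens enter the test-time cache, so the cache stays strictly backward-looking and never leaks the tokens currently being scored. A minimal loop, with a unigram count cache standing in for the real order-2-7 cache and an illustrative alpha:

```python
import math
from collections import Counter

def score_first_bpb(windows, model_prob, alpha=0.9):
    # Each window is scored with the current cache state, and only then
    # are its tokens counted (score-first / backward-looking updates).
    counts, seen = Counter(), 0
    total_bits, n_tok = 0.0, 0
    for tokens in windows:
        for i, t in enumerate(tokens):
            p = model_prob(tokens[:i], t)
            if seen > 0:
                p_c = counts[t] / seen               # cache probability
                p = alpha * p + (1 - alpha) * p_c    # fixed-alpha blend
            total_bits -= math.log2(max(p, 1e-12))
            n_tok += 1
        counts.update(tokens)      # update AFTER scoring the window
        seen += len(tokens)
    return total_bits / n_tok

# Uniform 256-way "model" for illustration
bpb = score_first_bpb([[1, 2], [1, 2]], lambda ctx, t: 1 / 256)
```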
Regularization
LN Scale
parameters: null
Other
other
Multi-order backward-looking n-gram backoff cache with fixed or entropy-adaptive interpolation between model and n-gram probabilities
parameters: {"orders":"2-7","min_count":2,"buckets_per_order":4194304}
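The cache keeps hashed counts per order (4,194,304 buckets each) for orders 2-7, backs off from the longest context with at least min_count occurrences, and interpolates the resulting distribution with the model's, either at a fixed alpha or one adapted to the model's output entropy. A sketch; the entropy-to-alpha mapping shown is hypothetical, as the PR does not specify it:

```python
import math
from collections import defaultdict

class NGramBackoffCache:
    # Multi-order backward-looking n-gram cache: counts live in hashed
    # buckets (collisions accepted); prediction backs off from the longest
    # context with at least `min_count` occurrences.
    def __init__(self, orders=range(2, 8), min_count=2, buckets=4194304):
        self.orders = sorted(orders, reverse=True)   # try order 7 first
        self.min_count = min_count
        self.buckets = buckets
        self.ctx = defaultdict(int)    # (order, ctx_bucket) -> count
        self.pair = defaultdict(int)   # (order, ctx_bucket, token) -> count

    def _bucket(self, toks):
        return hash(toks) % self.buckets

    def update(self, tokens):
        # Called only AFTER a window has been scored (score-first TTT)
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                b = self._bucket(tuple(tokens[i:i + n - 1]))
                self.ctx[(n, b)] += 1
                self.pair[(n, b, tokens[i + n - 1])] += 1

    def prob(self, context, token):
        for n in self.orders:
            if len(context) < n - 1:
                continue
            b = self._bucket(tuple(context[-(n - 1):]))
            c = self.ctx.get((n, b), 0)
            if c >= self.min_count:
                return self.pair.get((n, b, token), 0) / c
        return None    # no evidence at any order: use the model alone

def adaptive_alpha(model_probs, lo=0.5, hi=0.98):
    # Hypothetical schedule: weight the model more when it is confident
    # (low output entropy), the n-gram cache more when it is not.
    h = -sum(p * math.log2(p) for p in model_probs if p > 0)
    return hi - (hi - lo) * (h / math.log2(len(model_probs)))

cache = NGramBackoffCache()
cache.update([1, 2, 3, 1, 2, 3, 1, 2, 4])
```

The final probability is then `alpha * p_model + (1 - alpha) * p_ngram` whenever the cache returns evidence, and `p_model` alone otherwise.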

Novel Contributions

  • Full Hessian GPTQ int6 quantization within the training budget
  • Multi-order n-gram backoff cache using orders 2-7
  • Fixed-alpha interpolation between neural model and n-gram probabilities
  • Entropy-adaptive alpha based only on model output entropy
  • Backward-looking cache updates after scoring each window
  • Record-setting 3-seed validation performance