PR #778

open

Record: 11L Full GPTQ + Multi-Order N-gram Backoff (fixed-alpha 0.9757 / entropy-adaptive 0.9605, 3-seed)

by raahilshah
val_bpb
0.9605
Architecture
11L Transformer
Optimizer
Parallel Muon
Artifact Size
15.92 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
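The submission runs full-Hessian GPTQ at 6 bits over all weights. A minimal numpy sketch of the per-row update, quantizing weights left to right and pushing each rounding error onto the not-yet-quantized weights via the inverse Hessian (unblocked, plain matrix inverse instead of the Cholesky formulation; the damping and scale choices here are illustrative, not from the PR):

```python
import numpy as np

def gptq_row(w, Hinv, scale, bits=6):
    # Quantize one weight row column-by-column; after fixing each weight,
    # distribute its rounding error over the remaining weights using the
    # inverse Hessian (simplified GPTQ-style error propagation).
    w = w.astype(np.float64).copy()
    qmax = 2 ** (bits - 1) - 1            # symmetric int6 grid: -32..31
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = np.clip(np.round(w[i] / scale), -qmax - 1, qmax)
        err = (w[i] - q[i] * scale) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]
    return q

# Calibration Hessian from layer inputs X: H = 2 X^T X (+ damping)
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))
H = 2 * X.T @ X + 0.01 * np.eye(16)
Hinv = np.linalg.inv(H)
w = rng.normal(size=16)
scale = np.abs(w).max() / 31
q = gptq_row(w, Hinv, scale)
w_hat = q * scale                         # dequantized row
```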
Architecture
XSA
XSA applied across all layers as part of the custom architecture
parameters: {"layers":11}
SmearGate
Custom gating mechanism in the architecture
parameters: null
BigramHash
Hashed bigram feature component used in the model
parameters: {"buckets":2048}
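The hashed-bigram component maps each (previous, current) token pair into one of 2,048 buckets and looks up a learned feature for that bucket. A sketch; the hash mixing constants and the 32-dim feature width are assumptions, not from the PR:

```python
import numpy as np

def bigram_bucket(prev_tok, tok, buckets=2048):
    # Deterministically hash the (prev, current) token pair into a bucket;
    # collisions are simply accepted, as in any hashed feature table.
    return (prev_tok * 1000003 + tok * 8191) % buckets

# Learned per-bucket feature vectors, mixed into the embedding stream
table = np.zeros((2048, 32))          # 32-dim features: an assumption
feat = table[bigram_bucket(17, 42)]
```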
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions
parameters: {"train_or_eval":null,"dimensions":"16/64"}
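Partial RoPE rotates only 16 of the 64 head dimensions and passes the remaining 48 through unchanged. A numpy sketch (the base frequency 10000 is the usual default, assumed here):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (seq_len, head_dim). Apply rotary position embeddings to the
    # first rot_dims dimensions; leave dimensions rot_dims.. untouched.
    seq_len = x.shape[0]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(1).normal(size=(10, 64))
y = partial_rope(x)
```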
MLP3x
Three-layer MLP block
parameters: {"multiplier":3}
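The MLP block uses three linear layers with a 3x hidden width (rather than the usual two layers at 4x). A minimal sketch; the ReLU activation and exact layer shapes are assumptions:

```python
import numpy as np

def mlp3x(x, W1, W2, W3):
    # d_model -> 3*d_model -> 3*d_model -> d_model, two nonlinearities
    h = np.maximum(x @ W1, 0.0)
    h = np.maximum(h @ W2, 0.0)
    return h @ W3

d = 8
rng = np.random.default_rng(2)
W1 = rng.normal(size=(d, 3 * d))
W2 = rng.normal(size=(3 * d, 3 * d))
W3 = rng.normal(size=(3 * d, d))
out = mlp3x(rng.normal(size=(5, d)), W1, W2, W3)
```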
GQA
Grouped-query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
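With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A causal GQA sketch in numpy:

```python
import numpy as np

def gqa(q, k, v):
    # q: (seq, n_heads, d); k, v: (seq, n_kv, d); n_kv divides n_heads.
    seq, n_heads, d = q.shape
    group = n_heads // k.shape[1]        # query heads per KV head (here 2)
    k = np.repeat(k, group, axis=1)      # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    mask = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(mask[None], -1e9, scores)   # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)

rng = np.random.default_rng(3)
q = rng.normal(size=(6, 8, 16))
k = rng.normal(size=(6, 4, 16))
v = rng.normal(size=(6, 4, 16))
out = gqa(q, k, v)
```

At position 0 the causal mask leaves only the token itself, so each query head's output equals the value of its shared KV head.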
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
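Weight averaging keeps an exponential moving average of the parameters with decay 0.997; the averaged copy is what gets evaluated. A minimal sketch:

```python
def ema_update(avg, params, decay=0.997):
    # One EMA step per optimizer update: avg <- decay*avg + (1-decay)*params
    return {k: decay * v + (1 - decay) * params[k] for k, v in avg.items()}

avg = {'w': 0.0}
for step in range(3):
    avg = ema_update(avg, {'w': 1.0})
```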
Compression
lzma
level: null
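The 15.92 MB artifact is the serialized checkpoint compressed with lzma; a round-trip sketch (the preset is an assumption, since the level is unrecorded):

```python
import lzma
import numpy as np

# Stand-in for the packed int6 weight buffer
weights = (np.arange(4096) % 64 - 32).astype(np.int8).tobytes()
blob = lzma.compress(weights, preset=9 | lzma.PRESET_EXTREME)
restored = lzma.decompress(blob)
```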
Evaluation
sliding window eval
parameters: {"stride":64}
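Sliding-window evaluation with stride 64 scores each chunk of 64 tokens with up to a full window of left context, so every token is scored exactly once. A sketch of the span schedule (the 256-token window length is an assumption; only the stride is recorded):

```python
def sliding_eval_spans(n_tokens, window=256, stride=64):
    # Yield (ctx_start, score_start, score_end): tokens in
    # [score_start, score_end) are scored with context from ctx_start.
    spans = []
    begin = 0
    while begin < n_tokens:
        end = min(begin + stride, n_tokens)
        spans.append((max(0, end - window), begin, end))
        begin = end
    return spans

spans = sliding_eval_spans(300)
```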
Test-Time Training
score-first TTT
parameters: {"backward_looking_cache":true,"ngram_orders":"2-7"}
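"Score-first" means each evaluation window is scored before its tokens enter the test-time cache, so the cache stays strictly backward-looking and never leaks the tokens currently being scored. A minimal loop, with a unigram count cache standing in for the real order-2-7 cache and an illustrative alpha:

```python
import math
from collections import Counter

def score_first_bpb(windows, model_prob, alpha=0.9):
    # Each window is scored with the current cache state, and only then
    # are its tokens counted (score-first / backward-looking updates).
    counts, seen = Counter(), 0
    total_bits, n_tok = 0.0, 0
    for tokens in windows:
        for i, t in enumerate(tokens):
            p = model_prob(tokens[:i], t)
            if seen > 0:
                p_c = counts[t] / seen               # cache probability
                p = alpha * p + (1 - alpha) * p_c    # fixed-alpha blend
            total_bits -= math.log2(max(p, 1e-12))
            n_tok += 1
        counts.update(tokens)      # update AFTER scoring the window
        seen += len(tokens)
    return total_bits / n_tok

# Uniform 256-way "model" for illustration
bpb = score_first_bpb([[1, 2], [1, 2]], lambda ctx, t: 1 / 256)
```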
Regularization
LN Scale
parameters: null
Other
other
Multi-order backward-looking n-gram backoff cache with fixed or entropy-adaptive interpolation between model and n-gram probabilities
parameters: {"orders":"2-7","min_count":2,"buckets_per_order":4194304}
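The cache keeps hashed counts per order (4,194,304 buckets each) for orders 2-7, backs off from the longest context with at least min_count occurrences, and interpolates the resulting distribution with the model's, either at a fixed alpha or one adapted to the model's output entropy. A sketch; the entropy-to-alpha mapping shown is hypothetical, as the PR does not specify it:

```python
import math
from collections import defaultdict

class NGramBackoffCache:
    # Multi-order backward-looking n-gram cache: counts live in hashed
    # buckets (collisions accepted); prediction backs off from the longest
    # context with at least `min_count` occurrences.
    def __init__(self, orders=range(2, 8), min_count=2, buckets=4194304):
        self.orders = sorted(orders, reverse=True)   # try order 7 first
        self.min_count = min_count
        self.buckets = buckets
        self.ctx = defaultdict(int)    # (order, ctx_bucket) -> count
        self.pair = defaultdict(int)   # (order, ctx_bucket, token) -> count

    def _bucket(self, toks):
        return hash(toks) % self.buckets

    def update(self, tokens):
        # Called only AFTER a window has been scored (score-first TTT)
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                b = self._bucket(tuple(tokens[i:i + n - 1]))
                self.ctx[(n, b)] += 1
                self.pair[(n, b, tokens[i + n - 1])] += 1

    def prob(self, context, token):
        for n in self.orders:
            if len(context) < n - 1:
                continue
            b = self._bucket(tuple(context[-(n - 1):]))
            c = self.ctx.get((n, b), 0)
            if c >= self.min_count:
                return self.pair.get((n, b, token), 0) / c
        return None    # no evidence at any order: use the model alone

def adaptive_alpha(model_probs, lo=0.5, hi=0.98):
    # Hypothetical schedule: weight the model more when it is confident
    # (low output entropy), the n-gram cache more when it is not.
    h = -sum(p * math.log2(p) for p in model_probs if p > 0)
    return hi - (hi - lo) * (h / math.log2(len(model_probs)))

cache = NGramBackoffCache()
cache.update([1, 2, 3, 1, 2, 3, 1, 2, 4])
```

The final probability is then `alpha * p_model + (1 - alpha) * p_ngram` whenever the cache returns evidence, and `p_model` alone otherwise.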

Novel Contributions

  • Full Hessian GPTQ int6 quantization within the training budget
  • Multi-order n-gram backoff cache using orders 2-7
  • Fixed-alpha interpolation between neural model and n-gram probabilities
  • Entropy-adaptive alpha based only on model output entropy
  • Backward-looking cache updates after scoring each window
  • Record-setting 3-seed validation performance