PR #1683

open

10-min record: 13L int4 MLP + qTTT + QAT Precompile + ANS Hybrid (val…

by yunoshev
val_bpb: 1.1280
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.35 MB

Training Techniques

Architecture
GQA
13-layer transformer with grouped query attention: 8 query heads sharing 4 key/value heads.
parameters: {"layers":13,"d_model":512,"heads":8,"kv":4}
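A minimal sketch of the head sharing implied by `{"heads":8,"kv":4}` (illustrative only, not the PR's code): each KV head serves a group of 8 // 4 = 2 query heads.

```python
# Grouped query attention head mapping: 8 query heads share 4 KV heads,
# so each KV head serves a contiguous group of 2 query heads.
def kv_head_for(query_head: int, n_heads: int = 8, n_kv: int = 4) -> int:
    group_size = n_heads // n_kv   # 2 query heads per KV head
    return query_head // group_size

mapping = [kv_head_for(q) for q in range(8)]
# query heads 0,1 -> KV head 0; 2,3 -> 1; 4,5 -> 2; 6,7 -> 3
```

Halving the KV heads shrinks the KV cache without reducing the number of query projections.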
Partial RoPE
Uses partial rotary position embeddings covering 25% of head dimensions.
parameters: {"ratio":0.25}
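Partial RoPE can be sketched as follows: rotate only the leading fraction of each head's dimensions and pass the rest through unchanged. This toy version works on a single flat vector; real kernels operate on batched tensors, and the frequency layout here is one common convention, not necessarily the PR's.

```python
import math

def partial_rope(x, pos, ratio=0.25, base=10000.0):
    """Apply rotary position embedding to the first `ratio` of dims;
    leave the remaining dims untouched (sketch, single head vector)."""
    d = len(x)
    d_rot = int(d * ratio)                # e.g. 16 of 64 dims at ratio 0.25
    out = list(x)
    for i in range(0, d_rot, 2):
        theta = pos * base ** (-i / d_rot)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

At `pos=0` the rotation is the identity; non-rotated dimensions carry position-free content channels.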
Weight Tying
The input embeddings and output projection share weights.
parameters: null
Gated Attention
Attention uses a gating mechanism.
parameters: null
Value Residual
Includes a value residual pathway.
parameters: null
BigramHash
Adds a bigram hash table embedding head.
parameters: {"dimensions":2048}
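A hash-table bigram embedding can be sketched like this: hash the (previous token, current token) pair into a fixed 2048-slot table and add that slot's embedding to the token embedding. The mixing constant below is an arbitrary choice for illustration, not taken from the PR.

```python
def bigram_slot(prev_token: int, token: int, table_size: int = 2048) -> int:
    """Hash the (prev_token, token) bigram into a fixed-size embedding table.
    1000003 is an arbitrary odd mixing constant, not from the PR."""
    return ((prev_token * 1000003) ^ token) % table_size

# At each position the model adds table[bigram_slot(prev, cur)] to the
# token embedding, giving cheap direct access to bigram statistics.
```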
VE128
Uses a value embedding head.
parameters: {"dimensions":96}
Quantization
QAT
bits: 4
scope: MLP
QAT
bits: 5
scope: attention
late QAT
bits: 4
scope: model
GPTQ
bits: null
scope: per-layer
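The QAT entries above train with fake-quantized weights. A minimal sketch of symmetric per-tensor fake quantization at 4 bits (the MLP setting); in real training code the rounding is paired with a straight-through gradient estimator, omitted here.

```python
def fake_quant(w, bits=4):
    """Symmetric per-tensor fake quantization: snap weights to a
    bits-wide signed integer grid, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for int4
    scale = max(abs(v) for v in w) / qmax or 1.0        # avoid div-by-zero
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in w]
    return [qi * scale for qi in q], q                  # (dequantized, int codes)
```

Training against the quantization grid is what lets the final artifact store int4 MLP weights with little quality loss.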
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":200}
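The EMA + SWA combination with the listed parameters can be sketched as below; scalar "weights" keep the example minimal, and the class name is hypothetical.

```python
class AveragedWeights:
    """EMA updated every step (decay 0.997) plus a plain running
    average (SWA) snapshotted every 200 steps."""
    def __init__(self, w, ema_decay=0.997, swa_every=200):
        self.ema, self.decay = w, ema_decay
        self.swa_sum, self.swa_n, self.every = 0.0, 0, swa_every

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step % self.every == 0:          # periodic SWA snapshot
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)
```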
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"adamw_for_scalars":true}
Evaluation
sliding window eval
parameters: {"stride":256}
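Sliding-window evaluation with stride 256 scores each chunk of new tokens while conditioning on up to a full context window of preceding tokens. A sketch of the span plan (function name hypothetical):

```python
def eval_spans(n_tokens, window=2048, stride=256):
    """Plan sliding-window eval: each step scores `stride` fresh tokens
    [pos, end) while feeding the model context [start, end)."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)       # context window ending at `end`
        spans.append((start, pos, end))
        pos = end
    return spans
```

Every token is scored exactly once, and the stride trades compute (more forward passes) for context length per scored token.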
Test-Time Training
qTTT
parameters: {"learning_rate":0.002,"epochs":3,"target":"qo_bank"}
Compression
ANS + brotli
level: null
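The hybrid scheme compresses each tensor with both codecs and keeps whichever stream is smaller. The stdlib has neither an ANS coder nor Brotli, so zlib and lzma stand in below; the selection-plus-tag logic is the point, not the codecs.

```python
import lzma
import zlib

def pack_tensor(raw: bytes) -> bytes:
    """Per-tensor hybrid compression: try both codecs, keep the smaller
    stream, and prepend a 1-byte tag recording which codec won.
    (zlib/lzma stand in for the PR's ANS/Brotli pair.)"""
    a, b = zlib.compress(raw, 9), lzma.compress(raw)
    return (b"\x00" + a) if len(a) <= len(b) else (b"\x01" + b)

def unpack_tensor(blob: bytes) -> bytes:
    codec = zlib.decompress if blob[:1] == b"\x00" else lzma.decompress
    return codec(blob[1:])
```

Per-tensor selection helps because quantized integer tensors and float scale tables have very different statistics, so no single codec wins everywhere.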
Sequence Length
sequence_length
train_length: 2048
eval_length: 32000
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
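The warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final 3500 steps. As a multiplier on the base LR (function name hypothetical):

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Trapezoidal 'warmdown' schedule: constant LR, then a linear
    decay to zero over the final `warmdown_steps` steps."""
    steps_left = total_steps - step
    return 1.0 if steps_left >= warmdown_steps else steps_left / warmdown_steps
```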
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • 13-layer int4 MLP transformer with extra depth to offset aggressive quantization
  • qTTT test-time training on the Q-projection bank
  • QAT precompile warmup to avoid torch.compile recompilation stalls when late QAT activates
  • Hybrid ANS/Brotli artifact compression choosing the smaller encoding per tensor
  • Adaptive Hessian-weighted GPTQ auto-clip over multiple sigma candidates
  • Document-boundary (varlen) attention used during training only, with varlen disabled at eval time so the fused TTT sliding-window evaluation can run
  • Fused TTT plus sliding-window evaluation within the 600-second budget