PR #596 (closed)

Record: DeepQuant V10b — 11L INT6 + 8ep LoRA TTT (val_bpb=0.6430)
by AriaAnimaView on GitHub
val_bpb: 0.6430
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.73 MB

Training Techniques

Quantization
  • int6 (bits: 6, scope: all)
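
A minimal numpy sketch of what symmetric per-tensor int6 quantization over all weights could look like; the scaling scheme and function names below are assumptions, not taken from the PR:

```python
import numpy as np

# Signed int6 covers [-32, 31]; a symmetric scheme uses [-31, 31].
QMAX = 2 ** (6 - 1) - 1  # 31

def quantize_int6(w: np.ndarray):
    """Map float weights to integers in [-31, 31] plus one float scale."""
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; error is at most scale / 2."""
    return q.astype(np.float32) * scale
```

The 6-bit codes would then be bit-packed before compression; at scope "all", every weight tensor goes through this path.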
Architecture
  • BigramHash: adds hashed bigram context via BigramHash(2048) and SmearGate; parameters: {"size":2048}
  • SmearGate: parameter-efficient gating mechanism used with the bigram context; parameters: null
  • MLP3x: uses a 3x MLP expansion in the transformer blocks; parameters: {"expansion":3}
  • KV head count: uses 4 KV heads with 8 attention heads (GQA); parameters: {"attention_heads":8,"kv_heads":4}
  • Depth recurrence: uses U-Net skip connections between encoder/decoder layer pairs and depth-scaled residuals; parameters: {"layers":11}
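
Two of these architecture tricks can be sketched compactly in numpy. The hash function, gate form, and all dimensions below are assumptions; only the table size 2048 and the 8/4 head split come from the record:

```python
import numpy as np

TABLE_SIZE = 2048  # BigramHash(2048)
D = 16             # toy embedding width

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_SIZE, D)) * 0.02
gate_logits = np.zeros(D)  # SmearGate-style per-channel gate (learned)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Cheap multiplicative hash of the (prev, cur) token pair.
    return ((prev_tok * 1_000_003) ^ cur_tok) % TABLE_SIZE

def add_bigram_context(x: np.ndarray, tokens: list) -> np.ndarray:
    """Gate hashed-bigram features into the activations x[t]."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid in (0, 1)
    out = x.copy()
    for t in range(1, len(tokens)):
        b = bigram_bucket(tokens[t - 1], tokens[t])
        out[t] = x[t] + gate * bigram_table[b]
    return out

def expand_kv(kv: np.ndarray, attention_heads: int = 8, kv_heads: int = 4):
    """GQA: each KV head serves attention_heads / kv_heads = 2 query
    heads, so KV tensors are repeated along the head axis."""
    return np.repeat(kv, attention_heads // kv_heads, axis=0)
```

Because bigram statistics are hashed into a fixed 2048-slot table, the extra context costs only 2048 x D parameters plus the gate, regardless of vocabulary size.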
Optimizer
  • Muon (weight_decay: null, momentum: null); other_params: {"newton_schulz_whitening":true,"adamw_for_scalars_embeddings":true}
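
The Newton-Schulz whitening flagged above is the core of Muon: each 2-D update matrix is approximately orthogonalized before being applied (scalars and embeddings go through AdamW instead). The quintic coefficients below follow the commonly published Muon iteration; treat the exact constants and step count as assumptions:

```python
import numpy as np

def newton_schulz_whiten(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize g via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

After a few steps the singular values of the update are pushed toward 1, so every direction of the gradient gets a similar step size.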
Weight Averaging
  • EMA: parameters: {"decay":0.999,"every_steps":10}
  • SWA: parameters: {"checkpoints":12,"phase":"final warmdown"}
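
A toy sketch of the two averaging schemes, using plain dicts of arrays as stand-in state dicts (the decay, update cadence, and checkpoint count come from the record; everything else is illustrative):

```python
import numpy as np

def ema_update(ema: dict, params: dict, decay: float = 0.999) -> None:
    """EMA of the weights, called every 10th optimizer step (every_steps=10)."""
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]

def swa_average(checkpoints: list) -> dict:
    """SWA: uniform average of the final-warmdown checkpoints (12 here)."""
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in checkpoints[0]}
```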
Compression
  • zstd (level: 22)
Test-Time Training
  • LoRA TTT: parameters: {"rank_qv":8,"rank_lm_head":16,"epochs":8,"learning_rate":0.01,"chunk_size":256,"batch_size":64,"min_doc_length":512,"max_doc_length":50000,"temperature":0.98,"bias_tuning":true,"score_every_epoch":true,"wall_clock_limit_s":570}
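
A schematic of the per-document TTT loop implied by these parameters: rank-8 LoRA adapters on q/v, a rank-16 adapter on lm_head, 8 epochs over 256-token chunks, with re-scoring after every epoch. The actual forward/backward pass is elided; only the LoRA shape algebra and the chunk schedule are shown:

```python
import numpy as np

def lora_delta(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """LoRA weight update: W_eff = W + B @ A, with rank bounded by r
    (r = 8 for q/v projections, r = 16 for the lm_head here)."""
    return B @ A

def ttt_chunks(doc_tokens, epochs: int = 8, chunk_size: int = 256):
    """Yield (epoch, chunk) pairs in training order. The document is
    re-scored after each epoch (score_every_epoch=true)."""
    chunks = [doc_tokens[i:i + chunk_size]
              for i in range(0, len(doc_tokens), chunk_size)]
    for epoch in range(epochs):
        for chunk in chunks:
            yield epoch, chunk
```

If the 570 s wall-clock budget (wall_clock_limit_s) is exhausted, the run falls back to scoring with the base model, per the contributions list below.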
LR Schedule
  • warmdown: parameters: {"wallclock_based":true}
  • cosine decay: parameters: {"min_lr_fraction":0.1,"within_ttt":true}
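
The cosine decay inside TTT can be sketched as follows: the LoRA learning rate starts at 0.01 and decays to min_lr_fraction x base (0.001) over the 8 TTT epochs. The per-epoch granularity is an assumption; the main run's wall-clock-based warmdown is a separate schedule:

```python
import math

def ttt_cosine_lr(epoch: int, total_epochs: int = 8,
                  base_lr: float = 0.01,
                  min_lr_fraction: float = 0.1) -> float:
    """Cosine decay from base_lr to base_lr * min_lr_fraction within TTT."""
    min_lr = base_lr * min_lr_fraction
    t = epoch / max(total_epochs - 1, 1)  # 0 at first epoch, 1 at last
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```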
Regularization
  • weight decay: parameters: null
  • pruning: parameters: {"magnitude_pruning_percent":4}
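
A sketch of magnitude pruning at 4%: the smallest-magnitude 4% of weights are zeroed, which also helps the zstd stage above. Whether the threshold is per-tensor or global is not stated; the per-tensor form is shown as an assumption:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, percent: float = 4.0) -> np.ndarray:
    """Zero the smallest-magnitude `percent` of entries in w."""
    threshold = np.percentile(np.abs(w), percent)
    out = w.copy()
    out[np.abs(out) < threshold] = 0.0
    return out
```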
Other
  • Zigzag GPU load balancing across 8 GPUs to reduce synchronization bottlenecks; parameters: {"gpus":8}
  • Outlier document filtering: documents over 50,000 tokens are scored with the base model, without TTT; parameters: {"max_doc_length":50000}
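
The two scheduling tricks above can be sketched together. The zigzag (boustrophedon) scheme below is one plausible reading of the record: documents are sorted by length and dealt to the 8 GPUs in alternating order so per-GPU token totals stay balanced, and over-long documents skip TTT entirely:

```python
def zigzag_assign(doc_lengths, gpus: int = 8):
    """Return a gpu id per document, balancing total tokens per GPU."""
    order = sorted(range(len(doc_lengths)),
                   key=lambda i: doc_lengths[i], reverse=True)
    assignment = [0] * len(doc_lengths)
    for row_start in range(0, len(order), gpus):
        row = order[row_start:row_start + gpus]
        if (row_start // gpus) % 2 == 1:
            row = row[::-1]  # reverse every other row: the zigzag
        for gpu, doc in enumerate(row):
            assignment[doc] = gpu
    return assignment

def needs_ttt(doc_length: int, max_doc_length: int = 50_000) -> bool:
    """Outlier filter: very long documents use the base model, no TTT."""
    return doc_length <= max_doc_length
```

With documents sorted by length, plain round-robin would give GPU 0 the longest document of every row; reversing alternate rows cancels that bias, so no GPU waits at the synchronization points.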

Novel Contributions

  • 8-epoch per-document LoRA test-time training
  • Score-every-epoch backward-looking TTT compliance
  • Cosine learning-rate decay within TTT
  • LM-head LoRA rank-16 adaptation
  • Per-block bias tuning during TTT
  • Post-TTT temperature rescaling
  • Zigzag GPU load balancing
  • Outlier document filtering for very long documents
  • Wall-clock-limited TTT with base-model fallback