PR #596 (closed)

Record: DeepQuant V10b — 11L INT6 + 8ep LoRA TTT (val_bpb=0.6430)
by AriaAnimaView on GitHub
val_bpb: 0.6430
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.73 MB

Training Techniques

Quantization
  • int6 (bits: 6, scope: all)
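
A minimal numpy sketch of what symmetric per-tensor int6 quantization over all weights could look like; the scaling scheme and function names below are assumptions, not taken from the PR:

```python
import numpy as np

# Signed int6 covers [-32, 31]; a symmetric scheme uses [-31, 31].
QMAX = 2 ** (6 - 1) - 1  # 31

def quantize_int6(w: np.ndarray):
    """Map float weights to integers in [-31, 31] plus one float scale."""
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; error is at most scale / 2."""
    return q.astype(np.float32) * scale
```

The 6-bit codes would then be bit-packed before compression; at scope "all", every weight tensor goes through this path.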
Architecture
  • BigramHash: adds hashed bigram context via BigramHash(2048) and SmearGate; parameters: {"size":2048}
  • SmearGate: parameter-efficient gating mechanism used with the bigram context; parameters: null
  • MLP3x: uses a 3x MLP expansion in the transformer blocks; parameters: {"expansion":3}
  • KV head count: uses 4 KV heads with 8 attention heads (GQA); parameters: {"attention_heads":8,"kv_heads":4}
  • Depth recurrence: uses U-Net skip connections between encoder/decoder layer pairs and depth-scaled residuals; parameters: {"layers":11}
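
Two of these architecture tricks can be sketched compactly in numpy. The hash function, gate form, and all dimensions below are assumptions; only the table size 2048 and the 8/4 head split come from the record:

```python
import numpy as np

TABLE_SIZE = 2048  # BigramHash(2048)
D = 16             # toy embedding width

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_SIZE, D)) * 0.02
gate_logits = np.zeros(D)  # SmearGate-style per-channel gate (learned)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Cheap multiplicative hash of the (prev, cur) token pair.
    return ((prev_tok * 1_000_003) ^ cur_tok) % TABLE_SIZE

def add_bigram_context(x: np.ndarray, tokens: list) -> np.ndarray:
    """Gate hashed-bigram features into the activations x[t]."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid in (0, 1)
    out = x.copy()
    for t in range(1, len(tokens)):
        b = bigram_bucket(tokens[t - 1], tokens[t])
        out[t] = x[t] + gate * bigram_table[b]
    return out

def expand_kv(kv: np.ndarray, attention_heads: int = 8, kv_heads: int = 4):
    """GQA: each KV head serves attention_heads / kv_heads = 2 query
    heads, so KV tensors are repeated along the head axis."""
    return np.repeat(kv, attention_heads // kv_heads, axis=0)
```

Because bigram statistics are hashed into a fixed 2048-slot table, the extra context costs only 2048 x D parameters plus the gate, regardless of vocabulary size.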
Optimizer
  • Muon (weight_decay: null, momentum: null); other_params: {"newton_schulz_whitening":true,"adamw_for_scalars_embeddings":true}
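
The Newton-Schulz whitening flagged above is the core of Muon: each 2-D update matrix is approximately orthogonalized before being applied (scalars and embeddings go through AdamW instead). The quintic coefficients below follow the commonly published Muon iteration; treat the exact constants and step count as assumptions:

```python
import numpy as np

def newton_schulz_whiten(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize g via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

After a few steps the singular values of the update are pushed toward 1, so every direction of the gradient gets a similar step size.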
Weight Averaging
  • EMA: parameters: {"decay":0.999,"every_steps":10}
  • SWA: parameters: {"checkpoints":12,"phase":"final warmdown"}
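
A toy sketch of the two averaging schemes, using plain dicts of arrays as stand-in state dicts (the decay, update cadence, and checkpoint count come from the record; everything else is illustrative):

```python
import numpy as np

def ema_update(ema: dict, params: dict, decay: float = 0.999) -> None:
    """EMA of the weights, called every 10th optimizer step (every_steps=10)."""
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]

def swa_average(checkpoints: list) -> dict:
    """SWA: uniform average of the final-warmdown checkpoints (12 here)."""
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in checkpoints[0]}
```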
Compression
  • zstd (level: 22)
Test-Time Training
  • LoRA TTT: parameters: {"rank_qv":8,"rank_lm_head":16,"epochs":8,"learning_rate":0.01,"chunk_size":256,"batch_size":64,"min_doc_length":512,"max_doc_length":50000,"temperature":0.98,"bias_tuning":true,"score_every_epoch":true,"wall_clock_limit_s":570}
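
A schematic of the per-document TTT loop implied by these parameters: rank-8 LoRA adapters on q/v, a rank-16 adapter on lm_head, 8 epochs over 256-token chunks, with re-scoring after every epoch. The actual forward/backward pass is elided; only the LoRA shape algebra and the chunk schedule are shown:

```python
import numpy as np

def lora_delta(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """LoRA weight update: W_eff = W + B @ A, with rank bounded by r
    (r = 8 for q/v projections, r = 16 for the lm_head here)."""
    return B @ A

def ttt_chunks(doc_tokens, epochs: int = 8, chunk_size: int = 256):
    """Yield (epoch, chunk) pairs in training order. The document is
    re-scored after each epoch (score_every_epoch=true)."""
    chunks = [doc_tokens[i:i + chunk_size]
              for i in range(0, len(doc_tokens), chunk_size)]
    for epoch in range(epochs):
        for chunk in chunks:
            yield epoch, chunk
```

If the 570 s wall-clock budget (wall_clock_limit_s) is exhausted, the run falls back to scoring with the base model, per the contributions list below.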
LR Schedule
  • warmdown: parameters: {"wallclock_based":true}
  • cosine decay: parameters: {"min_lr_fraction":0.1,"within_ttt":true}
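
The cosine decay inside TTT can be sketched as follows: the LoRA learning rate starts at 0.01 and decays to min_lr_fraction x base (0.001) over the 8 TTT epochs. The per-epoch granularity is an assumption; the main run's wall-clock-based warmdown is a separate schedule:

```python
import math

def ttt_cosine_lr(epoch: int, total_epochs: int = 8,
                  base_lr: float = 0.01,
                  min_lr_fraction: float = 0.1) -> float:
    """Cosine decay from base_lr to base_lr * min_lr_fraction within TTT."""
    min_lr = base_lr * min_lr_fraction
    t = epoch / max(total_epochs - 1, 1)  # 0 at first epoch, 1 at last
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```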
Regularization
  • weight decay: parameters: null
  • pruning: parameters: {"magnitude_pruning_percent":4}
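
A sketch of magnitude pruning at 4%: the smallest-magnitude 4% of weights are zeroed, which also helps the zstd stage above. Whether the threshold is per-tensor or global is not stated; the per-tensor form is shown as an assumption:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, percent: float = 4.0) -> np.ndarray:
    """Zero the smallest-magnitude `percent` of entries in w."""
    threshold = np.percentile(np.abs(w), percent)
    out = w.copy()
    out[np.abs(out) < threshold] = 0.0
    return out
```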
Other
  • Zigzag GPU load balancing across 8 GPUs to reduce synchronization bottlenecks; parameters: {"gpus":8}
  • Outlier document filtering: documents over 50,000 tokens are scored with the base model, without TTT; parameters: {"max_doc_length":50000}
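
The two scheduling tricks above can be sketched together. The zigzag (boustrophedon) scheme below is one plausible reading of the record: documents are sorted by length and dealt to the 8 GPUs in alternating order so per-GPU token totals stay balanced, and over-long documents skip TTT entirely:

```python
def zigzag_assign(doc_lengths, gpus: int = 8):
    """Return a gpu id per document, balancing total tokens per GPU."""
    order = sorted(range(len(doc_lengths)),
                   key=lambda i: doc_lengths[i], reverse=True)
    assignment = [0] * len(doc_lengths)
    for row_start in range(0, len(order), gpus):
        row = order[row_start:row_start + gpus]
        if (row_start // gpus) % 2 == 1:
            row = row[::-1]  # reverse every other row: the zigzag
        for gpu, doc in enumerate(row):
            assignment[doc] = gpu
    return assignment

def needs_ttt(doc_length: int, max_doc_length: int = 50_000) -> bool:
    """Outlier filter: very long documents use the base model, no TTT."""
    return doc_length <= max_doc_length
```

With documents sorted by length, plain round-robin would give GPU 0 the longest document of every row; reversing alternate rows cancels that bias, so no GPU waits at the synchronization points.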

Novel Contributions

  • 8-epoch per-document LoRA test-time training
  • Score-every-epoch backward-looking TTT compliance
  • Cosine learning-rate decay within TTT
  • LM-head LoRA rank-16 adaptation
  • Per-block bias tuning during TTT
  • Post-TTT temperature rescaling
  • Zigzag GPU load balancing
  • Outlier document filtering for very long documents
  • Wall-clock-limited TTT with base-model fallback