PR #1935 (open)

Record candidate: PR #1855 + TTT_LORA_RANK=56 — val_bpb 1.05997 (s42)

val_bpb: 1.0600
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,953,743 B

Training Techniques

Test-Time Training
  • LoRA TTT (parameters: {"rank":56})
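The record only fixes the LoRA rank (56); everything else below is an illustrative assumption. A minimal sketch of a rank-56 LoRA adapter on a frozen weight, with the adapter factors as the only test-time-trainable parameters (dimensions, `alpha`, and the zero-init convention are assumed, not from the PR):

```python
import numpy as np

# Hypothetical sketch: W_eff = W + (alpha / r) * B @ A, with only A and B
# updated at test time. B is zero-initialized, so the adapter starts as a no-op.
rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 256, 256, 56, 56.0    # rank 56 from the record

W = rng.standard_normal((d_out, d_in))           # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, rank))                      # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank correction; only A and B would receive gradients.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
assert np.allclose(lora_forward(x), x @ W.T)     # zero-init: no change yet
```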
Architecture
  • SmearGate: BOS-fixed SmearGate used in the inherited #1855 stack. (parameters: null)
  • SparseAttnGate: sparse attention gating used in the inherited #1855 stack. (parameters: null)
  • CaseOps tokenizer: reserves operator tokens for case transformations to make the vocab case-insensitive and free capacity for content tokens. (parameters: null)
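A hedged sketch of the case-operator idea: the vocabulary stores only lowercase forms, and reserved operator tokens carry the case back at decode time. The marker names and word-level granularity here are assumptions; the actual CaseOps scheme may differ.

```python
# Hypothetical case-operator pre-tokenizer: lowercase everything, emit a
# reserved operator token in front of capitalized or all-caps words.
CAP, UPPER = "<cap>", "<upper>"   # assumed reserved tokens, not from the PR

def encode_case(text):
    out = []
    for word in text.split():
        if word.isupper() and len(word) > 1:
            out += [UPPER, word.lower()]
        elif word[:1].isupper():
            out += [CAP, word.lower()]
        else:
            out.append(word)
    return out

def decode_case(tokens):
    out, op = [], None
    for t in tokens:
        if t in (CAP, UPPER):
            op = t
        else:
            if op == CAP:
                t = t.capitalize()
            elif op == UPPER:
                t = t.upper()
            out.append(t)
            op = None
    return " ".join(out)

s = "NASA Launches rockets"
assert decode_case(encode_case(s)) == s   # lossless round trip
```

The payoff is that "The", "the", and "THE" all share one content token, so the freed vocabulary slots can go to additional content merges.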
Quantization
  • GPTQ (bits: 6, scope: body weights)
  • mixed int6/int8 (bits: null, scope: embeddings/body variants explored)
  • INT5/INT4 embed quant (bits: 5, scope: embeddings)
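For orientation, a sketch of the per-channel symmetric int6 grid (levels −32…31) that 6-bit body weights land on. GPTQ itself chooses roundings with Hessian-aware error compensation; that solver is omitted here, so this shows only the target grid, not the GPTQ algorithm, and the per-row scaling choice is an assumption.

```python
import numpy as np

# Hedged sketch: round-to-nearest onto a per-row symmetric 6-bit grid.
def quant_int6(W):
    qmax = 2 ** (6 - 1) - 1                          # 31
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64)).astype(np.float32)  # toy "body weight"
q, s = quant_int6(W)
err = np.abs(dequant(q, s) - W).max()                # bounded by half a step
```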
Weight Averaging
  • EMA (parameters: {"post_ema_mask_reapply":true})
Compression
  • lrzip (level: null)
Regularization
  • LQER (parameters: {"top_k":4,"rank":6})
  • sparse embeddings (parameters: {"sparsity":0.9})
  • magnitude pruning (parameters: {"top_k_mask":true})
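The rank-6 setting above suggests an LQER-style correction: quantize a weight coarsely, then approximate the quantization residual with a low-rank SVD term. A minimal sketch under assumptions (the int4 quantizer and matrix sizes are illustrative; only "rank": 6 comes from the record):

```python
import numpy as np

# Hedged sketch of low-rank quantization-error reconstruction:
# W ≈ Wq + U_r S_r V_r, where the rank-r term absorbs most of W - Wq.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)

qmax = 7                                          # assumed int4 symmetric grid
scale = np.abs(W).max() / qmax
Wq = np.round(W / scale).clip(-8, 7) * scale      # coarse quantized weight

E = W - Wq                                        # quantization error
U, S, Vt = np.linalg.svd(E, full_matrices=False)
r = 6                                             # rank 6 from the record
E_lowrank = (U[:, :r] * S[:r]) @ Vt[:r]           # rank-6 error reconstruction

base = np.linalg.norm(E)                          # error without correction
corrected = np.linalg.norm(E - E_lowrank)         # error with correction
```

The low-rank factors are cheap to store next to the quantized weight, which is why this pairs naturally with the artifact-size budget.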
LR Schedule
  • warmdown (parameters: {"warmdown_frac":0.85})
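A sketch of one common reading of this schedule: hold the learning rate flat, then decay it linearly to zero over the final `warmdown_frac` of training. That the record's 0.85 denotes the decay portion is an assumption.

```python
# Hedged sketch of a flat-then-linear "warmdown" LR schedule.
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.85):
    decay_steps = int(total_steps * warmdown_frac)
    flat_steps = total_steps - decay_steps
    if step < flat_steps:
        return base_lr                       # flat phase
    # Linear decay: base_lr at the start of warmdown, 0 at the final step.
    return base_lr * (total_steps - step) / decay_steps
```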
Other
  • Per-row top-K sparse FP16 embedding training with post-EMA mask reapplication to preserve sparsity at serialization time. (parameters: {"embed_lr":0.3})
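The interaction between the sparse embeddings and the post-EMA mask reapplication above can be sketched as follows. EMA mixes in dense history, so averaging would silently refill the zeros; reapplying a top-K mask after the EMA step restores per-row sparsity before serialization. Table sizes, K, and the EMA decay are illustrative assumptions (K = 10 of 100 matches the recorded 0.9 sparsity).

```python
import numpy as np

# Keep only the K largest-magnitude entries in each embedding row.
def topk_mask(E, k):
    idx = np.argsort(-np.abs(E), axis=1)[:, :k]
    mask = np.zeros_like(E, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

rng = np.random.default_rng(0)
E = rng.standard_normal((16, 100)).astype(np.float32)   # toy embedding table
k = 10                                                  # 90% sparsity per row
E *= topk_mask(E, k)                                    # sparse training state

ema = rng.standard_normal(E.shape).astype(np.float32)   # stale dense EMA state
ema = 0.9 * ema + 0.1 * E                               # EMA update densifies!
ema *= topk_mask(ema, k)                                # post-EMA mask reapply
assert (ema != 0).sum(axis=1).max() <= k                # sparsity preserved
```

Without the final masking step, the serialized table would be dense and compress far worse, defeating the 16 MB artifact budget.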

Novel Contributions

  • Lowered TTT LoRA rank from 80 to 56 and improved val_bpb to 1.05997.
  • Raised QK_GAIN_INIT from 5.0 to 6.0 as a local optimum in this lineage.
  • Identified an inverted-U rank ablation with rank 56 as the best setting.
  • Validated the submission on both MI250X and H100 with matching results.
  • Used per-group lrzip artifact packing to fit under the 16 MB cap.