PR #1935 (open)

Record candidate: PR #1855 + TTT_LORA_RANK=56 — val_bpb 1.05997 (s42)

val_bpb: 1.0600
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,953,743 B

Training Techniques

Test-Time Training
  • LoRA TTT (parameters: {"rank":56})
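The record only fixes the LoRA rank (56); everything else below is an illustrative assumption. A minimal sketch of a rank-56 LoRA adapter on a frozen weight, with the adapter factors as the only test-time-trainable parameters (dimensions, `alpha`, and the zero-init convention are assumed, not from the PR):

```python
import numpy as np

# Hypothetical sketch: W_eff = W + (alpha / r) * B @ A, with only A and B
# updated at test time. B is zero-initialized, so the adapter starts as a no-op.
rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 256, 256, 56, 56.0    # rank 56 from the record

W = rng.standard_normal((d_out, d_in))           # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, rank))                      # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank correction; only A and B would receive gradients.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
assert np.allclose(lora_forward(x), x @ W.T)     # zero-init: no change yet
```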
Architecture
  • SmearGate: BOS-fixed SmearGate used in the inherited #1855 stack. (parameters: null)
  • SparseAttnGate: sparse attention gating used in the inherited #1855 stack. (parameters: null)
  • CaseOps tokenizer: reserves operator tokens for case transformations to make the vocab case-insensitive and free capacity for content tokens. (parameters: null)
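A hedged sketch of the case-operator idea: the vocabulary stores only lowercase forms, and reserved operator tokens carry the case back at decode time. The marker names and word-level granularity here are assumptions; the actual CaseOps scheme may differ.

```python
# Hypothetical case-operator pre-tokenizer: lowercase everything, emit a
# reserved operator token in front of capitalized or all-caps words.
CAP, UPPER = "<cap>", "<upper>"   # assumed reserved tokens, not from the PR

def encode_case(text):
    out = []
    for word in text.split():
        if word.isupper() and len(word) > 1:
            out += [UPPER, word.lower()]
        elif word[:1].isupper():
            out += [CAP, word.lower()]
        else:
            out.append(word)
    return out

def decode_case(tokens):
    out, op = [], None
    for t in tokens:
        if t in (CAP, UPPER):
            op = t
        else:
            if op == CAP:
                t = t.capitalize()
            elif op == UPPER:
                t = t.upper()
            out.append(t)
            op = None
    return " ".join(out)

s = "NASA Launches rockets"
assert decode_case(encode_case(s)) == s   # lossless round trip
```

The payoff is that "The", "the", and "THE" all share one content token, so the freed vocabulary slots can go to additional content merges.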
Quantization
  • GPTQ (bits: 6, scope: body weights)
  • mixed int6/int8 (bits: null, scope: embeddings/body variants explored)
  • INT5/INT4 embed quant (bits: 5, scope: embeddings)
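For orientation, a sketch of the per-channel symmetric int6 grid (levels −32…31) that 6-bit body weights land on. GPTQ itself chooses roundings with Hessian-aware error compensation; that solver is omitted here, so this shows only the target grid, not the GPTQ algorithm, and the per-row scaling choice is an assumption.

```python
import numpy as np

# Hedged sketch: round-to-nearest onto a per-row symmetric 6-bit grid.
def quant_int6(W):
    qmax = 2 ** (6 - 1) - 1                          # 31
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64)).astype(np.float32)  # toy "body weight"
q, s = quant_int6(W)
err = np.abs(dequant(q, s) - W).max()                # bounded by half a step
```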
Weight Averaging
  • EMA (parameters: {"post_ema_mask_reapply":true})
Compression
  • lrzip (level: null)
Regularization
  • LQER (parameters: {"top_k":4,"rank":6})
  • sparse embeddings (parameters: {"sparsity":0.9})
  • magnitude pruning (parameters: {"top_k_mask":true})
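The rank-6 setting above suggests an LQER-style correction: quantize a weight coarsely, then approximate the quantization residual with a low-rank SVD term. A minimal sketch under assumptions (the int4 quantizer and matrix sizes are illustrative; only "rank": 6 comes from the record):

```python
import numpy as np

# Hedged sketch of low-rank quantization-error reconstruction:
# W ≈ Wq + U_r S_r V_r, where the rank-r term absorbs most of W - Wq.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)

qmax = 7                                          # assumed int4 symmetric grid
scale = np.abs(W).max() / qmax
Wq = np.round(W / scale).clip(-8, 7) * scale      # coarse quantized weight

E = W - Wq                                        # quantization error
U, S, Vt = np.linalg.svd(E, full_matrices=False)
r = 6                                             # rank 6 from the record
E_lowrank = (U[:, :r] * S[:r]) @ Vt[:r]           # rank-6 error reconstruction

base = np.linalg.norm(E)                          # error without correction
corrected = np.linalg.norm(E - E_lowrank)         # error with correction
```

The low-rank factors are cheap to store next to the quantized weight, which is why this pairs naturally with the artifact-size budget.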
LR Schedule
  • warmdown (parameters: {"warmdown_frac":0.85})
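A sketch of one common reading of this schedule: hold the learning rate flat, then decay it linearly to zero over the final `warmdown_frac` of training. That the record's 0.85 denotes the decay portion is an assumption.

```python
# Hedged sketch of a flat-then-linear "warmdown" LR schedule.
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.85):
    decay_steps = int(total_steps * warmdown_frac)
    flat_steps = total_steps - decay_steps
    if step < flat_steps:
        return base_lr                       # flat phase
    # Linear decay: base_lr at the start of warmdown, 0 at the final step.
    return base_lr * (total_steps - step) / decay_steps
```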
Other
  • Per-row top-K sparse FP16 embedding training with post-EMA mask reapplication to preserve sparsity at serialization time. (parameters: {"embed_lr":0.3})
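The interaction between the sparse embeddings and the post-EMA mask reapplication above can be sketched as follows. EMA mixes in dense history, so averaging would silently refill the zeros; reapplying a top-K mask after the EMA step restores per-row sparsity before serialization. Table sizes, K, and the EMA decay are illustrative assumptions (K = 10 of 100 matches the recorded 0.9 sparsity).

```python
import numpy as np

# Keep only the K largest-magnitude entries in each embedding row.
def topk_mask(E, k):
    idx = np.argsort(-np.abs(E), axis=1)[:, :k]
    mask = np.zeros_like(E, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

rng = np.random.default_rng(0)
E = rng.standard_normal((16, 100)).astype(np.float32)   # toy embedding table
k = 10                                                  # 90% sparsity per row
E *= topk_mask(E, k)                                    # sparse training state

ema = rng.standard_normal(E.shape).astype(np.float32)   # stale dense EMA state
ema = 0.9 * ema + 0.1 * E                               # EMA update densifies!
ema *= topk_mask(ema, k)                                # post-EMA mask reapply
assert (ema != 0).sum(axis=1).max() <= k                # sparsity preserved
```

Without the final masking step, the serialized table would be dense and compress far worse, defeating the 16 MB artifact budget.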

Novel Contributions

  • Lowered TTT LoRA rank from 80 to 56 and improved val_bpb to 1.05997.
  • Raised QK_GAIN_INIT from 5.0 to 6.0 as a local optimum in this lineage.
  • Identified an inverted-U rank ablation with rank 56 as the best setting.
  • Validated the submission on both MI250X and H100 with matching results.
  • Used per-group lrzip artifact packing to fit under the 16 MB cap.