PR #2026

open

[DRAFT] Record: Variable-Rank LQER + Muon-TTT — val_bpb TBD

by RahimMirani

val_bpb

1.0611

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.9 MB

Training Techniques

Architecture

XSA

Applied XSA to all 11 layers.

parameters: {"layers":11}

LeakyReLU

MLP uses LeakyReLU squared activation.

parameters: null
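
For illustration, a squared LeakyReLU can be sketched as below. The card gives no parameters, so the negative slope of 0.5 and the reading of "squared" as squaring the LeakyReLU output are assumptions, not values confirmed by this PR.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Apply LeakyReLU, then square the result.

    negative_slope=0.5 is an assumed value; the PR card lists no parameters.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y


if __name__ == "__main__":
    for value in (-2.0, 0.0, 2.0):
        print(value, leaky_relu_squared(value))
```

Note that the output is non-negative for every input, unlike a plain LeakyReLU.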

U-Net skip connections

Encoder-decoder skip connections with skip gates.

parameters: null

depth recurrence

Layers 3-5 are looped three times once the threshold fraction (threshold_frac = 0.35) is reached.

parameters: {"layers":[3,5],"repeats":3,"threshold_frac":0.35}
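
One way to read these parameters is as a layer execution order that changes once a progress fraction crosses the threshold. The sketch below assumes 0-indexed layers, an inclusive 3-5 range, and that threshold_frac refers to the fraction of training completed; the function name layer_schedule is hypothetical.

```python
def layer_schedule(progress, num_layers=11, loop_start=3, loop_end=5,
                   repeats=3, threshold_frac=0.35):
    """Return the order in which layer indices are executed.

    Assumptions (not confirmed by the PR): layers are 0-indexed, the
    looped range is inclusive, and `progress` is the fraction of
    training completed so far.
    """
    looping = progress >= threshold_frac
    order = []
    for i in range(num_layers):
        if looping and i == loop_start:
            # Emit the whole looped block `repeats` times in a row.
            order.extend(list(range(loop_start, loop_end + 1)) * repeats)
        elif looping and loop_start < i <= loop_end:
            continue  # already emitted as part of the looped block
        else:
            order.append(i)
    return order


if __name__ == "__main__":
    print(layer_schedule(progress=0.10))
    print(layer_schedule(progress=0.50))
```

Under these assumptions the effective depth grows from 11 to 17 layer applications without adding parameters.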

RoPE

Partial RoPE with YaRN scaling.

parameters: {"dimensions":16,"total_dimensions":64}
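
As a rough sketch of partial RoPE, the function below rotates only 16 of the 64 head dimensions and passes the rest through unchanged. The choice of the first 16 dimensions, the half-split pairing, and the base of 10000 are assumptions; YaRN scaling is deliberately omitted.

```python
import math


def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` entries of `vec`; leave the rest as-is.

    Assumptions: the rotated dims are the leading ones, pairs are formed
    as (i, i + rope_dims // 2), and base=10000. YaRN scaling is not
    modeled here.
    """
    half = rope_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rope_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


if __name__ == "__main__":
    head = [float(i + 1) for i in range(64)]
    rotated = partial_rope(head, position=3)
    print(rotated[16:] == head[16:])
```

Because each pair is rotated, the norm of the rotated slice is preserved, and the remaining 48 dimensions carry no positional signal.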

SmearGate

Position-mixing gate with BOS leak fix using a not-BOS mask.

parameters: null
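
The BOS leak fix can be pictured with a toy version of the gate. This sketch assumes an additive smear of the previous token's embedding with a single scalar gate, which may differ from the PR's actual gate; the function name smear and the token ids are hypothetical. The point shown is only the not-BOS mask: a BOS token in a packed stream receives nothing from the last token of the preceding document.

```python
def smear(tokens, embeddings, gate, bos_id):
    """Mix each token's embedding with its predecessor's, except at BOS.

    Assumed form (not taken from the PR): out[t] = emb[t] + gate *
    not_bos[t] * emb[t - 1], with a scalar gate.
    """
    out = []
    for t, (tok, emb) in enumerate(zip(tokens, embeddings)):
        not_bos = 0.0 if tok == bos_id else 1.0
        prev = embeddings[t - 1] if t > 0 else [0.0] * len(emb)
        out.append([e + gate * not_bos * p for e, p in zip(emb, prev)])
    return out


if __name__ == "__main__":
    # Two packed documents: [BOS, 5] and [BOS, 7]; BOS id 0 is illustrative.
    toks = [0, 5, 0, 7]
    embs = [[1.0], [2.0], [3.0], [4.0]]
    print(smear(toks, embs, gate=0.5, bos_id=0))
```

Without the mask, the second BOS would pick up a contribution from token 5 of the previous document.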

Gated Attention

Sparse attention gate on head outputs.

parameters: {"gate_window":12}

weight tying

Tied embeddings are used.

parameters: null

Optimizer

Muon

weight_decay: null

momentum: 0.9

other_params: {"steps":5,"backend":"Polar-Express Newton-Schulz"}
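
To illustrate what a 5-step Newton-Schulz orthogonalization does, here is a sketch using the classic cubic iteration X <- 1.5 X - 0.5 X X^T X. This is a stand-in, not the Polar-Express variant named above, whose per-step coefficients are not listed on this card. The helper names are hypothetical and matrices are plain nested lists.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def newton_schulz(g, steps=5):
    """Approximately orthogonalize `g` with the classic cubic iteration.

    Stand-in for the PR's Polar-Express backend (coefficients unknown
    here). Frobenius normalization keeps all singular values <= 1, which
    is inside the cubic iteration's convergence region.
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-7) for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxtx)]
    return x


if __name__ == "__main__":
    print(newton_schulz([[3.0, 0.0], [0.0, 4.0]]))
```

In a Muon-style update, the matrix passed in would be the momentum-averaged gradient, and the orthogonalized result replaces it as the update direction.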

Weight Averaging

EMA

parameters: {"decay":0.9965}
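
A minimal sketch of the EMA update with the listed decay follows. Weights are shown as a flat list of floats for simplicity; how often the average is updated and which tensors it covers are not stated on this card.

```python
def ema_update(avg, weights, decay=0.9965):
    """Blend current weights into the running average.

    avg_new = decay * avg + (1 - decay) * weights
    """
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]


if __name__ == "__main__":
    running = [1.0, -2.0]
    running = ema_update(running, [0.0, 0.0])
    print(running)
```

With decay 0.9965, each update moves the average 0.35% of the way toward the current weights.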

Quantization

GPTQ

bits: 6

scope: matrix weights

GPTQ

bits: 7

scope: embeddings

GPTQ

bits: 8

scope: attention gate

int4

bits: 4

scope: LQER correction

Compression

custom

level: null

Test-Time Training

LoRA TTT

parameters: {"rank":80,"phases":3,"prefix_docs":2500}

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85,"min_lr":0.1}
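
One plausible reading of these parameters is sketched below: the learning rate stays at its peak for the first 15% of steps, then decays linearly over the remaining 85% to 0.1 of the peak. The linear shape and the interpretation of min_lr as a fraction of the peak are assumptions.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85, min_lr=0.1):
    """Return the factor applied to the peak learning rate at `step`.

    Assumptions: linear decay, and `min_lr` is a fraction of the peak
    rather than an absolute learning rate.
    """
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step <= warmdown_start:
        return 1.0
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return min_lr + (1.0 - min_lr) * max(remaining, 0.0)


if __name__ == "__main__":
    for s in (0, 150, 575, 1000):
        print(s, round(lr_multiplier(s, 1000), 4))
```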

Regularization

logit softcap

parameters: {"value":30}
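
Logit softcapping is commonly written as cap * tanh(logit / cap); the sketch below uses that form with the listed value of 30. Whether the PR applies it to output logits, attention logits, or both is not stated here.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    for value in (-1000.0, 1.0, 1000.0):
        print(value, softcap(value))
```

Small logits pass through almost unchanged, while large ones saturate near the cap.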

layerwise LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
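
The listed scale formula can be evaluated per layer as follows. The sketch assumes 0-indexed layers and the 11-layer depth mentioned earlier on this card; where exactly the factor is applied (for example, to the normalized activations) is not specified.

```python
import math


def ln_scale(layer):
    """Scale factor 1/sqrt(layer + 1) for a 0-indexed layer (assumed)."""
    return 1.0 / math.sqrt(layer + 1)


if __name__ == "__main__":
    for layer in range(11):
        print(layer, round(ln_scale(layer), 4))
```

Deeper layers therefore receive progressively smaller scales, from 1.0 at the first layer down to about 0.30 at the eleventh.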

Novel Contributions

Variable-rank Hessian-allocated LQER while preserving total rank budget
Muon-style Newton-Schulz update direction for TTT behind an environment flag
Longer phased TTT prefix as a runtime-only knob
Byte-accounting sanity assertion before claiming results
BOS leak fix for SmearGate in packed validation streams
Per-group compression pipeline using lrzip/ZPAQ on hot tensor groups with similarity-sorted rows
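
The first contribution, allocating LQER rank per tensor while keeping the total fixed, can be sketched as a budgeted proportional split. Everything here is illustrative: the function name allocate_ranks, the use of one scalar sensitivity score per tensor (standing in for a Hessian-derived quantity), and the largest-remainder rounding are assumptions, not the PR's method.

```python
def allocate_ranks(scores, total_rank, min_rank=1):
    """Split `total_rank` across tensors in proportion to `scores`.

    Hypothetical sketch: `scores` are assumed positive sensitivity
    values. Every tensor gets at least `min_rank`, and the returned
    ranks always sum to exactly `total_rank`.
    """
    n = len(scores)
    spare = total_rank - min_rank * n
    if spare < 0:
        raise ValueError("total_rank is too small for min_rank per tensor")
    total = sum(scores)
    shares = [spare * s / total for s in scores]
    ranks = [min_rank + int(sh) for sh in shares]
    # Hand out ranks lost to rounding, largest fractional part first.
    leftover = total_rank - sum(ranks)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                         reverse=True)
    for i in by_fraction[:leftover]:
        ranks[i] += 1
    return ranks


if __name__ == "__main__":
    print(allocate_ranks([1.0, 3.0, 4.0], total_rank=12))
```

The fixed sum is what keeps the artifact size comparable to a uniform-rank baseline while shifting capacity toward more sensitive tensors.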