PR #2026

open

[DRAFT] Record: Variable-Rank LQER + Muon-TTT — val_bpb TBD

by RahimMirani
val_bpb
1.0611
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.9 MB

Training Techniques

Architecture
XSA
Applied XSA to all 11 layers.
parameters: {"layers":11}
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: null
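
As a rough illustration of the LeakyReLU squared activation named above, the sketch below applies a leaky ReLU and then squares the result. The function name and the negative slope of 0.5 are assumptions for illustration; this summary does not state the slope the PR uses.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Leaky ReLU followed by squaring.

    negative_slope is an assumed value, not taken from the PR.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Squaring makes the output non-negative on both sides of zero, so the negative branch contributes a smaller, but non-zero, activation.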
U-Net skip connections
Encoder-decoder skip connections with skip gates.
parameters: null
depth recurrence
Repeats layers 3-5 three times once the threshold fraction is reached.
parameters: {"layers":[3,5],"repeats":3,"threshold_frac":0.35}
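
One way to read the depth recurrence entry above is as a change in the order in which layers are executed. The sketch below is an interpretation, not the PR's code: it assumes 0-based inclusive layer indices, that "repeats" counts total passes through the looped block, and that the threshold is compared against a hypothetical `progress` value (fraction of training elapsed).

```python
def layer_schedule(num_layers, loop_start, loop_end, repeats,
                   progress, threshold_frac):
    """Return the order in which layer indices are executed.

    Before the threshold, every layer runs once. Afterwards, the block
    loop_start..loop_end (inclusive) runs `repeats` times in total.
    """
    if progress < threshold_frac:
        return list(range(num_layers))
    block = list(range(loop_start, loop_end + 1))
    before = list(range(loop_start))
    after = list(range(loop_end + 1, num_layers))
    return before + block * repeats + after
```

With 11 layers and the parameters listed above, the effective depth under these assumptions grows from 11 to 17 layer applications without adding parameters.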
RoPE
Partial RoPE with YaRN scaling.
parameters: {"dimensions":16,"total_dimensions":64}
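
To make the partial RoPE parameters concrete, the sketch below rotates only the first 16 of 64 head dimensions and passes the remaining 48 through unchanged. The base of 10000, the half-split pairing of dimensions, and the function name are assumptions; YaRN frequency scaling is deliberately left out, so this shows plain partial RoPE only.

```python
import math

def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first rope_dims entries of vec.

    Dimension i is paired with dimension i + rope_dims // 2 (an assumed
    convention). Entries beyond rope_dims are returned untouched.
    """
    half = rope_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair is only rotated, the vector norm is preserved, and position 0 leaves the vector unchanged.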
SmearGate
Position-mixing gate with BOS leak fix using a not-BOS mask.
parameters: null
Gated Attention
Sparse attention gate on head outputs.
parameters: {"gate_window":12}
weight tying
Tied embeddings are used.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: 0.9
other_params: {"steps":5,"backend":"Polar-Express Newton-Schulz"}
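
For orientation, the sketch below shows a five-step Newton-Schulz iteration of the kind Muon uses to approximately orthogonalize an update matrix. It uses the fixed quintic coefficients commonly seen in public Muon implementations, not the Polar-Express coefficient schedule named above, whose values are not given in this summary. The helper names are illustrative, and the pure-Python matrix code is only meant for small examples.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Push the singular values of g toward 1 with a quintic iteration.

    coeffs are the commonly used fixed Muon values (an assumption here),
    not the Polar-Express schedule referenced in the PR.
    """
    a, b, c = coeffs
    # Normalize by the Frobenius norm so all singular values are <= 1.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-7
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        gram = matmul(x, transpose(x))      # X X^T
        gram2 = matmul(gram, gram)          # (X X^T)^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        # X <- a X + (b X X^T + c (X X^T)^2) X
        x = [[a * v + w for v, w in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return x
```

The iteration does not converge to exactly orthogonal output; with these coefficients the singular values end up in a band around 1, which is usually considered sufficient for an update direction.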
Weight Averaging
EMA
parameters: {"decay":0.9965}
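
The EMA entry above corresponds to the usual exponential moving average of weights. A minimal sketch with the listed decay, using flat lists of floats in place of real parameter tensors (the function name is illustrative):

```python
def ema_update(ema, weights, decay=0.9965):
    """Return the new EMA after blending in the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

With a decay of 0.9965, each step moves the average 0.35% of the way toward the current weights.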
Quantization
GPTQ
bits: 6
scope: matrix weights
GPTQ
bits: 7
scope: embeddings
GPTQ
bits: 8
scope: attention gate
int4
bits: 4
scope: LQER correction
Compression
custom
level: null
Test-Time Training
LoRA TTT
parameters: {"rank":80,"phases":3,"prefix_docs":2500}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"min_lr":0.1}
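
One plausible reading of the warmdown parameters above is sketched below. It assumes that the learning-rate multiplier stays at 1.0 for the first 15% of steps, then decays linearly over the final 85%, and that `min_lr` is a fraction of the peak rate; the summary does not confirm the decay shape or the meaning of `min_lr`.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85, min_lr=0.1):
    """Learning-rate multiplier under an assumed linear warmdown."""
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step <= warmdown_start:
        return 1.0
    # Fraction of the warmdown phase completed, in (0, 1].
    t = (step - warmdown_start) / (total_steps - warmdown_start)
    return 1.0 + t * (min_lr - 1.0)
```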
Regularization
logit softcap
parameters: {"value":30}
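
Logit softcapping is commonly implemented as a scaled tanh, which keeps logits within plus or minus the cap while leaving small values nearly unchanged. A minimal sketch with the listed value of 30, assuming that form (the function name is illustrative):

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the range [-cap, cap]."""
    return cap * math.tanh(logit / cap)
```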
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}

Novel Contributions

  • Variable-rank, Hessian-allocated LQER that preserves the total rank budget
  • Muon-style Newton-Schulz update direction for TTT behind an environment flag
  • Longer phased TTT prefix as a runtime-only knob
  • Byte-accounting sanity assertion before claiming results
  • BOS leak fix for SmearGate in packed validation streams
  • Per-group compression pipeline using lrzip/ZPAQ on hot tensor groups with similarity-sorted rows
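
The first contribution above, variable-rank allocation under a fixed total rank budget, can be sketched as a proportional split. This is an illustration only: the `scores` stand in for some Hessian-derived sensitivity per matrix, and the largest-remainder rounding rule, the minimum rank, and the function name are assumptions rather than the PR's actual allocation rule.

```python
def allocate_ranks(scores, total_rank, min_rank=1):
    """Split total_rank across matrices in proportion to their scores.

    Every matrix gets at least min_rank; the remainder is distributed
    proportionally, with largest-remainder rounding so the ranks sum to
    exactly total_rank.
    """
    n = len(scores)
    spare = total_rank - min_rank * n
    if spare < 0:
        raise ValueError("total_rank is too small for the minimum rank")
    total = sum(scores)
    shares = [spare * s / total for s in scores]
    ranks = [min_rank + int(sh) for sh in shares]
    leftover = total_rank - sum(ranks)
    # Hand the remaining units to the largest fractional parts.
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        ranks[i] += 1
    return ranks
```

For example, `allocate_ranks([4.0, 1.0, 1.0, 2.0], 32)` gives the most sensitive matrix the largest rank while the four ranks still sum to 32, the same total a uniform rank of 8 would use.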