PR #2143

open

Non-record submission: post-deadline CaseOps + SparseAttnGate + Phased TTT (1.07134 BPB)

by upascal

val_bpb

1.0713

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.87 MB

Training Techniques

Architecture

weight tying

Tied input and output embeddings.

parameters: null
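
As an illustration of weight tying, here is a minimal sketch in which one matrix serves both the input lookup and the output projection. The sizes and the names `embed` and `output_logits` are hypothetical and not taken from the PR.

```python
import random

vocab_size, dim = 5, 4          # toy sizes, chosen only for the example
random.seed(0)
# One shared matrix: row i is the embedding of token i.
embedding = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_id):
    """Input side: look up the row for a token."""
    return embedding[token_id]

def output_logits(hidden):
    """Output side: score every token by its dot product with the same rows."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding]

logits = output_logits(embed(2))
```

Because both sides read the same rows, the matrix is stored once, which is what makes tying useful under an artifact size limit.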

Partial RoPE

Uses partial rotary positional embeddings.

parameters: {"rope_fraction":"16/64"}
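
A rope_fraction of 16/64 suggests that 16 of the 64 dimensions in each head are rotated and the other 48 pass through unchanged. The sketch below assumes a half-split pairing (dimension i with dimension i + 8) and a frequency base of 10000. Neither detail is stated in this summary, and `partial_rope` is a hypothetical name.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; pass the rest through."""
    half = rotary_dims // 2
    out = list(vec)
    for i in range(half):
        # Assumed layout: dimension i is paired with dimension i + half.
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

head = [float(i + 1) for i in range(64)]   # one 64-dim head vector
rotated = partial_rope(head, position=7)
```

The unrotated dimensions carry no positional signal, so attention can use them for purely content-based matching.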

LeakyReLU

MLP uses LeakyReLU squared activation.

parameters: {"negative_slope":0.5}
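
One plausible reading of "LeakyReLU squared" is to apply LeakyReLU with the listed slope and then square the result elementwise. The sketch below assumes that reading. In particular, it assumes the square is not sign-preserving.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by an elementwise square (assumed form)."""
    leaky = x if x >= 0.0 else negative_slope * x
    return leaky * leaky

values = [leaky_relu_squared(v) for v in (-2.0, 0.0, 2.0)]
```

Under this reading, negative inputs still produce a nonzero output and gradient, unlike plain ReLU squared.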

depth recurrence

Layers 3-5 are looped twice starting at fraction 0.35.

parameters: {"layers":[3,4,5],"loops":2,"start_fraction":0.35}
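
The parameters above can be read as a layer schedule. The sketch below rests on three assumptions: the model has 12 layers, "loops": 2 means two total passes over layers 3-5, and start_fraction is the fraction of training after which looping switches on. None of these is confirmed by the summary, and `layer_schedule` is a hypothetical helper.

```python
def layer_schedule(num_layers, loop_layers, loops, progress, start_fraction):
    """Return the order in which layer indices run at a given training progress."""
    looping = progress >= start_fraction
    order = []
    for layer in range(num_layers):
        order.append(layer)
        if looping and layer == loop_layers[-1]:
            # Re-run the looped span until it has been visited `loops` times.
            for _ in range(loops - 1):
                order.extend(loop_layers)
    return order

early = layer_schedule(12, [3, 4, 5], 2, progress=0.10, start_fraction=0.35)
late = layer_schedule(12, [3, 4, 5], 2, progress=0.50, start_fraction=0.35)
```

Looping reuses the same weights, so it adds compute depth without adding parameters to the artifact.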

parallel residuals

Layers 7-11 use simple parallel attention+MLP residual summation.

parameters: {"layers":[7,8,9,10,11]}
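
To make the difference from a standard block concrete, here is a sketch that contrasts sequential and parallel residual wiring. The toy scalar functions stand in for the attention and MLP sublayers, and normalization is omitted. All names are hypothetical.

```python
def sequential_block(x, attn, mlp):
    """Standard wiring: the MLP sees the output of the attention residual."""
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    """Parallel wiring: both sublayers read the same input and are summed."""
    return x + attn(x) + mlp(x)

toy_attn = lambda v: 2.0 * v     # stand-in for attention
toy_mlp = lambda v: v + 1.0      # stand-in for the MLP

seq_out = sequential_block(1.0, toy_attn, toy_mlp)
par_out = parallel_block(1.0, toy_attn, toy_mlp)
```

In the parallel form the two sublayers have no data dependency on each other, so they can be computed independently.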

SmearGate

BOS-masked token mixing gate with a fixed window.

parameters: {"gate_window":12}

SparseAttnGate

Per-head zero-init sigmoid gate on attention output.

parameters: {"params_per_layer":96}
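
A per-head sigmoid gate with zero-initialized weights could look like the sketch below. The 8 x 12 weight shape is an assumption, chosen only because it gives the 96 parameters listed above. The choice of gate input (the first 12 features of the block input) is also an assumption.

```python
import math

NUM_HEADS, GATE_INPUTS = 8, 12   # assumed shape: 8 * 12 = 96 parameters

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_heads(head_outputs, x, gate_weight):
    """Scale each head's output by a sigmoid gate computed from the input x."""
    gated = []
    for h, head_out in enumerate(head_outputs):
        logit = sum(w * xi for w, xi in zip(gate_weight[h], x[:GATE_INPUTS]))
        g = sigmoid(logit)
        gated.append([g * v for v in head_out])
    return gated

# Zero-init: every gate logit is 0, so every gate starts at sigmoid(0) = 0.5.
gate_weight = [[0.0] * GATE_INPUTS for _ in range(NUM_HEADS)]
head_outputs = [[1.0, 2.0] for _ in range(NUM_HEADS)]
x = [0.3] * 16
gated = gate_heads(head_outputs, x, gate_weight)
```

With zero initialization all heads start out equally scaled, and training can then learn to suppress individual heads per token.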

Optimizer

Muon

weight_decay: 0.095

momentum: null

other_params: {"matrix_lr":0.026,"warmdown_frac":0.85,"min_lr":0.1,"ema_decay":0.9965}

Weight Averaging

EMA

parameters: {"decay":0.9965}
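
The EMA update with the listed decay can be sketched as follows. The function name and the flat-list representation of the weights are illustrative only.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])   # the live weight is held at 1.0
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 steps.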

Regularization

logit softcap

parameters: {"value":30}
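
A common way to implement a logit softcap is the tanh form shown below. The summary does not say which form the PR uses, so treat this as an assumption.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)

capped = [softcap(v) for v in (0.0, 1.0, 1000.0)]
```

Small logits pass through almost unchanged, while very large ones saturate at the cap value.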

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
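
One reading of warmdown_frac 0.85 together with the min_lr of 0.1 listed under the optimizer is a schedule that holds the peak rate for the first 15% of training and then decays to 10% of peak over the remaining 85%. The sketch below assumes the decay is linear and that min_lr is a fraction of the peak rate.

```python
def lr_multiplier(progress, warmdown_frac=0.85, min_lr=0.1):
    """Learning-rate multiplier for training progress in [0, 1]."""
    start = 1.0 - warmdown_frac
    if progress <= start:
        return 1.0
    t = (progress - start) / warmdown_frac      # 0 at warmdown start, 1 at the end
    return 1.0 - t * (1.0 - min_lr)

schedule = [lr_multiplier(p) for p in (0.0, 0.1, 0.575, 1.0)]
```

Multiplying by the listed matrix_lr of 0.026 would give the per-step rate for the matrix parameters under these assumptions.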

Quantization

GPTQ

bits: null

scope: model weights

mixed int5/int6/int7

bits: null

scope: q/proj/mlp_proj, kv/mlp_fc, tok_emb
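
To show how the bit width trades off against error, here is a sketch of symmetric round-to-nearest quantization at 5, 6, and 7 bits. This is deliberately simpler than GPTQ, which also compensates for rounding error using calibration data. The example weights are made up.

```python
def quantize(values, bits):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

weights = [0.82, -0.31, 0.05, -0.97, 0.44]   # made-up example values
max_error = {}
for bits in (5, 6, 7):
    codes, scale = quantize(weights, bits)
    restored = [c * scale for c in codes]
    max_error[bits] = max(abs(w - r) for w, r in zip(weights, restored))
```

Each extra bit roughly halves the step size, so spending int7 on the most sensitive tensors and int5 elsewhere trades accuracy against artifact size.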

Hadamard rotation

bits: null

scope: quantization preprocessing
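
The idea behind a Hadamard rotation before quantization can be sketched on a single weight row. An orthogonal rotation spreads an outlier across all coordinates without changing the layer output, provided the input is rotated the same way. The 4-dimensional example and helper names are illustrative only.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

H = hadamard(4)
w = [8.0, 0.0, 0.0, 0.0]    # one weight row with a single outlier
x = [1.0, 2.0, 3.0, 4.0]    # an input vector
w_rot = matvec(H, w)        # the outlier is spread evenly: [4, 4, 4, 4]
x_rot = matvec(H, x)
```

The rotated row has a smaller maximum magnitude, so a fixed number of quantization levels covers it with a finer step.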

LQER

bits: 4

scope: attn_proj and mlp_proj
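
LQER-style correction approximates the quantization error with a low-rank term that is added back to the quantized weights. The sketch below uses a rank-1 correction found by power iteration on a made-up 3 x 3 matrix with a coarse rounding grid. How the 4-bit setting above enters the PR's implementation is not specified in this summary.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def frob(m):
    return math.sqrt(sum(v * v for row in m for v in row))

def rank1_factor(err, iters=50):
    """Power iteration for the top right singular direction of `err`."""
    v = [1.0 / math.sqrt(len(err[0]))] * len(err[0])
    err_t = [list(col) for col in zip(*err)]
    for _ in range(iters):
        v = matvec(err_t, matvec(err, v))
        norm = math.sqrt(sum(a * a for a in v)) or 1.0
        v = [a / norm for a in v]
    return matvec(err, v), v     # u carries the singular value; v is unit length

W = [[0.9, -0.4, 0.1], [0.3, 0.8, -0.7], [-0.6, 0.2, 0.5]]   # made-up weights
W_q = [[round(v / 0.5) * 0.5 for v in row] for row in W]      # coarse grid
err = [[a - b for a, b in zip(r, rq)] for r, rq in zip(W, W_q)]
u, v = rank1_factor(err)
# Residual after adding the rank-1 correction u v^T to the quantized weights.
residual = [[e - ui * vj for e, vj in zip(row, v)] for row, ui in zip(err, u)]
before, after = frob(err), frob(residual)
```

Storing u and v costs far fewer values than the full matrix, which is why a low-rank correction is a cheap way to recover accuracy.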

Test-Time Training

LoRA TTT

parameters: {"rank":80,"alpha":144,"phases":3,"prefix_docs":2500,"learning_rate":0.0001}
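
With rank 80 and alpha 144, the conventional LoRA scaling factor alpha / rank is 1.8. The sketch below shows a standard LoRA forward pass using that scale on toy matrices of rank 2. The zero initialization of the B matrix is the usual LoRA convention and is assumed here rather than confirmed by the PR.

```python
RANK, ALPHA = 80, 144
SCALE = ALPHA / RANK            # 1.8 under the conventional alpha / rank scaling

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(x, base_w, lora_a, lora_b, scale=SCALE):
    """y = W x + scale * B (A x), with W frozen and only A, B trained."""
    base = matvec(base_w, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, delta)]

# Toy shapes: 3 inputs, 2 outputs, adapter rank 2 (not the PR's rank 80).
base_w = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
lora_a = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
lora_b = [[0.0, 0.0], [0.0, 0.0]]      # zero-init: the adapter starts as a no-op
x = [1.0, 2.0, 3.0]
y = lora_forward(x, base_w, lora_a, lora_b)
```

Starting from a no-op adapter means each test-time training phase begins from the unmodified model and only moves away from it as gradients accumulate.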

Other

other

CaseOps tokenizer transform applied to SentencePiece tokenization.

parameters: {"tokenizer_vocab":12288}
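
The summary does not describe the CaseOps operators. As a purely hypothetical illustration of a lossless case transform, the sketch below marks each uppercase letter with a private-use character and lowercases it, so a tokenizer sees mostly lowercase text while the original string stays recoverable. The marker and function names are invented for this example.

```python
CAP = "\ue000"   # hypothetical marker from the Unicode private-use area

def _safe_upper(ch):
    """True if ch is uppercase and lower/upper round-trips to the same character."""
    low = ch.lower()
    return ch.isupper() and len(low) == 1 and low.upper() == ch

def encode_case(text):
    assert CAP not in text, "marker must not occur in the input"
    return "".join(CAP + ch.lower() if _safe_upper(ch) else ch for ch in text)

def decode_case(text):
    out, upper_next = [], False
    for ch in text:
        if ch == CAP:
            upper_next = True          # the next character is restored to uppercase
        else:
            out.append(ch.upper() if upper_next else ch)
            upper_next = False
    return "".join(out)

sample = "Hello World, NASA!"
encoded = encode_case(sample)
decoded = decode_case(encoded)
```

Characters whose case mapping does not round-trip are left untouched, which is what keeps the transform lossless in this sketch.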

other

CUDA graphs and fused softcapped cross-entropy Triton kernel used for training efficiency.

parameters: null

Novel Contributions

Lossless CaseOps tokenizer transform on top of SentencePiece
SparseAttnGate attention gating
Phased TTT with LoRA adaptation
Fixes for cu_seqlens plumbing in TTT global SGD
Fixes for parallel-lane mismatch in forward_ttt
Mixed-bit GPTQ with Hadamard rotation and LQER