PR #2143
open
Non-record submission: post-deadline CaseOps + SparseAttnGate + Phased TTT (1.07134 BPB)
by upascal
val_bpb
1.0713
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.87 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
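A minimal sketch of weight tying, assuming the usual formulation in which one table serves both the input lookup and the output projection. The class name and the toy 3x2 table are illustrative and not taken from the PR.

```python
class TiedEmbedding:
    """Toy tied embedding: one table is used for lookup and for output logits."""

    def __init__(self, table):
        self.table = table  # shape: vocab x dim, shared by both directions

    def embed(self, token_id):
        # Input side: row lookup.
        return self.table[token_id]

    def logits(self, hidden):
        # Output side: dot product of the hidden state with every row.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.table]


emb = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(emb.embed(2))
print(emb.logits([2.0, 3.0]))
```

Sharing the table removes a separate vocab x dim output matrix from the parameter count, which matters under an artifact size limit.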
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"rope_fraction":"16/64"}
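One way to read "rope_fraction": "16/64" is that 16 of 64 dimensions per head are rotated and the remaining 48 pass through unchanged. The sketch below follows that reading. The "rotate-half" pairing and the base of 10000 are assumptions, not details confirmed by the PR.

```python
import math


def partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` entries of one head vector; keep the rest."""
    assert rope_dims % 2 == 0 and rope_dims <= len(vec)
    half = rope_dims // 2
    out = list(vec)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]  # assumed pairing: dim i with dim i + half
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head = [float(i) for i in range(64)]
rotated = partial_rope(head, pos=7)
print(rotated[16:] == head[16:])  # the untouched 48 dims are identical
```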
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: {"negative_slope":0.5}
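As a sketch, "LeakyReLU squared" can be read as applying LeakyReLU with the listed negative slope and then squaring the result elementwise. That ordering is an assumption here. The function name is illustrative.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by an elementwise square (assumed ordering)."""
    y = x if x >= 0 else negative_slope * x
    return y * y


print(leaky_relu_squared(2.0))
print(leaky_relu_squared(-2.0))
```

Under this reading a negative input still produces a positive, smaller output rather than zero, so some gradient flows on the negative side.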
depth recurrence
Layers 3-5 are looped twice starting at fraction 0.35.
parameters: {"layers":[3,4,5],"loops":2,"start_fraction":0.35}
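The looping can be pictured as a layer execution order. The sketch below assumes that "loops": 2 means the block runs twice in total, that start_fraction is a fraction of training progress, and that the model has 12 layers (the highest layer index mentioned on this page is 11). None of these are confirmed by the PR, and the helper name is hypothetical.

```python
def layer_order(num_layers, loop_layers, loops, progress, start_fraction):
    """Return the order in which layers are applied at a given training progress."""
    if progress < start_fraction:
        return list(range(num_layers))  # recurrence not yet enabled
    first, last = min(loop_layers), max(loop_layers)
    before = list(range(first))
    block = list(range(first, last + 1)) * loops  # repeated block
    after = list(range(last + 1, num_layers))
    return before + block + after


print(layer_order(12, [3, 4, 5], 2, progress=0.10, start_fraction=0.35))
print(layer_order(12, [3, 4, 5], 2, progress=0.50, start_fraction=0.35))
```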
parallel residuals
Layers 7-11 use simple parallel attention+MLP residual summation.
parameters: {"layers":[7,8,9,10,11]}
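For contrast, here is a toy comparison of a sequential residual block and a parallel one. Normalization layers are omitted, and the attention and MLP sublayers are stand-in scalar functions, so this only illustrates the wiring, not the PR's actual block.

```python
def sequential_block(x, attn, mlp):
    """Standard wiring: the MLP sees the output of the attention residual."""
    x = x + attn(x)
    return x + mlp(x)


def parallel_block(x, attn, mlp):
    """Parallel wiring: both sublayers read the same input and are summed."""
    return x + attn(x) + mlp(x)


attn = lambda v: 2.0 * v  # stand-in sublayer
mlp = lambda v: v + 1.0   # stand-in sublayer
print(sequential_block(1.0, attn, mlp))
print(parallel_block(1.0, attn, mlp))
```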
SmearGate
BOS-masked token mixing gate with a fixed window.
parameters: {"gate_window":12}
SparseAttnGate
Per-head zero-init sigmoid gate on attention output.
parameters: {"params_per_layer":96}
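A rough sketch of a per-head sigmoid gate with zero-initialized weights follows. The factorization of the 96 parameters into 8 heads x 12 gate inputs is an assumption chosen only because it matches the count. The PR may compute the gate differently, and the function name is illustrative.

```python
import math


def sparse_attn_gate(x, head_outputs, weights):
    """Scale each head's output by sigmoid(w_h . x[:gate_inputs])."""
    gated = []
    for w_h, out_h in zip(weights, head_outputs):
        # zip stops at len(w_h), so only the first gate inputs of x are used.
        logit = sum(w * xi for w, xi in zip(w_h, x))
        g = 1.0 / (1.0 + math.exp(-logit))
        gated.append([g * v for v in out_h])
    return gated


num_heads, gate_inputs = 8, 12                # assumed: 8 * 12 = 96 parameters
weights = [[0.0] * gate_inputs for _ in range(num_heads)]  # zero init
x = [0.5] * 16                                # toy token features
heads = [[2.0, 4.0] for _ in range(num_heads)]
print(sparse_attn_gate(x, heads, weights)[0])
```

With zero-initialized weights every gate starts at sigmoid(0) = 0.5, so all heads are scaled uniformly until training moves the weights.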
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.026,"warmdown_frac":0.85,"min_lr":0.1,"ema_decay":0.9965}
Weight Averaging
EMA
parameters: {"decay":0.9965}
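The standard exponential moving average update with the listed decay can be sketched as follows. The flat list of floats stands in for model parameters, and the function name is illustrative.

```python
def ema_update(ema, weights, decay=0.9965):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = [0.0]
for _ in range(5000):
    ema = ema_update(ema, [1.0])
print(ema)
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), about 286 steps.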
Regularization
logit softcap
parameters: {"value":30}
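Logit softcapping is commonly implemented as cap * tanh(logit / cap). The sketch below assumes that form with the listed value of 30.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)


for v in (0.0, 1.0, 50.0, 1000.0):
    print(v, softcap(v))
```

Small logits pass through almost unchanged, while large ones saturate near the cap.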
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
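A possible shape for this schedule is sketched below. It assumes a linear decay over the final 85% of training, and that the "min_lr": 0.1 listed under the optimizer is a fraction of the peak learning rate. Both points are assumptions, and the function name is illustrative.

```python
def lr_multiplier(progress, warmdown_frac=0.85, min_lr=0.1):
    """Multiplier on the peak LR; `progress` runs from 0.0 to 1.0."""
    start = 1.0 - warmdown_frac
    if progress <= start:
        return 1.0  # flat phase before warmdown begins
    t = (progress - start) / warmdown_frac  # 0 at warmdown start, 1 at the end
    return 1.0 - t * (1.0 - min_lr)


peak = 0.026  # matrix_lr from the optimizer settings
for p in (0.0, 0.15, 0.5, 1.0):
    print(p, peak * lr_multiplier(p))
```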
Quantization
GPTQ
bits: null
scope: model weights
mixed int5/int6/int7
bits: null
scope: q/proj/mlp_proj, kv/mlp_fc, tok_emb
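To show what the bit width changes, here is a plain round-to-nearest symmetric quantizer with one scale per row. This is not GPTQ, which additionally compensates rounding error using second-order information. It only illustrates how int5, int6 and int7 trade precision for size. Function names and the sample row are illustrative.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in row)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    return [c * scale for c in codes]


row = [0.9, -0.31, 0.07, 0.5]
for bits in (5, 6, 7):
    codes, scale = quantize_row(row, bits)
    err = max(abs(a - b) for a, b in zip(row, dequantize_row(codes, scale)))
    print(bits, codes, err)
```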
Hadamard rotation
bits: null
scope: quantization preprocessing
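The idea behind rotating before quantization can be illustrated with a normalized fast Walsh-Hadamard transform: it is orthogonal, so it can be undone exactly, and it spreads a single outlier across all coordinates. Whether the PR uses this exact transform, or a randomized variant, is not stated, so treat this as a generic sketch with an illustrative function name.

```python
import math


def hadamard_rotate(v):
    """Normalized fast Walsh-Hadamard transform (length must be a power of 2)."""
    n = len(v)
    assert n > 0 and (n & (n - 1)) == 0
    out = list(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b  # butterfly step
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [x * s for x in out]


v = [8.0] + [0.0] * 7          # one large outlier
r = hadamard_rotate(v)
print(max(abs(x) for x in r))  # the peak is much smaller after rotation
```

Applying the same normalized transform twice recovers the input, so the rotation can be inverted after dequantization.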
LQER
bits: 4
scope: attn_proj and mlp_proj
Test-Time Training
LoRA TTT
parameters: {"rank":80,"alpha":144,"phases":3,"prefix_docs":2500,"learning_rate":0.0001}
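A minimal sketch of a LoRA forward pass, assuming the standard scaling of alpha / rank. With the listed values that is 144 / 80 = 1.8. The tiny matrices and helper names are illustrative only, and zero-initializing B is the usual LoRA convention rather than something confirmed by the PR.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / rank) * B (A x), with rank taken from A's row count."""
    scale = alpha / len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
A = [[1.0, 1.0]]              # rank-1 down projection
B = [[0.0], [0.0]]            # up projection, zero init
print(lora_forward(W, A, B, [1.0, 2.0], alpha=1.8))
```

With B at zero the adapter contributes nothing, so adaptation starts from the unmodified base model.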
Other
other
CaseOps tokenizer transform applied to SentencePiece tokenization.
parameters: {"tokenizer_vocab":12288}
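The PR page does not spell out how CaseOps works. One plausible reading of a lossless case transform is to lowercase words and record the original casing with marker tokens, so the tokenizer sees fewer case variants. The sketch below is a hypothetical toy version of that idea: the marker strings, the function names, and the restriction to ASCII text that does not already contain the markers are all assumptions.

```python
CAP, UPPER = "<cap>", "<upper>"  # hypothetical marker strings


def encode_case(text):
    out = []
    for w in text.split(" "):
        if w and w[0].isupper() and w[1:] == w[1:].lower():
            out.append(CAP + w.lower())    # Capitalized word
        elif len(w) > 1 and w.isupper():
            out.append(UPPER + w.lower())  # ALL-CAPS word
        else:
            out.append(w)                  # lowercase or mixed case: unchanged
    return " ".join(out)


def decode_case(text):
    out = []
    for w in text.split(" "):
        if w.startswith(CAP):
            body = w[len(CAP):]
            out.append(body[:1].upper() + body[1:])
        elif w.startswith(UPPER):
            out.append(w[len(UPPER):].upper())
        else:
            out.append(w)
    return " ".join(out)


s = "The GPU runs  iPhone apps, OK?"
print(encode_case(s))
print(decode_case(encode_case(s)) == s)
```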
other
CUDA graphs and fused softcapped cross-entropy Triton kernel used for training efficiency.
parameters: null
Novel Contributions
- Lossless CaseOps tokenizer transform on top of SentencePiece
- SparseAttnGate attention gating
- Phased TTT with LoRA adaptation
- Fixes for cu_seqlens plumbing in TTT global SGD
- Fixes for parallel-lane mismatch in forward_ttt
- Mixed-bit GPTQ with Hadamard rotation and LQER