PR #1269

open

[Submission] Jtss-ux - 1.1301 BPB (10min_16mb)

by Jtss-ux
val_bpb: 1.1194
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.95 MB

Training Techniques

Architecture
LeakyReLU
Uses a LeakyReLU(0.5)^2 activation in the MLP in place of ReLU^2, preserving gradient flow for negative pre-activations while keeping outputs non-negative.
parameters: {"negative_slope":0.5,"power":2}
BigramHash
Adds a hashed bigram (token-pair) embedding component.
parameters: {"size":1536}
XSA
Applies XSA in the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Applies RoPE to 16 of 64 dimensions; the remaining dimensions are left unrotated.
parameters: {"dimensions":16,"total_dimensions":64}
VE128
Adds 128-dimensional VE128 embeddings at layers 9 and 10.
parameters: {"layers":[9,10],"dimension":128}
Test-Time Training
score-first TTT (each chunk is scored before the model adapts on it)
parameters: {"chunk_tokens":32768,"learning_rate":0.002,"epochs":3,"momentum":0.9,"freeze_blocks":0,"grad_clip":1}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
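
The warmup parameters suggest a linear ramp of the Muon momentum from 0.92 to the final 0.99 over the first 1500 steps, roughly:

```python
def muon_momentum(step: int, base: float = 0.99,
                  warmup_start: float = 0.92, warmup_steps: int = 1500) -> float:
    # linear ramp 0.92 -> 0.99 over the first 1500 steps, then constant
    t = min(step / warmup_steps, 1.0)
    return warmup_start + t * (base - warmup_start)
```
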
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Quantization
GPTQ-lite
bits: 6
scope: all
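
The exact GPTQ-lite procedure isn't specified; as a stand-in, a per-channel symmetric round-to-nearest 6-bit quantizer looks like this (full GPTQ would add second-order error compensation):

```python
import torch

def quantize_int6(w: torch.Tensor):
    # w: (out_features, in_features); symmetric per-output-channel, 6 bits
    qmax = 2 ** 5 - 1                                      # 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                         # dequant: q * scale
```
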
Compression
lzma
level: null
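
The serialized artifact is then lzma-compressed; with level left null, Python's stdlib falls back to its default preset:

```python
import lzma

def compress_artifact(raw: bytes, level=None) -> bytes:
    # level is null in the submission; lzma.compress defaults to preset 6
    return lzma.compress(raw) if level is None else lzma.compress(raw, preset=level)
```
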
Evaluation
sliding window eval
parameters: {"stride":64}
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
Sequence Length
sequence_length
train_length: null
eval_length: 32768

Novel Contributions

  • LeakyReLU(0.5)^2 activation replacing ReLU^2
  • Rules-legal score-first test-time training, scoring each chunk under inference_mode before adapting on it
  • Parallel Muon optimizer with parameter banking
  • EMA plus Tight SWA weight averaging
  • GPTQ-lite int6 quantization with lzma compression