PR #1008

open

Add non-record unlimited-compute 11L LeakyTTT 16h local RTX 4060 Ti run

by monkeyKingProgrammer
val_bpb
1.1538
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,807,729 bytes

Training Techniques

Architecture
LeakyReLU
LeakyReLU^2 MLP activation
parameters: {"slope":0.5,"power":2}
XSA
XSA applied to the last 4 layers
parameters: {"last_n_layers":4}
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":"16/64"}
BigramHash
Bigram hash embedding
parameters: {"vocab_size":1536}
VE128
Value residual enhancement on selected layers
parameters: {"layers":[9,10],"dimension":128}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
Regularization
layerwise LN scale
parameters: null
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"momentum":0.9,"freeze_blocks":0,"batch_seqs":32,"grad_clip":1}
Evaluation
sliding window eval
parameters: {"stride":64}
Compression
lzma
level: null
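
A sketch of the export path referenced in the contributions list (int6 quantization plus lzma) to fit under the 16 MB artifact cap. The per-tensor symmetric scheme and storing one 6-bit value per int8 slot before compression are assumptions; the PR does not specify the packing, and `level: null` is read as the library default here.

```python
import io
import lzma
import torch

def export_artifact(model, path, bits=6):
    """Quantize float tensors to `bits`-bit symmetric integers, then lzma-compress."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for int6
    payload = {}
    for name, t in model.state_dict().items():
        if t.dtype.is_floating_point:
            scale = t.abs().max().clamp(min=1e-8) / qmax
            q = torch.clamp((t / scale).round(), -qmax - 1, qmax).to(torch.int8)
            payload[name] = {"q": q, "scale": scale}
        else:
            payload[name] = {"raw": t}
    buf = io.BytesIO()
    torch.save(payload, buf)
    with lzma.open(path, "wb", preset=9) as f:
        f.write(buf.getvalue())
```
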
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}

Novel Contributions

  • 11-layer local unlimited-compute run on a single RTX 4060 Ti 16GB
  • LeakyReLU^2 MLP stack combined with Parallel Muon, XSA, Partial RoPE, and layerwise LN scale
  • EMA + SWA training with legal score-first TTT
  • int6 + lzma export to fit under the 16MB artifact cap
  • Sliding-window evaluation followed by backward-looking test-time training