PR #1383

open

Non-record: Neuromodulatory Depth-Recurrent Transformer with FiLM-only TTT (WIP, val_bpb=1.3151)

by nirmathur
val_bpb: 1.3151
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 12.87 MB

Training Techniques

Architecture
depth recurrence
Shares transformer block weights across repeated virtual layers to reduce parameters while preserving depth.
parameters: {"physical_blocks":9,"virtual_layers":11,"shared_blocks":["3-4","9-10"]}
weight tying
Partial weight sharing between selected transformer blocks.
parameters: {"shared_pairs":["3-4","9-10"]}
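The depth-recurrence and weight-tying entries above can be read as a mapping from virtual layers to physical blocks. The following is a minimal sketch of that mapping, not the PR's code. It assumes the indices in `shared_pairs` are 0-based virtual layer indices and that a pair "a-b" means layers a and b run the same physical block; `build_layer_map` is a hypothetical helper name.

```python
def build_layer_map(virtual_layers, shared_pairs):
    """Map each virtual layer index to a physical block index.

    Assumption: a pair "a-b" means virtual layer b reuses the block of layer a.
    """
    alias = {}
    for pair in shared_pairs:
        first, second = (int(i) for i in pair.split("-"))
        alias[second] = first

    physical_of = {}  # virtual layer that owns a block -> physical block index
    layer_map = []
    for v in range(virtual_layers):
        if v in alias:
            # Reuse the block already assigned to the first layer of the pair.
            layer_map.append(physical_of[alias[v]])
        else:
            physical_of[v] = len(physical_of)
            layer_map.append(physical_of[v])
    return layer_map


layer_map = build_layer_map(11, ["3-4", "9-10"])
```

Under these assumptions, 11 virtual layers with two shared pairs resolve to 9 distinct physical blocks, which agrees with the `physical_blocks` value listed above.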
FiLM
Per-loop scale/shift conditioning vectors used to distinguish repeated executions of shared blocks.
parameters: {"pairs":4}
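A sketch of how per-loop scale/shift vectors could distinguish two executions of the same block. Where the modulation is applied inside the block, how the vectors are initialised, and how the four pairs are assigned are not stated in this summary; the table below is hypothetical and its values are made up for illustration.

```python
def film(hidden, scale, shift):
    """Feature-wise linear modulation: hidden * scale + shift, per feature."""
    return [h * g + b for h, g, b in zip(hidden, scale, shift)]


# Hypothetical table: one (scale, shift) pair per execution of a shared block,
# keyed by virtual layer index (assumed 0-based).
film_params = {
    3: ([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]),   # identity: input passes through unchanged
    4: ([2.0, 0.5, 1.0], [1.0, -1.0, 0.0]),  # illustrative values only
}

hidden = [1.0, 2.0, 3.0]
first_pass = film(hidden, *film_params[3])
second_pass = film(hidden, *film_params[4])
```

The same block weights see different inputs on each pass, so the two executions can behave differently even though they share parameters.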
LeakyReLU
LeakyReLU squared activation used in the base stack.
parameters: {"slope":0.5}
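For reference, a scalar sketch of a LeakyReLU-squared activation with the listed slope. It assumes the plain form (apply LeakyReLU, then square); whether the PR preserves the sign of negative inputs is not stated here.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU followed by squaring (assumed form; sign is not preserved)."""
    y = x if x >= 0 else slope * x
    return y * y
```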
XSA
XSA applied to the last 4 virtual layers.
parameters: {"last_n":4}
BigramHash
Bigram hash embedding component in the base stack.
parameters: {"vocab_size":1536}
VE128
Value enhancement module enabled on later layers.
parameters: {"dim":128,"layers":[9,10]}
U-Net skip connections
Skip connections retained from the PR #549 stack.
parameters: null
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
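The two averaging rules can be sketched on flat weight lists as follows. This is an illustration under assumptions: the summary does not say how the EMA and SWA copies are combined or when snapshotting begins, and the helper names are hypothetical.

```python
def ema_update(ema, weights, decay=0.997):
    """Exponential moving average: keep `decay` of the old average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def swa_update(swa, weights, n_snapshots):
    """Running mean over snapshots; n_snapshots counts those already averaged."""
    return [(s * n_snapshots + w) / (n_snapshots + 1) for s, w in zip(swa, weights)]


def should_snapshot(step, swa_every=50):
    """Take an SWA snapshot every `swa_every` steps (assumed to skip step 0)."""
    return step > 0 and step % swa_every == 0
```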
Quantization
int6 QAT
bits: 6
scope: model
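A minimal sketch of the fake-quantization step that 6-bit QAT typically relies on: weights are rounded to a 6-bit grid in the forward pass and immediately dequantized. The symmetric range of -31..31 and the per-tensor scale are assumptions; the PR's actual grouping and range are not given in this summary.

```python
def fake_quant_int6(weights):
    """Round weights to a symmetric signed 6-bit grid, then dequantize."""
    qmax = 31  # assumption: symmetric range -31..31
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax  # assumption: one scale per tensor
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


quantized = fake_quant_int6([-1.0, 0.4, 1.0])
```

During training, the rounding is usually treated as the identity in the backward pass (a straight-through estimator), so the full-precision weights keep receiving gradients.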
Evaluation
sliding window eval
parameters: {"stride":64}
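A sketch of how a stride-64 sliding-window evaluation can be laid out so that every token is scored exactly once with as much left context as the window allows. The context length is not listed in this summary, so `context_len` is a hypothetical parameter.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Return (window_start, window_end, score_start) index triples.

    Tokens in [score_start, window_end) are scored; tokens in
    [window_start, score_start) serve only as context.
    """
    spans = []
    score_start = 0
    while score_start < n_tokens:
        # The first window scores everything it contains; later windows
        # score only the newest `stride` tokens.
        step = context_len if score_start == 0 else stride
        window_end = min(score_start + step, n_tokens)
        window_start = max(0, window_end - context_len)
        spans.append((window_start, window_end, score_start))
        score_start = window_end
    return spans


spans = sliding_windows(n_tokens=200, context_len=128, stride=64)
```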
Test-Time Training
FiLM-only TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"momentum":0.9}
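The core of FiLM-only test-time training is that only the FiLM vectors are updated while the shared block weights stay frozen. The sketch below assumes SGD with momentum (the summary lists a learning rate and momentum but does not name the optimizer) and uses scalar parameters with a hypothetical `film.` name prefix to mark the trainable set.

```python
def ttt_step(params, grads, velocity, lr=0.002, momentum=0.9):
    """One SGD-with-momentum step that updates only FiLM parameters."""
    for name in params:
        if not name.startswith("film."):
            continue  # shared block weights are left untouched
        velocity[name] = momentum * velocity.get(name, 0.0) + grads[name]
        params[name] -= lr * velocity[name]
    return params


params = {"film.scale": 1.0, "block.weight": 2.0}
grads = {"film.scale": 0.5, "block.weight": 0.5}
velocity = {}
ttt_step(params, grads, velocity)  # first step
ttt_step(params, grads, velocity)  # second step reuses the momentum buffer
```

Because a shared block's weights receive gradient from every execution of that block, restricting updates to the per-loop FiLM vectors keeps each update local to one execution.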
Compression
lzma
level: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
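A sketch of a warmdown schedule with the listed `warmdown_iters`. It assumes the learning rate is held at its peak and then decays linearly to zero over the final 3500 iterations; the decay shape and the total iteration count are not stated in this summary, so `total_iters` is hypothetical.

```python
def lr_scale(step, total_iters, warmdown_iters=3500):
    """Learning-rate multiplier: 1.0 until the warmdown, then linear to 0."""
    remaining = total_iters - step
    if remaining >= warmdown_iters:
        return 1.0
    return max(0.0, remaining / warmdown_iters)
```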
Regularization
LN scale
parameters: null
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
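The momentum warmup parameters above can be read as a ramp from 0.92 to the final momentum of 0.99 over the first 1500 steps. The sketch assumes linear interpolation, which the summary does not confirm.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given step, assuming a linear ramp from start to end."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end
```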
Adam
weight_decay: 0.04
momentum: null
other_params: {"used_for":"FiLM parameters"}

Novel Contributions

  • Depth-recurrent transformer with partial weight sharing across selected blocks
  • FiLM conditioning vectors to disambiguate repeated shared-block iterations
  • FiLM-only test-time training for shared blocks to avoid gradient compounding
  • Improved val_bpb with fewer parameters than the PR #549 baseline