PR #1383

open

Non-record: Neuromodulatory Depth-Recurrent Transformer with FiLM-only TTT (WIP, val_bpb=1.3151)

by nirmathur

val_bpb

1.3151

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

12.87 MB

Training Techniques

Architecture

depth recurrence

Shares transformer block weights across repeated virtual layers to reduce parameters while preserving depth.

parameters: {"physical_blocks":9,"virtual_layers":11,"shared_blocks":["3-4","9-10"]}
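
The sharing layout above can be read as a map from virtual layers to physical blocks. The following is a minimal sketch, assuming layers are 0-indexed and that a pair such as "3-4" means virtual layer 4 reuses the block of virtual layer 3; the function name build_layer_map is hypothetical and not taken from the PR.

```python
def build_layer_map(virtual_layers, shared_pairs):
    """Map each virtual layer index to the physical block that executes it."""
    # Assumed reading of "3-4": the second layer reuses the first layer's block.
    follower_to_leader = {}
    for pair in shared_pairs:
        leader, follower = (int(part) for part in pair.split("-"))
        follower_to_leader[follower] = leader

    layer_map = []
    next_physical = 0
    for v in range(virtual_layers):
        if v in follower_to_leader:
            # Reuse the block already assigned to the leader layer.
            layer_map.append(layer_map[follower_to_leader[v]])
        else:
            # Allocate a fresh physical block.
            layer_map.append(next_physical)
            next_physical += 1
    return layer_map, next_physical


layer_map, physical_blocks = build_layer_map(11, ["3-4", "9-10"])
print(layer_map, physical_blocks)
```

Under these assumptions, 11 virtual layers with two shared pairs resolve to 9 physical blocks, which agrees with the physical_blocks value listed above.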

weight tying

Partial weight sharing between selected transformer blocks.

parameters: {"shared_pairs":["3-4","9-10"]}

FiLM

Per-loop scale/shift conditioning vectors used to distinguish repeated executions of shared blocks.

parameters: {"pairs":4}
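
To illustrate how scale/shift vectors can tell apart two passes through the same weights, here is a toy sketch in plain Python. The stand-in shared_block, the vector width of 2, the specific gamma/beta values, and applying FiLM to the block output are all assumptions for illustration; the PR may apply the modulation at a different point.

```python
def film(x, gamma, beta):
    """Feature-wise linear modulation: elementwise scale, then shift."""
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]


def shared_block(x):
    """Hypothetical stand-in for a weight-shared transformer block."""
    return [2.0 * xi for xi in x]


# One (gamma, beta) pair per execution of the shared block (assumed layout).
film_pairs = [
    ([1.0, 1.0], [0.0, 0.0]),   # first pass: identity modulation
    ([0.5, 1.5], [0.1, -0.1]),  # second pass: distinct modulation
]

x = [1.0, 2.0]
outputs = []
for gamma, beta in film_pairs:
    # Same block weights on every pass; only the FiLM vectors differ.
    x = film(shared_block(x), gamma, beta)
    outputs.append(x)
print(outputs)
```

Initializing gamma to ones and beta to zeros makes the first pass behave exactly like the unmodulated block, which is a common starting point for FiLM-style parameters.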

LeakyReLU

LeakyReLU squared activation used in the base stack.

parameters: {"slope":0.5}
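
As a scalar sketch of this activation, assuming "LeakyReLU squared" means squaring the LeakyReLU output with the slope listed above (the function name is illustrative):

```python
def leaky_relu_squared(x, slope=0.5):
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0 else slope * x
    return y * y


for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(value, leaky_relu_squared(value))
```

With this reading, negative inputs still contribute a nonzero (positive) activation, scaled by slope squared relative to a positive input of the same magnitude.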

XSA

XSA applied to the last 4 virtual layers.

parameters: {"last_n":4}

BigramHash

Bigram hash embedding component in the base stack.

parameters: {"vocab_size":1536}
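
A bigram hash embedding typically maps each (previous token, current token) pair to one of a fixed number of embedding rows. The sketch below only shows the bucketing step; the hash formula and its multiplier are hypothetical placeholders, since the PR summary does not state which hash is used.

```python
BIGRAM_VOCAB = 1536  # bucket count from the parameters above


def bigram_bucket(prev_token, cur_token, num_buckets=BIGRAM_VOCAB):
    """Hash a token pair into a bucket index (illustrative hash, not the PR's)."""
    return (prev_token * 1000003 + cur_token) % num_buckets


tokens = [17, 942, 3, 3, 511]
# One bucket per adjacent pair; the bucket would index a small embedding table.
buckets = [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
print(buckets)
```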

VE128

Value enhancement module enabled on later layers.

parameters: {"dim":128,"layers":[9,10]}

U-Net skip connections

Skip connections retained from the PR #549 stack.

parameters: null

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_every":50}
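
The two averaging rules can be sketched on a single scalar weight. This assumes a standard exponential moving average updated every step and an equal-weight running mean of snapshots taken every swa_every steps; how the PR combines the two averages is not stated here, and the decaying weight is a stand-in for real optimizer updates.

```python
EMA_DECAY = 0.997
SWA_EVERY = 50


def ema_update(ema, weights, decay=EMA_DECAY):
    """Exponential moving average of the weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def swa_update(swa, weights, n_snapshots):
    """Running mean over snapshots; n_snapshots counts those already averaged."""
    return [s + (w - s) / (n_snapshots + 1) for s, w in zip(swa, weights)]


weights = [1.0]
ema = list(weights)
swa, n_snapshots = [0.0], 0
for step in range(1, 201):
    weights = [weights[0] * 0.99]  # stand-in for an optimizer step
    ema = ema_update(ema, weights)
    if step % SWA_EVERY == 0:
        swa = swa_update(swa, weights, n_snapshots)
        n_snapshots += 1
print(ema, swa, n_snapshots)
```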

Quantization

int6 QAT

bits: 6

scope: model
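
For intuition about 6-bit weights, the sketch below quantizes one row symmetrically to integer codes in [-31, 31] and dequantizes it again. The per-row scale and the symmetric range are assumptions; the PR summary only states 6 bits with model-wide scope. In quantization-aware training the forward pass would typically use the dequantized values while gradients flow to the full-precision weights.

```python
def fake_quant_int6(row):
    """Symmetric per-row quantization to 6-bit codes, then dequantization."""
    qmax = 2 ** (6 - 1) - 1  # 31
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into the 6-bit symmetric range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    dequant = [c * scale for c in codes]
    return codes, dequant, scale


row = [0.31, -0.10, 0.004, -0.31]
codes, dequant, scale = fake_quant_int6(row)
print(codes, scale)
```

The rounding error per weight is bounded by half the scale, so rows with a smaller dynamic range lose less precision.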

Evaluation

sliding window eval

parameters: {"stride":64}
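
Sliding window evaluation scores each token once while giving it as much left context as the window allows. The sketch below enumerates windows as (begin, end, score_from) triples; the context length of 128 and the token count of 300 are hypothetical values chosen to keep the example small.

```python
def sliding_windows(n_tokens, seq_len, stride=64):
    """Return (begin, end, score_from) triples covering every token once."""
    windows = []
    begin, scored_upto = 0, 0
    while scored_upto < n_tokens:
        end = min(begin + seq_len, n_tokens)
        # Tokens in [score_from, end) are scored; earlier ones are context only.
        windows.append((begin, end, scored_upto))
        scored_upto = end
        begin += stride
    return windows


windows = sliding_windows(n_tokens=300, seq_len=128, stride=64)
for w in windows:
    print(w)
```

After the first window, each step scores only the newest stride tokens, so a smaller stride gives more context per scored token at the cost of more forward passes.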

Test-Time Training

FiLM-only TTT

parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"momentum":0.9}
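
The idea of adapting only the FiLM parameters at test time can be shown on a toy scalar model. In this sketch the "model" is gamma * frozen_w * x + beta with a squared-error loss, one epoch is one full-batch step, and the whole toy dataset plays the role of a single chunk; all of that is an illustrative assumption. Only the hyperparameters (learning rate, epochs, momentum) come from the listing above.

```python
def ttt_adapt_film(gamma, beta, frozen_w, xs, ys, lr=0.002, epochs=3, momentum=0.9):
    """SGD with momentum on the FiLM scalars only; frozen_w is never updated."""
    v_gamma = v_beta = 0.0
    for _ in range(epochs):
        errs = [gamma * frozen_w * x + beta - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to gamma and beta.
        g_gamma = sum(2.0 * e * frozen_w * x for e, x in zip(errs, xs)) / len(xs)
        g_beta = sum(2.0 * e for e in errs) / len(xs)
        v_gamma = momentum * v_gamma + g_gamma
        v_beta = momentum * v_beta + g_beta
        gamma -= lr * v_gamma
        beta -= lr * v_beta
    return gamma, beta


def mse(gamma, beta, frozen_w, xs, ys):
    return sum((gamma * frozen_w * x + beta - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
frozen_w = 1.0
before = mse(1.0, 0.0, frozen_w, xs, ys)
gamma, beta = ttt_adapt_film(1.0, 0.0, frozen_w, xs, ys)
after = mse(gamma, beta, frozen_w, xs, ys)
print(before, after)
```

Because the shared weight is untouched, an update cannot be amplified by being applied once per loop iteration, which is one plausible reading of the "gradient compounding" concern named in the contributions.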

Compression

lzma

level: null
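
As a rough sketch of the compression step, the snippet below packs synthetic 6-bit codes into bytes and runs them through Python's standard lzma module with its default preset, since no level is specified above. The synthetic codes and the one-code-per-byte packing are assumptions; the PR's actual serialization format is not described here.

```python
import lzma
import struct

# Synthetic stand-in for quantized weights in the 6-bit range [-31, 31].
codes = [((i * 7) % 63) - 31 for i in range(4096)]

# Store one signed byte per code, then compress with the default preset.
raw = struct.pack(f"{len(codes)}b", *codes)
packed = lzma.compress(raw)

# Round-trip to confirm the compression is lossless.
restored = list(struct.unpack(f"{len(codes)}b", lzma.decompress(packed)))
print(len(raw), len(packed))
```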

LR Schedule

warmdown

parameters: {"warmdown_iters":3500}
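
A warmdown schedule usually holds the learning rate constant and then decays it over the final iterations. The sketch below assumes a linear decay to zero over the last warmdown_iters steps; the total of 10000 iterations is a hypothetical value, and the PR may use a different decay shape or a time-based trigger.

```python
def lr_scale(step, total_iters, warmdown_iters=3500):
    """Learning-rate multiplier: 1.0, then linear decay over the final steps."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)


for step in (0, 6500, 8250, 10000):
    print(step, lr_scale(step, total_iters=10000))
```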

Regularization

LN scale

parameters: null

Optimizer

Parallel Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
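
The momentum warmup parameters above suggest a ramp from 0.92 to the final momentum of 0.99 over the first 1500 steps. The sketch below assumes linear interpolation, which is a common choice in Muon training scripts but is not confirmed by this summary.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly interpolate momentum from start to end, then hold at end."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end


for step in (0, 750, 1500, 5000):
    print(step, muon_momentum(step))
```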

Adam

weight_decay: 0.04

momentum: null

other_params: {"used_for":"FiLM parameters"}

Novel Contributions

Depth-recurrent transformer with partial weight sharing across selected blocks
FiLM conditioning vectors to disambiguate repeated shared-block iterations
FiLM-only test-time training for shared blocks to avoid gradient compounding
Improved val_bpb with fewer parameters than the PR #549 baseline