PR #182

open

Non-record: Linearized Neural Memory + TTT (val_bpb=1.1844)

by mihir-s-05
val_bpb
1.1844
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.5 MB

Training Techniques

Architecture
linearized neural memory
Titans-inspired neural memory is added to each transformer block; the cumulative gradient update is linearized into causal linear attention via cumsum/einsum and used as a gated residual between attention and MLP.
parameters: {"layers":10,"params_per_layer_overhead":"~8k"}
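The cumsum/einsum linearization described above can be sketched as follows. This is a minimal illustration (in numpy, not the PR's actual PyTorch code) of how a cumulative outer-product memory state becomes causal linear attention; the function and variable names are assumptions.

```python
import numpy as np

def causal_linear_attention(q, k, v):
    """Causal linear attention: out_t = q_t @ (sum_{s<=t} k_s v_s^T).

    The running sum of outer products k_s v_s^T is the linearized
    cumulative memory state; computing it with a cumsum keeps the
    whole update parallel over time (no Python loop), which is what
    lets the block trace as a single compiled graph.
    Shapes: q, k, v are (T, d).
    """
    # Outer products k_s v_s^T for every timestep: (T, d, d)
    kv = np.einsum('td,te->tde', k, v)
    # Prefix sum over time = memory state available at each step
    state = np.cumsum(kv, axis=0)
    # Read the memory with the query at each step: (T, d)
    return np.einsum('td,tde->te', q, state)
```

A loop that rebuilds the memory state step by step produces the same output; the cumsum form just vectorizes it.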
Test-Time Training
LoRA TTT
parameters: {"rank":8,"targets":["Q","V","lm_head"]}
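A rank-8 LoRA adapter of the kind applied here to Q, V, and lm_head can be sketched as a frozen base weight plus a trainable low-rank delta. This is a generic LoRA sketch, not the PR's implementation; the class name and init scales are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank delta B @ A.

    Only A and B are updated during test-time training, so each
    adapted layer adds just rank * (d_in + d_out) parameters.
    """
    def __init__(self, W, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))              # zero init: delta starts at 0

    def __call__(self, x):
        # y = x W^T + (x A^T) B^T; identical to the base layer at init
        return x @ self.W.T + (x @ self.A.T) @ self.B.T
```

The zero-initialized B matrix means the adapter is a no-op until TTT updates it, so test-time training starts exactly from the pretrained model.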
Initialization
overtone spectral embedding init
Uses overtone spectral embedding initialization with phase-transition residual mixing.
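The PR does not show its init code, so the following is only a loose, hypothetical sketch of what a harmonic ("overtone") spectral embedding init could look like: each embedding row is a mix of a fundamental sinusoid and its integer-multiple overtones with random phases. Every name and constant here is an assumption; the actual recipe may differ substantially.

```python
import numpy as np

def overtone_embedding_init(vocab_size, dim, n_overtones=4, seed=0):
    """Hypothetical harmonic/spectral init: each row is a sum of a
    fundamental sinusoid plus its overtones (2f, 3f, ...) with random
    phases and 1/h amplitude falloff. Illustrative only."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2 * np.pi, dim, endpoint=False)
    E = np.zeros((vocab_size, dim))
    for i in range(vocab_size):
        freq = 1 + i % 8                      # per-token fundamental (arbitrary)
        for h in range(1, n_overtones + 1):   # overtone series: f, 2f, 3f, ...
            phase = rng.uniform(0, 2 * np.pi)
            E[i] += np.sin(h * freq * t + phase) / h
    return E / np.sqrt(n_overtones)           # keep rows at O(1) scale
```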
Quantization
int6
bits: 6
scope: middle layers
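Symmetric per-tensor int6 quantization, as applied here to the middle layers, can be sketched as follows. The scheme (symmetric, per-tensor scale) is an assumption; the PR may use per-channel scales or a different rounding rule.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: 6 bits => integer levels in [-31, 31].

    Returns the quantized integers plus the scale needed to dequantize.
    """
    qmax = 2 ** (6 - 1) - 1                   # 31
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Map the int6 codes back to floats for use in the forward pass."""
    return q.astype(np.float32) * scale
```

With a per-tensor scale, the round-trip error is bounded by half a quantization step (scale / 2) per element.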
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"decoupled":true}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings/scalars"}
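The core step that distinguishes Muon from AdamW is approximately orthogonalizing the (momentum-smoothed) gradient of each weight matrix before applying it. A minimal sketch of that Newton-Schulz iteration, using the coefficients from the commonly used Muon implementation (not necessarily this PR's exact code):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G with a quintic Newton-Schulz
    iteration, as in the Muon optimizer. The result (singular values
    pushed toward ~1) is used as the weight update direction."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)        # normalize: singular values <= 1
    if G.shape[0] > G.shape[1]:
        X = X.T                                # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X    # quintic polynomial update
    return X.T if G.shape[0] > G.shape[1] else X
```

AdamW is kept for embeddings and scalar parameters because orthogonalization only makes sense for 2-D weight matrices.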
LR Schedule
warmdown
parameters: {"warmdown_steps":2500}
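A warmdown schedule with warmdown_steps=2500 holds the learning rate constant and then decays it over the final steps. A minimal sketch, assuming linear decay to zero (the exact decay shape is not stated in the PR):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=2500):
    """Warmdown schedule sketch: constant base_lr, then linear decay
    to zero over the final `warmdown_steps` steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1 -> 0 across the warmdown
    return base_lr * frac
```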
Other
other
FP16 embedding bypass
parameters: null

Novel Contributions

  • Adds a Titans-inspired neural memory module to each transformer block
  • Linearizes the memory update into causal linear attention using cumsum and einsum, keeping the block compatible with fullgraph compilation
  • Places memory between attention and MLP as a gated residual
  • Combines the memory module with LoRA-based test-time training
  • Uses overtone spectral embedding initialization and phase-transition residual mixing
  • Applies FP16 embedding bypass and int6 quantization on middle layers
  • Uses Muon weight decay with AdamW for embeddings/scalars
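The block layout the contributions describe, with the memory module placed between the attention and MLP sublayers as a gated residual, can be sketched as below. The sublayers are stand-in callables and the scalar gate is an assumption; the PR may gate per channel or per head.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_forward(x, attn, memory, mlp, gate_param):
    """Hypothetical block layout: the neural-memory output is mixed
    in as a gated residual between the attention and MLP sublayers.
    `attn`, `memory`, and `mlp` stand in for the real sublayers."""
    x = x + attn(x)               # standard attention residual
    g = sigmoid(gate_param)       # learned gate, squashed into (0, 1)
    x = x + g * memory(x)         # gated neural-memory residual
    x = x + mlp(x)                # standard MLP residual
    return x
```

Because the gate starts near 0.5 (or wherever gate_param is initialized), the model can learn to scale the memory contribution up or down per block without disturbing the standard residual path.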