PR #555

closed

Add 11L Shared Sparse Sidecar + EMA + AdamW TTT (1.0916 mean)

by ymrohit
val_bpb
1.0916
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,973,374 bytes

Training Techniques

Architecture
SmearGate
Uses SmearGate in the donor trunk.
parameters: null
BigramHash
Uses BigramHash in the donor trunk.
parameters: null
MLP3x
Uses a 3x MLP expansion in the donor trunk.
parameters: null
shared sparse sidecar
A late-stage auxiliary sidecar whose weights are reused across multiple late layers, with learned per-site embeddings and residual scales; implemented as gate -> value -> depthwise conv -> proj.
parameters: {"start_layer":8,"hidden_dim":48}
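The sidecar above can be sketched in plain Python. This is a minimal illustration, not the submission's code: the model width, initialization, toy dimensions, and the single shared depthwise kernel are assumptions (the submission uses hidden_dim=48 on a much wider trunk). The key ideas it shows are the shared weights reused at every insertion site, per-site embeddings and residual scales, and the gate -> value -> depthwise conv -> proj path.

```python
import math, random

random.seed(0)
D, H = 16, 4  # toy model width and sidecar hidden dim (submission: hidden_dim=48)

def matvec(W, x):
    # W: rows x cols matrix as nested lists, x: vector of length cols
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_mat(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class SharedSidecar:
    """One set of sidecar weights, reused at every late-layer insertion site."""
    def __init__(self, n_sites):
        self.Wg = rand_mat(H, D)             # gate projection
        self.Wv = rand_mat(H, D)             # value projection
        self.Wo = rand_mat(D, H)             # output projection back to model width
        self.kernel = [0.25, 0.5, 0.25]      # causal depthwise conv kernel (shared across channels here for brevity)
        self.site_emb = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(n_sites)]
        self.site_scale = [0.1] * n_sites    # learned residual scale per insertion site

    def __call__(self, xs, site):
        # xs: list of per-token vectors (one sequence); site: which insertion point
        emb = self.site_emb[site]
        h = []
        for x in xs:
            xe = [xi + ei for xi, ei in zip(x, emb)]            # site conditioning
            g = [1 / (1 + math.exp(-v)) for v in matvec(self.Wg, xe)]  # gate
            v = matvec(self.Wv, xe)                             # value
            h.append([gi * vi for gi, vi in zip(g, v)])
        # causal depthwise conv over time, per hidden channel
        conv = []
        for t in range(len(h)):
            row = []
            for c in range(H):
                acc = 0.0
                for j, kj in enumerate(self.kernel):
                    tt = t - (len(self.kernel) - 1 - j)
                    if tt >= 0:
                        acc += kj * h[tt][c]
                row.append(acc)
            conv.append(row)
        s = self.site_scale[site]
        # residual add, scaled per site
        return [[xi + s * oi for xi, oi in zip(x, matvec(self.Wo, c))]
                for x, c in zip(xs, conv)]
```

Reusing one module across sites keeps the artifact small; the per-site embedding and scale are the only weights that grow with the number of insertion sites.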
Weight Averaging
EMA
parameters: {"decay":0.997}
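EMA with decay 0.997 maintains a shadow copy of the weights, updated each step as shadow = decay * shadow + (1 - decay) * current; the shadow weights are the ones evaluated. A minimal sketch over a flat parameter list (checkpoint handling omitted):

```python
class EMA:
    """Exponential moving average of parameters (decay = 0.997 in this submission)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # averaged copy, used at eval time

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```

With decay 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 optimizer steps.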
Test-Time Training
AdamW TTT
parameters: {"epochs":10}
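AdamW test-time training fine-tunes the weights on the evaluation text itself for a fixed number of passes (epochs: 10 here) before scoring. The source gives only the epoch count, so the sketch below shows a standard single-parameter AdamW update with decoupled weight decay; all hyperparameters are illustrative, not the submission's:

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update: Adam moment estimates plus decoupled weight decay.

    p: parameter, g: gradient, m/v: first/second moment state, t: step (1-based).
    """
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)   # bias correction
    vhat = v / (1 - b2 ** t)
    p = p - lr * (mhat / (math.sqrt(vhat) + eps) + wd * p)
    return p, m, v

# TTT loop shape (schematic): for epoch in range(10), compute gradients on the
# eval document and apply adamw_step to each parameter.
```

Decoupled weight decay (the "W" in AdamW) applies decay directly to the parameter rather than folding it into the gradient, which matters when the adaptive step sizes vary widely across parameters.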
Evaluation
sliding window eval
parameters: {"stride":64}
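Sliding window evaluation with stride 64 advances a fixed-length context window 64 tokens at a time and scores only the newly exposed tokens, so every token after the first window is scored with near-full left context. A sketch of the span bookkeeping (the window length is an assumption; only the stride is given):

```python
def eval_spans(n_tokens, window=2048, stride=64):
    """Yield (ctx_start, ctx_end, n_scored) for strided sliding-window eval.

    The first window scores all its tokens; each later window scores only
    the `stride` tokens not covered by the previous window.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes (window / stride ≈ 32x here, under the assumed window).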
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
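The warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_iters=3000 steps. A sketch (the total step count and base LR are placeholders, not the submission's values):

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=3000):
    """Constant LR, then linear decay to zero over the last warmdown_iters steps."""
    if step < total_steps - warmdown_iters:
        return base_lr
    frac = (total_steps - step) / warmdown_iters  # 1.0 at warmdown start, 0.0 at the end
    return base_lr * frac
```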
Compression
zstd
level: null
Other
other
Cloud-legal trimming to fit under the 16,000,000-byte cap by reducing sidecar width, bigram width, and wallclock budget.
parameters: {"sparse_hidden_dim":{"from":64,"to":48},"bigram_dim":{"from":128,"to":96},"max_wallclock_seconds":{"from":600,"to":596}}

Novel Contributions

  • Shared sparse sidecar architecture injected only in late layers
  • Shared sidecar weights reused across multiple insertion sites
  • Learned site embeddings and learned residual scales for site-specific conditioning
  • Late local-refinement path implemented as gate -> value -> depthwise conv -> proj
  • Cloud-legal deployment of the sidecar under the 16MB artifact cap
  • 3-seed cloud reproduction with mean val_bpb 1.09161722