PR #555

closed

Add 11L Shared Sparse Sidecar + EMA + AdamW TTT (1.0916 mean)

by ymrohit
val_bpb
1.0916
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,973,374 bytes

Training Techniques

Architecture
SmearGate
Uses SmearGate in the donor trunk.
parameters: null
BigramHash
Uses BigramHash in the donor trunk.
parameters: null
MLP3x
Uses a 3x MLP expansion in the donor trunk.
parameters: null
shared sparse sidecar
A late-stage auxiliary sidecar whose weights are reused across multiple late layers, with learned per-site embeddings and residual scales; implemented as gate -> value -> depthwise conv -> proj.
parameters: {"start_layer":8,"hidden_dim":48}
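The sidecar above can be sketched in plain Python. This is a minimal illustration, not the submission's code: the model width, initialization, toy dimensions, and the single shared depthwise kernel are assumptions (the submission uses hidden_dim=48 on a much wider trunk). The key ideas it shows are the shared weights reused at every insertion site, per-site embeddings and residual scales, and the gate -> value -> depthwise conv -> proj path.

```python
import math, random

random.seed(0)
D, H = 16, 4  # toy model width and sidecar hidden dim (submission: hidden_dim=48)

def matvec(W, x):
    # W: rows x cols matrix as nested lists, x: vector of length cols
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_mat(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class SharedSidecar:
    """One set of sidecar weights, reused at every late-layer insertion site."""
    def __init__(self, n_sites):
        self.Wg = rand_mat(H, D)             # gate projection
        self.Wv = rand_mat(H, D)             # value projection
        self.Wo = rand_mat(D, H)             # output projection back to model width
        self.kernel = [0.25, 0.5, 0.25]      # causal depthwise conv kernel (shared across channels here for brevity)
        self.site_emb = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(n_sites)]
        self.site_scale = [0.1] * n_sites    # learned residual scale per insertion site

    def __call__(self, xs, site):
        # xs: list of per-token vectors (one sequence); site: which insertion point
        emb = self.site_emb[site]
        h = []
        for x in xs:
            xe = [xi + ei for xi, ei in zip(x, emb)]            # site conditioning
            g = [1 / (1 + math.exp(-v)) for v in matvec(self.Wg, xe)]  # gate
            v = matvec(self.Wv, xe)                             # value
            h.append([gi * vi for gi, vi in zip(g, v)])
        # causal depthwise conv over time, per hidden channel
        conv = []
        for t in range(len(h)):
            row = []
            for c in range(H):
                acc = 0.0
                for j, kj in enumerate(self.kernel):
                    tt = t - (len(self.kernel) - 1 - j)
                    if tt >= 0:
                        acc += kj * h[tt][c]
                row.append(acc)
            conv.append(row)
        s = self.site_scale[site]
        # residual add, scaled per site
        return [[xi + s * oi for xi, oi in zip(x, matvec(self.Wo, c))]
                for x, c in zip(xs, conv)]
```

Reusing one module across sites keeps the artifact small; the per-site embedding and scale are the only weights that grow with the number of insertion sites.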
Weight Averaging
EMA
parameters: {"decay":0.997}
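EMA with decay 0.997 maintains a shadow copy of the weights, updated each step as shadow = decay * shadow + (1 - decay) * current; the shadow weights are the ones evaluated. A minimal sketch over a flat parameter list (checkpoint handling omitted):

```python
class EMA:
    """Exponential moving average of parameters (decay = 0.997 in this submission)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # averaged copy, used at eval time

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```

With decay 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 optimizer steps.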
Test-Time Training
AdamW TTT
parameters: {"epochs":10}
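AdamW test-time training fine-tunes the weights on the evaluation text itself for a fixed number of passes (epochs: 10 here) before scoring. The source gives only the epoch count, so the sketch below shows a standard single-parameter AdamW update with decoupled weight decay; all hyperparameters are illustrative, not the submission's:

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update: Adam moment estimates plus decoupled weight decay.

    p: parameter, g: gradient, m/v: first/second moment state, t: step (1-based).
    """
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)   # bias correction
    vhat = v / (1 - b2 ** t)
    p = p - lr * (mhat / (math.sqrt(vhat) + eps) + wd * p)
    return p, m, v

# TTT loop shape (schematic): for epoch in range(10), compute gradients on the
# eval document and apply adamw_step to each parameter.
```

Decoupled weight decay (the "W" in AdamW) applies decay directly to the parameter rather than folding it into the gradient, which matters when the adaptive step sizes vary widely across parameters.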
Evaluation
sliding window eval
parameters: {"stride":64}
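Sliding window evaluation with stride 64 advances a fixed-length context window 64 tokens at a time and scores only the newly exposed tokens, so every token after the first window is scored with near-full left context. A sketch of the span bookkeeping (the window length is an assumption; only the stride is given):

```python
def eval_spans(n_tokens, window=2048, stride=64):
    """Yield (ctx_start, ctx_end, n_scored) for strided sliding-window eval.

    The first window scores all its tokens; each later window scores only
    the `stride` tokens not covered by the previous window.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes (window / stride ≈ 32x here, under the assumed window).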
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
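The warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_iters=3000 steps. A sketch (the total step count and base LR are placeholders, not the submission's values):

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=3000):
    """Constant LR, then linear decay to zero over the last warmdown_iters steps."""
    if step < total_steps - warmdown_iters:
        return base_lr
    frac = (total_steps - step) / warmdown_iters  # 1.0 at warmdown start, 0.0 at the end
    return base_lr * frac
```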
Compression
zstd
level: null
Other
other
Cloud-legal trimming to fit under the 16,000,000-byte cap by reducing sidecar width, bigram width, and wallclock budget.
parameters: {"sparse_hidden_dim":{"from":64,"to":48},"bigram_dim":{"from":128,"to":96},"max_wallclock_seconds":{"from":600,"to":596}}

Novel Contributions

  • Shared sparse sidecar architecture injected only in late layers
  • Shared sidecar weights reused across multiple insertion sites
  • Learned site embeddings and learned residual scales for site-specific conditioning
  • Late local-refinement path implemented as gate -> value -> depthwise conv -> proj
  • Cloud-legal deployment of the sidecar under the 16MB artifact cap
  • 3-seed cloud reproduction with mean val_bpb 1.09161722