val_bpb: 1.0916
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,973,374 bytes
Training Techniques
Architecture
- SmearGate: uses SmearGate in the donor trunk (parameters: null)
- BigramHash: uses BigramHash in the donor trunk (parameters: null)
- MLP3x: uses a 3x MLP expansion in the donor trunk (parameters: null)
- shared sparse sidecar: a late-stage auxiliary sidecar reused across multiple late layers, with learned site embeddings and residual scales, implemented as gate -> value -> depthwise conv -> proj (parameters: {"start_layer": 8, "hidden_dim": 48})
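A minimal NumPy sketch of the shared-sidecar idea described above. Only hidden_dim=48 and start_layer=8 are reported; the model width, number of insertion sites, kernel width, and all tensor names here are illustrative assumptions, not the actual implementation.

```python
import numpy as np

# One set of sidecar weights shared by every late-layer insertion site; each
# site adds only a learned site embedding and a learned residual scale.
rng = np.random.default_rng(0)
d_model, hidden = 96, 48   # hidden_dim=48 as reported; d_model is assumed
n_sites = 4                # e.g. sites from start_layer=8 onward (assumed)
kernel = 3                 # depthwise conv width (assumed)

W_gate = rng.normal(0, 0.02, (d_model, hidden))
W_val = rng.normal(0, 0.02, (d_model, hidden))
W_dw = rng.normal(0, 0.02, (hidden, kernel))    # one kernel per channel
W_proj = rng.normal(0, 0.02, (hidden, d_model))
site_emb = rng.normal(0, 0.02, (n_sites, hidden))
site_scale = np.full(n_sites, 0.1)              # learned residual scales

def sidecar(x, site):
    """Apply gate -> value -> depthwise conv -> proj to x: (T, d_model)."""
    g = 1.0 / (1.0 + np.exp(-(x @ W_gate)))     # gate (sigmoid, assumed)
    v = x @ W_val + site_emb[site]              # value, site-conditioned
    h = g * v                                   # gated hidden (T, hidden)
    # causal depthwise conv over time, one kernel per channel
    pad = np.concatenate([np.zeros((kernel - 1, hidden)), h], axis=0)
    conv = sum(W_dw[:, k] * pad[k:k + len(h)] for k in range(kernel))
    return x + site_scale[site] * (conv @ W_proj)  # scaled residual add

x = rng.normal(size=(16, d_model))
y = sidecar(x, site=0)
```

Sharing one weight set across sites keeps the artifact small; only the per-site embeddings and scales grow with the number of insertion sites.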
Weight Averaging
- EMA (parameters: {"decay": 0.997})
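A minimal sketch of EMA weight averaging with the reported decay of 0.997; the parameter-dict representation is illustrative.

```python
# Exponential moving average of model weights: the EMA copy trails the live
# weights and is the set actually used for evaluation.
decay = 0.997  # as reported

def ema_update(ema_params, params, decay=decay):
    return {k: decay * ema_params[k] + (1 - decay) * params[k]
            for k in params}

ema = {"w": 0.0}
for step in range(1, 4):
    ema = ema_update(ema, {"w": float(step)})  # live weight drifts upward
```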
Test-Time Training
- AdamW TTT (parameters: {"epochs": 10})
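A hedged sketch of AdamW-style test-time training: before scoring held-out data, the model takes a few AdamW steps on that data itself. Only epochs=10 is reported; the hand-rolled optimizer, toy quadratic loss, and all hyperparameters below are illustrative assumptions.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update with decoupled weight decay (assumed settings)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

# Toy stand-in for adapting parameters on the eval stream.
w = np.array([1.0, -2.0])
target = np.array([0.5, 0.5])
m = v = np.zeros_like(w)
for epoch in range(10):           # epochs: 10 as reported
    g = 2 * (w - target)          # gradient of ||w - target||^2
    w, m, v = adamw_step(w, g, m, v, t=epoch + 1)
```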
Evaluation
- sliding window eval (parameters: {"stride": 64})
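A sketch of how sliding-window evaluation with the reported stride of 64 can be laid out: each window re-reads overlapping context, but only the final stride tokens are scored, so every token is evaluated once with long context. The helper name and window default are assumptions.

```python
def window_spans(n_tokens, window=2048, stride=64):
    """Yield (context_start, context_end, first_scored_token) per window."""
    spans = []
    start = 0
    while start < n_tokens:
        lo = max(0, start + stride - window)       # context reaches back
        hi = min(start + stride, n_tokens)
        spans.append((lo, hi, start))              # score [start, hi) only
        start += stride
    return spans
```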
Sequence Length
- sequence_length (train_length: 2048, eval_length: null)
LR Schedule
- warmdown (parameters: {"warmdown_iters": 3000})
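A sketch of a warmdown schedule: constant learning rate, then a linear decay to zero over the final warmdown_iters=3000 steps (the reported value). The total-iteration count and function name are illustrative placeholders.

```python
def warmdown_lr(step, base_lr, total_iters, warmdown_iters=3000):
    """Constant LR, then linear warmdown to zero over the last steps."""
    if step < total_iters - warmdown_iters:
        return base_lr
    remaining = total_iters - step
    return base_lr * max(remaining, 0) / warmdown_iters
```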
Compression
- zstd (level: null)
Other
- Cloud-legal trimming to fit under the 16,000,000-byte cap by reducing sidecar width, bigram width, and wallclock budget (parameters: {"sparse_hidden_dim": {"from": 64, "to": 48}, "bigram_dim": {"from": 128, "to": 96}, "max_wallclock_seconds": {"from": 600, "to": 596}})
Novel Contributions
- Shared sparse sidecar architecture injected only in late layers
- Shared sidecar weights reused across multiple insertion sites
- Learned site embeddings and learned residual scales for site-specific conditioning
- Late local-refinement path implemented as gate -> value -> depthwise conv -> proj
- Cloud-legal deployment of the sidecar under the 16MB artifact cap
- 3-seed cloud reproduction with mean val_bpb 1.09161722