PR #988

closed

Record-track submission: 11L XSA4 + Late Shared Workspace Adapter (LSWA-64x4) + MLP2.5

by ymrohit
val_bpb: 1.0857
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,900,041 bytes

Training Techniques

Architecture
XSA
XSA applied on the last 4 decoder layers.
parameters: {"layers":4}
BigramHash
Bigram path retained from the donor line.
parameters: null
VE128
VE path retained on late layers.
parameters: {"layers":[9,10]}
MLP3x
Main-trunk MLP multiplier reduced to 2.5 to fit the workspace adapter under the size cap.
parameters: {"multiplier":2.5}
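To see what trimming the multiplier from the donor line's 3x to 2.5 buys under the size cap, here is a small illustrative calculation. The trunk width `d_model = 384` is an assumed example value, not taken from the submission; a two-matrix (up/down) feed-forward layout is also assumed.

```python
# Hypothetical illustration of how a fractional MLP multiplier shrinks the
# feed-forward hidden width. d_model is an assumed example, not from the PR.
d_model = 384                         # assumed trunk width
hidden_3x = 3 * d_model               # donor line's MLP3x hidden width
hidden_25x = int(2.5 * d_model)       # this submission's reduced width

# Parameters saved per layer, assuming a two-matrix MLP
# (d_model -> hidden and hidden -> d_model), biases ignored.
params_saved_per_layer = 2 * d_model * (hidden_3x - hidden_25x)
```

At these assumed sizes the trim frees roughly 147K parameters per layer, which is the budget the workspace adapter is traded against.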
other
Late Shared Workspace Adapter: a weight-shared token-to-workspace-to-token read/write-back module applied in the late decoder.
parameters: {"name":"LSWA-64x4","latent_channels":64,"workspace_slots":4,"heads":4,"think_steps":1,"active_from_block":5}
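The listed parameters suggest a read-refine-write-back module: a few learned workspace slots read from the token states, refine in a small latent space, and write back into the tokens. Below is a minimal, hypothetical PyTorch sketch; only `latent_channels`, `workspace_slots`, `heads`, and `think_steps` come from the parameters above, while the class name, projections, attention choice, and residual wiring are assumptions.

```python
import torch
import torch.nn as nn

class LateSharedWorkspaceAdapter(nn.Module):
    """Hypothetical sketch of LSWA-64x4: token states are read into a small
    latent workspace, refined, and written back into the token stream.
    All design choices beyond the listed hyperparameters are assumptions."""

    def __init__(self, d_model, latent_channels=64, workspace_slots=4,
                 heads=4, think_steps=1):
        super().__init__()
        self.think_steps = think_steps
        # Learned workspace slot embeddings act as queries for the read step.
        self.slots = nn.Parameter(torch.randn(workspace_slots, latent_channels) * 0.02)
        self.read_proj = nn.Linear(d_model, latent_channels)
        # Read: workspace slots attend over projected token states.
        self.read_attn = nn.MultiheadAttention(latent_channels, heads, batch_first=True)
        # Refine: a small per-slot MLP inside the latent workspace.
        self.refine = nn.Sequential(
            nn.Linear(latent_channels, 4 * latent_channels),
            nn.GELU(),
            nn.Linear(4 * latent_channels, latent_channels),
        )
        # Write-back: tokens attend over the refined workspace slots.
        self.write_attn = nn.MultiheadAttention(latent_channels, heads, batch_first=True)
        self.write_proj = nn.Linear(latent_channels, d_model)

    def forward(self, x):                                # x: (B, T, d_model)
        tok = self.read_proj(x)                          # (B, T, C)
        ws = self.slots.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, S, C)
        for _ in range(self.think_steps):
            ws = ws + self.read_attn(ws, tok, tok, need_weights=False)[0]
            ws = ws + self.refine(ws)
        delta = self.write_attn(tok, ws, ws, need_weights=False)[0]  # (B, T, C)
        return x + self.write_proj(delta)                # residual write-back
```

Since the submission describes the adapter as shared across late decoder sites, a single instance of this module would be applied after every block from `active_from_block` (5) onward, rather than one copy per block.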
Test-Time Training
score-first TTT
parameters: null
Evaluation
exact post-quant eval
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null

Novel Contributions

  • Late Shared Workspace Adapter (LSWA-64x4) with shared late writeback
  • Token states are read into a compact latent workspace, refined there, and written back into the token stream
  • Shared adapter weights reused across late decoder sites
  • MLP multiplier trimmed to 2.5 to keep the model under the 16MB cap
  • Exact post-quantization evaluation, with the trainer packaged in the record folder
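As a sanity check on the headroom the 2.5x MLP trim had to create, the artifact size can be compared against the cap. The binary (16 MiB) reading of "16MB" is an assumption; the artifact also fits under a decimal 16,000,000-byte reading.

```python
# Assumption: the "16MB cap" means 16 MiB = 16 * 2**20 bytes.
CAP_BYTES = 16 * 2**20            # 16,777,216
artifact_bytes = 15_900_041       # from the submission metadata
headroom = CAP_BYTES - artifact_bytes
under_cap = artifact_bytes < CAP_BYTES
```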