PR #988

closed

Record-track submission: 11L XSA4 + Late Shared Workspace Adapter (LSWA-64x4) + MLP2.5

by ymrohit
val_bpb: 1.0857
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,900,041 bytes

Training Techniques

Architecture
XSA
XSA applied on the last 4 decoder layers.
parameters: {"layers":4}
BigramHash
Bigram path retained from the donor line.
parameters: null
VE128
VE path retained on late layers.
parameters: {"layers":[9,10]}
MLP3x
Main-trunk MLP multiplier reduced to 2.5 to fit the workspace adapter under the size cap.
parameters: {"multiplier":2.5}
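To see what trimming the multiplier from the donor line's 3x to 2.5 buys under the size cap, here is a small illustrative calculation. The trunk width `d_model = 384` is an assumed example value, not taken from the submission; a two-matrix (up/down) feed-forward layout is also assumed.

```python
# Hypothetical illustration of how a fractional MLP multiplier shrinks the
# feed-forward hidden width. d_model is an assumed example, not from the PR.
d_model = 384                         # assumed trunk width
hidden_3x = 3 * d_model               # donor line's MLP3x hidden width
hidden_25x = int(2.5 * d_model)       # this submission's reduced width

# Parameters saved per layer, assuming a two-matrix MLP
# (d_model -> hidden and hidden -> d_model), biases ignored.
params_saved_per_layer = 2 * d_model * (hidden_3x - hidden_25x)
```

At these assumed sizes the trim frees roughly 147K parameters per layer, which is the budget the workspace adapter is traded against.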
other
Late Shared Workspace Adapter: a weight-shared token-to-workspace-to-token read/write-back module applied in the late decoder.
parameters: {"name":"LSWA-64x4","latent_channels":64,"workspace_slots":4,"heads":4,"think_steps":1,"active_from_block":5}
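The listed parameters suggest a read-refine-write-back module: a few learned workspace slots read from the token states, refine in a small latent space, and write back into the tokens. Below is a minimal, hypothetical PyTorch sketch; only `latent_channels`, `workspace_slots`, `heads`, and `think_steps` come from the parameters above, while the class name, projections, attention choice, and residual wiring are assumptions.

```python
import torch
import torch.nn as nn

class LateSharedWorkspaceAdapter(nn.Module):
    """Hypothetical sketch of LSWA-64x4: token states are read into a small
    latent workspace, refined, and written back into the token stream.
    All design choices beyond the listed hyperparameters are assumptions."""

    def __init__(self, d_model, latent_channels=64, workspace_slots=4,
                 heads=4, think_steps=1):
        super().__init__()
        self.think_steps = think_steps
        # Learned workspace slot embeddings act as queries for the read step.
        self.slots = nn.Parameter(torch.randn(workspace_slots, latent_channels) * 0.02)
        self.read_proj = nn.Linear(d_model, latent_channels)
        # Read: workspace slots attend over projected token states.
        self.read_attn = nn.MultiheadAttention(latent_channels, heads, batch_first=True)
        # Refine: a small per-slot MLP inside the latent workspace.
        self.refine = nn.Sequential(
            nn.Linear(latent_channels, 4 * latent_channels),
            nn.GELU(),
            nn.Linear(4 * latent_channels, latent_channels),
        )
        # Write-back: tokens attend over the refined workspace slots.
        self.write_attn = nn.MultiheadAttention(latent_channels, heads, batch_first=True)
        self.write_proj = nn.Linear(latent_channels, d_model)

    def forward(self, x):                                # x: (B, T, d_model)
        tok = self.read_proj(x)                          # (B, T, C)
        ws = self.slots.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, S, C)
        for _ in range(self.think_steps):
            ws = ws + self.read_attn(ws, tok, tok, need_weights=False)[0]
            ws = ws + self.refine(ws)
        delta = self.write_attn(tok, ws, ws, need_weights=False)[0]  # (B, T, C)
        return x + self.write_proj(delta)                # residual write-back
```

Since the submission describes the adapter as shared across late decoder sites, a single instance of this module would be applied after every block from `active_from_block` (5) onward, rather than one copy per block.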
Test-Time Training
score-first TTT
parameters: null
Evaluation
exact post-quant eval
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null

Novel Contributions

  • Late Shared Workspace Adapter (LSWA-64x4) with shared late writeback
  • Token states are read into a compact latent workspace, refined there, and written back into the token stream
  • Shared adapter weights reused across late decoder sites
  • MLP multiplier trimmed to 2.5 to keep the model under the 16MB cap
  • Exact post-quantization evaluation, with the trainer packaged in the record folder
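As a sanity check on the headroom the 2.5x MLP trim had to create, the artifact size can be compared against the cap. The binary (16 MiB) reading of "16MB" is an assumption; the artifact also fits under a decimal 16,000,000-byte reading.

```python
# Assumption: the "16MB cap" means 16 MiB = 16 * 2**20 bytes.
CAP_BYTES = 16 * 2**20            # 16,777,216
artifact_bytes = 15_900_041       # from the submission metadata
headroom = CAP_BYTES - artifact_bytes
under_cap = artifact_bytes < CAP_BYTES
```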