val_bpb: 1.0611
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques

Architecture
- weight tying: Replaces the static tied embedding with a seed table plus hypernetworks that generate input embeddings and output projection weights from token seeds and causal context (see the sketch after this list). parameters: {"d_seed":64,"d_hidden":256,"ctx_window":3}
- other: Use-theoretic embedding layer with a causal, context-conditioned input hypernetwork and a regenerated output projection. parameters: {"vocab_size":8192,"d_model":512}
- BigramHash: Referenced as part of the existing stack and prior-work comparison. parameters: null
- SmearGate: Referenced as part of the existing PR #1855 stack. parameters: null
- XSA: Referenced as part of the existing PR #1855 stack. parameters: null
- Gated Attention: Referenced as part of the existing PR #1855 stack. parameters: null
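A minimal sketch of such a module, assuming a two-layer MLP for each hypernetwork and the parameters listed above (vocab_size=8192, d_model=512, d_seed=64, d_hidden=256, ctx_window=3). The class and method names are hypothetical, and the exact hypernetwork shapes are not given in the submission:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypernetEmbedding(nn.Module):
    """Sketch: V x d_seed seed table plus hypernets in place of a V x d_model tied embedding."""

    def __init__(self, vocab_size=8192, d_model=512, d_seed=64, d_hidden=256, ctx_window=3):
        super().__init__()
        self.ctx_window = ctx_window
        # Small V x d_seed seed table replaces the full V x d_model embedding matrix.
        self.seeds = nn.Embedding(vocab_size, d_seed)
        # Input hypernet: the current token's seed plus (ctx_window - 1) preceding
        # seeds -> input embedding, making the embedding causal and context-conditioned.
        self.in_hypernet = nn.Sequential(
            nn.Linear(ctx_window * d_seed, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )
        # Output hypernet: regenerates one row of the output projection per seed
        # instead of storing a static V x d_model matrix.
        self.out_hypernet = nn.Sequential(
            nn.Linear(d_seed, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def embed(self, tokens):  # tokens: (B, T) int64
        s = self.seeds(tokens)  # (B, T, d_seed)
        # Left-shifted copies give each position its k-th predecessor's seed
        # (zero-padded at sequence start), keeping the context strictly causal.
        shifted = [F.pad(s, (0, 0, k, 0))[:, : s.size(1)] for k in range(self.ctx_window)]
        return self.in_hypernet(torch.cat(shifted, dim=-1))  # (B, T, d_model)

    def logits(self, h):  # h: (B, T, d_model) hidden states
        w = self.out_hypernet(self.seeds.weight)  # (V, d_model), regenerated at use-time
        return h @ w.t()  # (B, T, V)
```

With these assumed shapes the module has roughly 0.85M parameters, in the same range as (but not exactly matching) the reported 835,584, so the submission's hypernet layer sizes likely differ slightly from this sketch.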
Quantization
- GPTQ: bits: null; scope: hypernet weights and tensors with registered Hessians
- fp16: bits: 16; scope: fallback for tensors without registered Hessians
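A sketch of the dispatch this implies: tensors with a registered calibration Hessian go through GPTQ, everything else stays fp16. The `gptq_quantize` helper and the `hessians` mapping are hypothetical stand-ins for a real GPTQ implementation, and the bit width is left as a parameter since the summary does not specify it:

```python
import torch

def quantize_checkpoint(state_dict, hessians, gptq_quantize, bits):
    """Dispatch sketch: GPTQ where a calibration Hessian is registered, fp16 otherwise."""
    quantized = {}
    for name, weight in state_dict.items():
        hessian = hessians.get(name)  # per-tensor calibration Hessian, if any
        if hessian is not None and weight.dim() == 2:
            # GPTQ solves for low-bit weights that minimize layer output error
            # under the calibration Hessian; gptq_quantize stands in for an
            # implementation from an existing GPTQ library.
            quantized[name] = gptq_quantize(weight, hessian, bits=bits)
        else:
            # Fallback noted above: tensors without a registered Hessian stay fp16.
            quantized[name] = weight.half()
    return quantized
```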
Test-Time Training
- LoRA TTT: parameters: null
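The summary gives no parameters for the LoRA TTT component; a generic sketch of LoRA-based test-time training, with illustrative rank, alpha, learning-rate, and step values, would look like:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapter is trained at test time
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

def test_time_adapt(model, context_loss, steps=4, lr=1e-3):
    """Take a few gradient steps on the test-time context, updating only LoRA parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        loss = context_loss(model)  # e.g. next-token loss on the test context
        opt.zero_grad()
        loss.backward()
        opt.step()
```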
Novel Contributions
- Replaces the static V×d tied embedding with a smaller V×d_seed seed table plus hypernetworks.
- Introduces a causal, context-conditioned input embedding generated at use-time.
- Regenerates output projection weights from the same seed table instead of storing a full static embedding matrix.
- Claims a 5.02× parameter reduction for the embedding component (4,194,304 → 835,584; see the arithmetic check after this list).
- Finds and fixes three integration bugs via static analysis before GPU execution.
- Pre-registers matched-budget and matched-architecture ablation conditions for evaluation.
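Arithmetic check on the claimed reduction, using the stated parameters: the baseline tied embedding is vocab_size × d_model = 8192 × 512 = 4,194,304 weights, and 4,194,304 / 835,584 ≈ 5.02, consistent with the claimed factor. The seed table alone accounts for 8192 × 64 = 524,288 of the replacement's 835,584 parameters, leaving 311,296 for the hypernetworks (assuming they make up the remainder; the summary does not break this down).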