PR #768
Non-record: 1.1201 BPB - Shared ValueEmbedding (tok_emb reuse, layers 5-10) + Legal TTT
by mradassaad
val_bpb: 1.1201
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.9 MB
Training Techniques
Architecture
weight tying
Reuses the tied token embedding (tok_emb) for ValueEmbedding instead of training a separate embedding table, adding a learned projection and per-layer scales.
parameters: {"layers":[5,6,7,8,9,10]}
ValueEmbedding
Expands ValueEmbedding coverage from 2 layers to 6 layers using the parameter budget freed by shared tok_emb reuse.
parameters: {"layers":[5,6,7,8,9,10]}
MLP3x
Uses a 3x MLP with LeakyReLU(0.5)^2 as part of the base stack.
parameters: null
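A hedged sketch of the 3x MLP block with the squared LeakyReLU(0.5) activation; the layer names and bias-free linears are assumptions:

```python
import torch.nn as nn

class MLP3x(nn.Module):
    # 3x-wide MLP with a squared LeakyReLU(0.5) activation, per the description.
    def __init__(self, dim: int):
        super().__init__()
        self.up = nn.Linear(dim, 3 * dim, bias=False)
        self.act = nn.LeakyReLU(0.5)
        self.down = nn.Linear(3 * dim, dim, bias=False)

    def forward(self, x):
        return self.down(self.act(self.up(x)) ** 2)
```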
BigramHash
Includes BigramHash as part of the architecture.
parameters: {"size":1536}
XSA
Uses XSA in the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Applies RoPE to only a subset of dimensions.
parameters: {"dimensions":[16,64]}
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
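For reference, the core of Muon is an approximate orthogonalization of the momentum buffer for 2D weight matrices; the sketch below follows the commonly used Newton-Schulz-5 recipe and omits the "parallel" sharding details, which the PR does not describe:

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize a 2D (momentum-averaged) gradient via the
    # quintic Newton-Schulz iteration used by common Muon implementations
    # (reference code typically runs in bfloat16; float32 here for portability).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.float()
    x = x / (x.norm() + 1e-7)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)
```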
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"type":"Tight SWA"}
Quantization
GPTQ-lite
bits: 6
scope: model weights
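What "GPTQ-lite" simplifies is not stated; as a stand-in, a symmetric per-channel 6-bit round-to-nearest quantizer over the model weights might look like this:

```python
import torch

def quantize_6bit_per_channel(w: torch.Tensor):
    # Symmetric per-output-channel round-to-nearest 6-bit quantization of a
    # 2D weight matrix; a real GPTQ-style pass would also correct quantization
    # error using calibration data.
    qmax = 2 ** (6 - 1) - 1                                    # 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale
```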
Compression
lzma
level: null
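With the preset level unreported, a minimal save path using Python's lzma module at its default preset could be (the file name is illustrative):

```python
import io
import lzma
import torch

def save_compressed(state_dict, path: str = "model.pt.xz"):
    # Serialize the (quantized) state dict, then LZMA-compress it at the
    # library's default preset, since the level is left unspecified.
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with lzma.open(path, "wb") as f:
        f.write(buf.getvalue())
```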
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"chunk_tokens":32768,"batch_seqs":32,"freeze_blocks":0,"grad_clip":1}
Novel Contributions
- Reuses tied tok_emb as the ValueEmbedding source instead of training a separate embedding table.
- Expands ValueEmbedding from layers 9-10 to layers 5-10 using the freed parameter budget.
- Combines shared ValueEmbedding with Legal TTT on top of the PR #549 stack.
- Achieves a 3-seed mean of 1.1201 bpb with consistently sub-16 MB artifacts.