val_bpb
1.0639
Architecture
Transformer
Optimizer
—
Artifact Size
15,972,854 bytes
Training Techniques
Architecture
Gate32
Widened gate window used in the frontier experiments.
parameters: {"gate_window":32,"smear_gate_window":12}
BigramHash
Small causal input feature branch tested for transfer to the #2018 frontier.
parameters: {"vocab_size":512,"dimensions":4,"bits":6}
Path-A-v3
Small Path-A-v3 branch combined with the BigramHash experiment.
parameters: {"small":true}
Other
other
q-aware token-only n-gram tilt applied during training/evaluation.
parameters: {"token_only":true,"dynamic":true}
Test-Time Training
score-first TTT
parameters: null
Sequence Length
sequence_length
train_length: 32
eval_length: 12
Novel Contributions
- Final autopsy of the PR #2018 frontier with three failed transfer attempts.
- Identified a stop rule: abandon a branch if its pre-quantization BPB is about 0.01 worse than the reference run on the same seed, unless the branch adds a proven, legal eval mechanism.
- Showed that Gate32 did not transfer to the #2018 stack.
- Showed that the q-aware n-gram patch was not the root cause of the regression.
- Tested a tiny BigramHash + Path-A-v3-small branch and found it did not recover training quality.
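
The stop rule listed above can be written as a small check. This is a minimal sketch, not code from the PR: the names `BranchResult` and `should_stop`, the field names, and the exact 0.01 cutoff (the source says "about +0.01") are assumptions for illustration, and the BPB values in the usage lines are hypothetical rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class BranchResult:
    """Pre-quantization result of one run (hypothetical container)."""
    name: str
    seed: int
    pre_quant_bpb: float
    # True only if the branch adds an eval mechanism already shown to be legal.
    adds_proven_legal_eval_mechanism: bool = False


def should_stop(branch: BranchResult, reference: BranchResult,
                threshold: float = 0.01) -> bool:
    """Return True if the branch should be abandoned under the stop rule.

    The threshold is an assumed hard cutoff; the source states it only
    as "about +0.01 BPB".
    """
    if branch.seed != reference.seed:
        # The rule is only defined for same-seed comparisons.
        raise ValueError("stop rule requires runs on the same seed")
    if branch.adds_proven_legal_eval_mechanism:
        # Exemption: a proven legal eval mechanism may justify continuing.
        return False
    regression = branch.pre_quant_bpb - reference.pre_quant_bpb
    return regression >= threshold


# Hypothetical numbers, for illustration only.
reference = BranchResult("frontier", seed=0, pre_quant_bpb=1.060)
worse = BranchResult("candidate-a", seed=0, pre_quant_bpb=1.075)
close = BranchResult("candidate-b", seed=0, pre_quant_bpb=1.062)
exempt = BranchResult("candidate-c", seed=0, pre_quant_bpb=1.075,
                      adds_proven_legal_eval_mechanism=True)

print(should_stop(worse, reference))   # regression of 0.015 exceeds the cutoff
print(should_stop(close, reference))   # regression of 0.002 is within tolerance
print(should_stop(exempt, reference))  # exempted by the eval mechanism
```

Raising an error on mismatched seeds, rather than comparing anyway, keeps the check from being applied to runs whose difference may be seed noise.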