PR #894
Non-record: Semantic Tube Regularization — Geometry Improves, BPB Doesn't (Compute–Regularization Tradeoff)
by albertorkive
val_bpb
1.1821
Architecture
Transformer
Optimizer
Muon
Artifact Size
21.45 MB
Training Techniques
Weight Averaging
EMA
parameters: {"alpha":0.997,"from_init":true}
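The EMA entry above (alpha 0.997, `"from_init": true`) can be sketched as a standard exponential moving average over the weights, seeded from the initial parameters. This is a minimal NumPy illustration, not the PR's actual implementation; the dict-of-arrays layout and update cadence are assumptions.

```python
import numpy as np

def ema_update(ema_params, params, alpha=0.997):
    """One EMA step: ema <- alpha * ema + (1 - alpha) * current weights."""
    for k in ema_params:
        ema_params[k] = alpha * ema_params[k] + (1.0 - alpha) * params[k]
    return ema_params

# Hypothetical usage: seed the EMA from the initial weights ("from_init": true),
# then update after every optimizer step.
params = {"w": np.ones(3)}
ema = {k: v.copy() for k, v in params.items()}  # copy of the init
params["w"] = np.full(3, 2.0)                   # pretend one training step moved the weights
ema = ema_update(ema, params, alpha=0.997)
```

With alpha this close to 1, the averaged weights track the raw weights slowly, which is the usual reason EMA evaluation checkpoints are smoother than the live ones.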
Architecture
XSA
XSA enabled on the last 4 layers
parameters: {"layers":4}
SmearGate
SmearGate enabled in the backbone
parameters: null
RoPE
NTK-aware RoPE
parameters: null
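The NTK-aware RoPE entry carries no parameters, so as a point of reference here is the commonly used NTK-aware base rescaling, where the rotary base is inflated by `scale ** (head_dim / (head_dim - 2))` before computing inverse frequencies. The `base`, `scale`, and `head_dim` values below are illustrative assumptions, not values from this PR.

```python
import numpy as np

def ntk_rope_freqs(head_dim, base=10000.0, scale=1.0):
    """RoPE inverse frequencies with NTK-aware base scaling.

    scale > 1 stretches the wavelengths so positions beyond the trained
    context still fall inside familiar phase ranges (assumed variant:
    base' = base * scale ** (head_dim / (head_dim - 2))).
    """
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    # One inverse frequency per rotated coordinate pair.
    return 1.0 / ntk_base ** (np.arange(0, head_dim, 2) / head_dim)

freqs = ntk_rope_freqs(head_dim=64, scale=2.0)
```

The lowest-index frequency is always 1.0 (exponent zero), and frequencies decay monotonically toward the slowest wavelength, which is the one the rescaling stretches the most.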
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Regularization
weight decay
parameters: {"value":0.04}
semantic tube regularization
parameters: {"lambda_tube":0.0005,"loss":"second-difference penalty on hidden-state trajectories"}
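The semantic tube regularizer is described as a second-difference penalty on hidden-state trajectories with `lambda_tube = 0.0005`. A minimal NumPy sketch of that loss term follows; the PR does not pin down which axis forms the trajectory, so penalizing along the sequence axis is an assumption, as is the mean reduction.

```python
import numpy as np

def tube_penalty(hidden, lambda_tube=5e-4):
    """Second-difference (curvature) penalty on a hidden-state trajectory.

    hidden: array of shape (trajectory_len, d_model). Penalizes
    || h[t+1] - 2*h[t] + h[t-1] ||^2, averaged over interior positions,
    scaled by lambda_tube. Straight-line trajectories incur zero penalty.
    """
    second_diff = hidden[2:] - 2.0 * hidden[1:-1] + hidden[:-2]
    return lambda_tube * np.mean(np.sum(second_diff ** 2, axis=-1))

# A linear trajectory has zero curvature, so the penalty vanishes;
# a quadratic one has constant curvature and is penalized.
line = np.outer(np.arange(8.0), np.ones(4))
curved = np.outer(np.arange(8.0) ** 2, np.ones(4))
```

Because the penalty only touches second differences, it discourages sharp bends in the trajectory without pulling hidden states toward any fixed point, which is consistent with the claim below that curvature drops without representation collapse.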
Sequence Length
train_length: 1024, eval_length: 1024
train_length: 1536, eval_length: 1536
train_length: 2048, eval_length: 2048
Novel Contributions
- Semantic tube regularization using a second-difference penalty on hidden-state trajectories
- Discovery that the regularizer improves BPB in cheaper proxy runs but is neutral or slightly harmful on the full compiled fast path
- Demonstration that the regularizer strongly reduces hidden-state curvature and improves drift alignment without representation collapse
- Evidence for a compute-budget-dependent regularization tradeoff in competition settings
- Matched discovery and confirmatory runs across seq1024, seq1536, and seq2048