PR #894
Non-record: Semantic Tube Regularization — Geometry Improves, BPB Doesn't (Compute–Regularization Tradeoff)
by albertorkive
val_bpb
1.1821
Architecture
Transformer
Optimizer
Muon
Artifact Size
21.45 MB
Training Techniques
Weight Averaging
EMA
parameters: {"alpha":0.997,"from_init":true}
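The EMA entry above (alpha 0.997, `"from_init": true`) can be sketched as a standard exponential moving average over the weights, seeded from the initial parameters. This is a minimal NumPy illustration, not the PR's actual implementation; the dict-of-arrays layout and update cadence are assumptions.

```python
import numpy as np

def ema_update(ema_params, params, alpha=0.997):
    """One EMA step: ema <- alpha * ema + (1 - alpha) * current weights."""
    for k in ema_params:
        ema_params[k] = alpha * ema_params[k] + (1.0 - alpha) * params[k]
    return ema_params

# Hypothetical usage: seed the EMA from the initial weights ("from_init": true),
# then update after every optimizer step.
params = {"w": np.ones(3)}
ema = {k: v.copy() for k, v in params.items()}  # copy of the init
params["w"] = np.full(3, 2.0)                   # pretend one training step moved the weights
ema = ema_update(ema, params, alpha=0.997)
```

With alpha this close to 1, the averaged weights track the raw weights slowly, which is the usual reason EMA evaluation checkpoints are smoother than the live ones.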
Architecture
XSA
XSA enabled on the last 4 layers
parameters: {"layers":4}
SmearGate
SmearGate enabled in the backbone
parameters: null
RoPE
NTK-aware RoPE
parameters: null
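The NTK-aware RoPE entry carries no parameters, so as a point of reference here is the commonly used NTK-aware base rescaling, where the rotary base is inflated by `scale ** (head_dim / (head_dim - 2))` before computing inverse frequencies. The `base`, `scale`, and `head_dim` values below are illustrative assumptions, not values from this PR.

```python
import numpy as np

def ntk_rope_freqs(head_dim, base=10000.0, scale=1.0):
    """RoPE inverse frequencies with NTK-aware base scaling.

    scale > 1 stretches the wavelengths so positions beyond the trained
    context still fall inside familiar phase ranges (assumed variant:
    base' = base * scale ** (head_dim / (head_dim - 2))).
    """
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    # One inverse frequency per rotated coordinate pair.
    return 1.0 / ntk_base ** (np.arange(0, head_dim, 2) / head_dim)

freqs = ntk_rope_freqs(head_dim=64, scale=2.0)
```

The lowest-index frequency is always 1.0 (exponent zero), and frequencies decay monotonically toward the slowest wavelength, which is the one the rescaling stretches the most.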
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Regularization
weight decay
parameters: {"value":0.04}
semantic tube regularization
parameters: {"lambda_tube":0.0005,"loss":"second-difference penalty on hidden-state trajectories"}
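The semantic tube regularizer is described as a second-difference penalty on hidden-state trajectories with `lambda_tube = 0.0005`. A minimal NumPy sketch of that loss term follows; the PR does not pin down which axis forms the trajectory, so penalizing along the sequence axis is an assumption, as is the mean reduction.

```python
import numpy as np

def tube_penalty(hidden, lambda_tube=5e-4):
    """Second-difference (curvature) penalty on a hidden-state trajectory.

    hidden: array of shape (trajectory_len, d_model). Penalizes
    || h[t+1] - 2*h[t] + h[t-1] ||^2, averaged over interior positions,
    scaled by lambda_tube. Straight-line trajectories incur zero penalty.
    """
    second_diff = hidden[2:] - 2.0 * hidden[1:-1] + hidden[:-2]
    return lambda_tube * np.mean(np.sum(second_diff ** 2, axis=-1))

# A linear trajectory has zero curvature, so the penalty vanishes;
# a quadratic one has constant curvature and is penalized.
line = np.outer(np.arange(8.0), np.ones(4))
curved = np.outer(np.arange(8.0) ** 2, np.ones(4))
```

Because the penalty only touches second differences, it discourages sharp bends in the trajectory without pulling hidden states toward any fixed point, which is consistent with the claim below that curvature drops without representation collapse.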
Sequence Length
train_length: 1024, eval_length: 1024
train_length: 1536, eval_length: 1536
train_length: 2048, eval_length: 2048
Novel Contributions
- Semantic tube regularization using a second-difference penalty on hidden-state trajectories
- Discovery that the regularizer improves BPB in cheaper proxy runs but is neutral or slightly harmful on the full compiled fast path
- Demonstration that the regularizer strongly reduces hidden-state curvature and improves drift alignment without representation collapse
- Evidence for a compute-budget-dependent regularization tradeoff in competition settings
- Matched discovery and confirmatory runs across seq1024, seq1536, and seq2048