PR #663

open

V18 Manifold-Guided Architecture — val_bpb 0.434

by raahilg
val_bpb
0.4380
Architecture
GNN-like message passing network on a precomputed token interaction graph
Optimizer
Artifact Size
15.70 MB

Training Techniques

Architecture
manifold-guided token interaction graph
Precomputes a frozen token manifold from corpus co-occurrence statistics and uses it as graph topology for message passing instead of learning token geometry from scratch.
parameters: {"vocab":1024,"spectral_dims":320,"hops":4,"attention_heads":2,"hidden_dim":500}
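As a rough illustration of how a fixed graph topology can be derived from co-occurrence statistics, the sketch below counts windowed co-occurrences and keeps the top-k neighbors of each token. The function name, the window size, and the top-k rule are assumptions made for illustration; the PR summary does not state how edges are actually selected.

```python
from collections import Counter, defaultdict

def cooccurrence_graph(tokens, window=2, top_k=2):
    """Return {token: [neighbor, ...]} from symmetric windowed co-occurrence counts.

    Hypothetical helper: `window` and `top_k` are illustrative, not from the PR.
    """
    counts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        # Look at the next `window` positions and count each pair in both directions.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            other = tokens[j]
            if other != tok:
                counts[tok][other] += 1
                counts[other][tok] += 1
    graph = {}
    for tok, ctr in counts.items():
        # Highest count first; ties broken by token id so the result is deterministic.
        ranked = sorted(ctr.items(), key=lambda kv: (-kv[1], kv[0]))
        graph[tok] = [nbr for nbr, _ in ranked[:top_k]]
    return graph

tokens = [0, 1, 2, 0, 1, 3, 0, 1]
graph = cooccurrence_graph(tokens)
```

Because the graph is computed once from corpus statistics, it can be frozen and reused as the neighbor structure for every message-passing hop.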
sparsemax routing
Uses sparsemax-weighted aggregation for differentiable sparse edge routing along manifold geodesics.
parameters: null
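For readers unfamiliar with sparsemax, here is a minimal pure-Python sketch of the standard sparsemax projection applied to a list of edge scores. It shows only the generic operator; how the PR computes the scores or applies the weights along manifold edges is not specified, so the usage is an assumption.

```python
def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex.

    Unlike softmax, low-scoring entries receive exactly zero weight.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The support size is the largest k with 1 + k * z_k > sum of the top-k scores.
        # The condition holds on a prefix of k, so the last update gives the threshold.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

edge_weights = sparsemax([2.0, 1.0, 0.1])
```

With these scores the first edge dominates and the other two are pruned entirely, which is the sparse-routing behavior the operator is chosen for.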
spectrally-modulated gated hop cells
Hop updates are conditioned on spectral features of the frozen manifold: coordinates from the Hessian eigendecomposition together with SVD coordinates.
parameters: {"hops":4}
manifold-guided attention
Applies attention conditioned on manifold/spectral coordinates to exploit the frozen geometric prior.
parameters: {"heads":2}
parallel transport across token manifold
Uses manifold-aware transport of representations across the token graph.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.999,"snapshot_at_best_loss":true}
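The EMA settings above can be sketched as a small tracker that keeps a shadow copy of the weights and snapshots it whenever the loss improves. The class name and the use of plain floats as parameters are illustrative assumptions, and the summary does not say which loss is monitored.

```python
class EmaTracker:
    """Hypothetical EMA helper with a best-loss snapshot."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)        # running average of the weights
        self.best_loss = float("inf")
        self.best_snapshot = None

    def update(self, params, loss):
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
        # Keep a copy of the averaged weights at the lowest loss seen so far.
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_snapshot = dict(self.shadow)

ema = EmaTracker({"w": 0.0}, decay=0.5)   # small decay only to make the arithmetic visible
ema.update({"w": 1.0}, loss=2.0)
ema.update({"w": 1.0}, loss=3.0)          # worse loss: snapshot is left unchanged
```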
LR Schedule
cosine decay + hold + linear warmdown
parameters: {"cosine_decay_to_fraction":0.1,"cosine_decay_steps":3400,"hold_steps":[3400,5500],"linear_warmdown_to_zero":true}
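One way to read these schedule parameters is as a multiplier on the base learning rate: cosine decay to 0.1 over the first 3400 steps, a hold at 0.1 until step 5500, then a linear ramp to zero. The sketch below assumes that reading; `total_steps` is a hypothetical argument, since the summary does not give the final step count.

```python
import math

def lr_multiplier(step, total_steps, cosine_steps=3400, hold_end=5500, floor=0.1):
    """Learning-rate multiplier under the assumed three-phase schedule.

    `total_steps` must be greater than `hold_end`; its real value is not in the PR summary.
    """
    if step < cosine_steps:
        # Phase 1: cosine decay from 1.0 down to `floor`.
        cos = 0.5 * (1.0 + math.cos(math.pi * step / cosine_steps))
        return floor + (1.0 - floor) * cos
    if step < hold_end:
        # Phase 2: hold at the floor.
        return floor
    # Phase 3: linear warmdown from `floor` to zero at `total_steps`.
    remaining = max(total_steps - step, 0)
    return floor * remaining / (total_steps - hold_end)
```

For example, with an assumed `total_steps=7000`, the multiplier is 1.0 at step 0, 0.1 throughout the hold, and 0.0 at step 7000.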
Quantization
int8
bits: 8
scope: per-row weights
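A minimal sketch of symmetric per-row int8 quantization follows, using one scale per weight row. It uses the row's maximum absolute value as the clipping range for simplicity; the PR also mentions percentile clipping, which would replace `max_abs` with a percentile of the absolute values. Function names are illustrative.

```python
def quantize_rows(matrix):
    """Quantize each row to int8 values in [-127, 127] with its own scale."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(x) for x in row)
        # An all-zero row gets a dummy scale so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Recover approximate float weights from int8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

weights = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0]]
q_rows, scales = quantize_rows(weights)
restored = dequantize_rows(q_rows, scales)
```

Per-row scales keep a single large-magnitude row from wasting the int8 range of every other row.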
Compression
zlib
level: null
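Since no compression level is recorded, the sketch below assumes zlib's default level and shows a lossless round trip for a list of int8 values. The helper names and the byte layout are illustrative assumptions, not the PR's artifact format.

```python
import zlib

def pack_int8(values):
    """Serialize signed int8 values to bytes and deflate them with zlib's default level."""
    raw = bytes(v & 0xFF for v in values)   # two's-complement byte per value
    return zlib.compress(raw)

def unpack_int8(blob):
    """Inverse of pack_int8: inflate and map bytes back to signed values."""
    raw = zlib.decompress(blob)
    return [b - 256 if b > 127 else b for b in raw]

values = [0, -1, 127, -127, 5] * 100
blob = pack_int8(values)
```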
Initialization
deterministic physics simulation initialization
Token manifold positions are initialized by a fixed-seed CPU physics simulation based on co-occurrence-derived forces.
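To make the idea of a fixed-seed simulation concrete, here is a toy spring relaxation in pure Python. It models only a single attractive spring force per edge with an assumed rest length; the force model, step size, and dimensions are illustrative assumptions and far simpler than the forces described in this PR.

```python
import math
import random

def simulate_positions(edges, num_tokens, dim=2, steps=200, step_size=0.05,
                       rest_length=1.0, seed=0):
    """Relax spring-connected points from a seeded random start; returns positions.

    `edges` maps (i, j) pairs to spring weights, e.g. co-occurrence strengths.
    """
    rng = random.Random(seed)  # private generator, so results depend only on `seed`
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(steps):
        force = [[0.0] * dim for _ in range(num_tokens)]
        for (i, j), weight in sorted(edges.items()):  # fixed order keeps float sums stable
            delta = [pos[j][d] - pos[i][d] for d in range(dim)]
            dist = math.sqrt(sum(x * x for x in delta))
            if dist == 0.0:
                continue
            # Hooke-style spring: pull together when stretched, push apart when compressed.
            pull = weight * (dist - rest_length) / dist
            for d in range(dim):
                force[i][d] += pull * delta[d]
                force[j][d] -= pull * delta[d]
        for i in range(num_tokens):
            for d in range(dim):
                pos[i][d] += step_size * force[i][d]
    return pos

run_a = simulate_positions({(0, 1): 1.0}, num_tokens=2, seed=0)
run_b = simulate_positions({(0, 1): 1.0}, num_tokens=2, seed=0)
```

Running on CPU with a private seeded generator and a fixed edge order means repeated runs produce identical positions, which is the property the initialization relies on.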
Other
other
Builds a frozen token manifold from co-occurrence, directional torsion, entropic mass, directed springs, and syntactic bigram forces, then computes a Hessian eigendecomposition and SVD coordinates that serve as spectral features.
parameters: {"physics_steps":5000,"spectral_modes":256,"svd_coords":64}
other
Uses deterministic compilation settings to avoid nondeterministic kernel selection.
parameters: {"max_autotune":false}
other
Single-GPU training with a selective gradient strategy that preserves hop specialization: hop parameters use rank 0's local gradients, while gradients for non-hop parameters are averaged.
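The selection rule can be sketched independently of any training framework. The example below takes one gradient dictionary per rank, keeps rank 0's gradient for hop parameters, and averages the rest. The function name, the `hop.` naming prefix, and scalar gradients are assumptions for illustration; the summary does not explain how ranks arise in a single-GPU run.

```python
def combine_gradients(per_rank_grads, is_hop_param):
    """Merge per-rank gradients: rank 0 only for hop parameters, mean for the others."""
    combined = {}
    world_size = len(per_rank_grads)
    for name in per_rank_grads[0]:
        if is_hop_param(name):
            # Hop parameters keep the rank 0 gradient untouched.
            combined[name] = per_rank_grads[0][name]
        else:
            # All other parameters use the average across ranks.
            combined[name] = sum(g[name] for g in per_rank_grads) / world_size
    return combined

grads = [
    {"hop.0.w": 1.0, "embed.w": 2.0},   # rank 0
    {"hop.0.w": 5.0, "embed.w": 4.0},   # rank 1
]
merged = combine_gradients(grads, lambda name: name.startswith("hop."))
```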
parameters: null

Novel Contributions

  • Frozen precomputed token manifold used as graph topology for the model
  • Physics-simulated manifold construction from corpus co-occurrence statistics
  • Sparsemax routing along manifold geodesics
  • Spectral-coordinate-conditioned attention and gated hop updates
  • EMA snapshot at best loss for improved quantization
  • Adaptive per-row int8 quantization with percentile clipping
  • Deterministic physics simulation and deterministic compilation for reproducibility
  • Selective gradient strategy to preserve hop specialization in single-GPU training