PR #663

open

V18 Manifold-Guided Architecture — val_bpb 0.434

by raahilg

val_bpb

0.4380

Architecture

GNN-like message passing network on a precomputed token interaction graph

Optimizer

—

Artifact Size

15.70 MB

Training Techniques

Architecture

manifold-guided token interaction graph

Precomputes a frozen token manifold from corpus co-occurrence statistics and uses it as graph topology for message passing instead of learning token geometry from scratch.

parameters: {"vocab":1024,"spectral_dims":320,"hops":4,"attention_heads":2,"hidden_dim":500}
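As a rough illustration of using corpus co-occurrence as fixed graph topology, the sketch below counts windowed co-occurrences, keeps the top-k neighbors per token, and runs one plain mean-aggregation hop over that frozen graph. The window size, the top-k rule, the 50/50 residual mix, and both function names are assumptions for illustration. The PR builds its manifold with a physics simulation and uses gated hop cells, neither of which is reproduced here.

```python
from collections import Counter, defaultdict

def cooccurrence_graph(token_ids, window=2, top_k=3):
    """Build a frozen neighbor list per token from windowed co-occurrence counts."""
    counts = defaultdict(Counter)
    n = len(token_ids)
    for i, tok in enumerate(token_ids):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                counts[tok][token_ids[j]] += 1
    graph = {}
    for tok, ctr in counts.items():
        # Sort by descending count, then by token id so ties are deterministic.
        ranked = sorted(ctr.items(), key=lambda kv: (-kv[1], kv[0]))
        graph[tok] = [nbr for nbr, _ in ranked[:top_k]]
    return graph

def message_passing_hop(features, graph):
    """One hop: mix each token's vector with the mean of its neighbors' vectors."""
    out = {}
    for tok, vec in features.items():
        nbrs = [features[nb] for nb in graph.get(tok, []) if nb in features]
        if not nbrs:
            out[tok] = list(vec)  # isolated token: leave unchanged
            continue
        mean = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        out[tok] = [0.5 * a + 0.5 * b for a, b in zip(vec, mean)]
    return out

graph = cooccurrence_graph([0, 1, 0, 1, 2], window=1, top_k=3)
updated = message_passing_hop({0: [1.0], 1: [3.0], 2: [5.0]}, graph)
```

Because the graph is computed once from the corpus and never updated, only the per-token features change during training in this sketch.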

sparsemax routing

Uses sparsemax-weighted aggregation for differentiable sparse edge routing along manifold geodesics.

parameters: null
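For reference, sparsemax is the Euclidean projection of a score vector onto the probability simplex; unlike softmax it can assign exactly zero weight to low-scoring edges. The sketch below is a minimal pure-Python version of the standard sort-based algorithm, followed by a weighted aggregation of neighbor vectors. How the PR computes edge scores is not stated, so `scores` and `aggregate` are hypothetical.

```python
def sparsemax(scores):
    """Project `scores` onto the probability simplex (sparse alternative to softmax)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for i, zi in enumerate(z, start=1):
        cumsum += zi
        # Entry i is in the support while 1 + i * z_i > sum of the top-i scores.
        if 1.0 + i * zi > cumsum:
            tau = (cumsum - 1.0) / i
    # Subtract the threshold and clip: entries outside the support become exactly 0.
    return [max(s - tau, 0.0) for s in scores]

def aggregate(scores, neighbor_vectors):
    """Sparsemax-weighted sum of neighbor vectors (hypothetical aggregation step)."""
    weights = sparsemax(scores)
    dim = len(neighbor_vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, neighbor_vectors)) for d in range(dim)]

weights = sparsemax([1.0, 0.8, -1.0])
message = aggregate([2.0, 1.0, 0.1], [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
```

In the second call the top score dominates by a margin of at least 1, so all weight lands on the first neighbor and the other two edges are dropped entirely.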

spectrally-modulated gated hop cells

Hop updates are conditioned on spectral coordinates derived from Hessian eigendecomposition and SVD coordinates.

parameters: {"hops":4}

manifold-guided attention

Applies attention conditioned on manifold/spectral coordinates to exploit the frozen geometric prior.

parameters: {"heads":2}

parallel transport across token manifold

Uses manifold-aware transport of representations across the token graph.

parameters: null

Weight Averaging

EMA

parameters: {"decay":0.999,"snapshot_at_best_loss":true}
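A minimal sketch of EMA weight averaging with a snapshot taken whenever the loss improves, assuming scalar parameters in a dict for readability. The class name and the choice of which loss is monitored are assumptions; the summary only gives the decay (0.999) and the snapshot flag.

```python
class EmaTracker:
    """Keeps an exponential moving average of parameters and a best-loss snapshot."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)         # running EMA of each parameter
        self.best_loss = float("inf")
        self.best_snapshot = dict(params)  # EMA weights at the lowest loss seen

    def update(self, params, loss):
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
        # Snapshot the averaged weights, not the raw ones, when the loss improves.
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_snapshot = dict(self.shadow)

ema = EmaTracker({"w": 0.0}, decay=0.5)  # small decay only to make the arithmetic visible
ema.update({"w": 1.0}, loss=2.0)
ema.update({"w": 1.0}, loss=3.0)         # worse loss: snapshot is not replaced
```

The snapshot is what would be quantized and exported in this sketch, so later, noisier steps cannot degrade the saved artifact.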

LR Schedule

cosine decay + hold + linear warmdown

parameters: {"cosine_decay_to_fraction":0.1,"cosine_decay_steps":3400,"hold_steps":[3400,5500],"linear_warmdown_to_zero":true}
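The schedule parameters above can be read as a three-phase multiplier on the peak learning rate. The sketch below assumes the cosine phase decays from 1.0 to 0.1 over steps 0 to 3400, the hold phase keeps 0.1 until step 5500, and the warmdown falls linearly to zero at `total_steps`. The total step count and peak learning rate are not given in the summary, so `total_steps` is a hypothetical argument and the function returns a unitless multiplier.

```python
import math

def lr_multiplier(step, total_steps, cosine_steps=3400, hold_end=5500, floor=0.1):
    """Cosine decay to `floor`, hold, then linear warmdown to zero."""
    if step < cosine_steps:
        progress = step / cosine_steps
        return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))
    if step < hold_end:
        return floor
    if step >= total_steps:
        return 0.0
    # Linear ramp from `floor` at hold_end down to 0 at total_steps.
    return floor * (total_steps - step) / (total_steps - hold_end)

# Example with an assumed run length of 7000 steps.
samples = [lr_multiplier(s, total_steps=7000) for s in (0, 1700, 3400, 5000, 6250, 7000)]
```

With these assumed settings the multiplier is about 1.0, 0.55, 0.1, 0.1, 0.05, and 0.0 at the sampled steps.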

Quantization

int8

bits: 8

scope: per-row weights

Compression

zlib

level: null
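A sketch of per-row symmetric int8 quantization followed by zlib compression, assuming each row gets its own scale derived from a percentile of its absolute values. The percentile value, the nearest-rank percentile rule, and the function names are assumptions; the summary does not specify them, and a real artifact would also need to store the per-row scales.

```python
import math
import zlib

def quantize_rows(matrix, clip_percentile=99.9):
    """Quantize each row to int8 with its own scale; returns (int rows, scales)."""
    q_rows, scales = [], []
    for row in matrix:
        mags = sorted(abs(x) for x in row)
        # Nearest-rank percentile of |x| sets the clipping threshold for this row.
        rank = math.ceil(clip_percentile / 100 * len(mags))
        clip = mags[min(len(mags) - 1, max(0, rank - 1))]
        scale = clip / 127.0 if clip > 0 else 1.0
        # Round to the nearest integer level, then clamp outliers to [-127, 127].
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Recover approximate float rows from int8 levels and per-row scales."""
    return [[v * s for v in row] for row, s in zip(q_rows, scales)]

def pack(q_rows):
    """Serialize int8 levels as two's-complement bytes and zlib-compress them."""
    raw = bytes(v & 0xFF for row in q_rows for v in row)
    return zlib.compress(raw)

q, s = quantize_rows([[127.0, -64.0, 32.0, 0.0]], clip_percentile=100)
blob = pack(q)
```

Lowering `clip_percentile` trades clipping error on a few outliers for finer resolution on the bulk of each row.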

Initialization

deterministic physics simulation initialization

Token manifold positions are initialized by a fixed-seed CPU physics simulation based on co-occurrence-derived forces.

Other

other

Builds a frozen token manifold from co-occurrence, directional torsion, entropic mass, directed springs, and syntactic bigram forces, then computes Hessian eigendecomposition and SVD coordinates for spectral features.

parameters: {"physics_steps":5000,"spectral_modes":256,"svd_coords":64}

other

Uses deterministic compilation settings to avoid nondeterministic kernel selection.

parameters: {"max_autotune":false}

other

Single-GPU training with a selective gradient strategy to preserve hop specialization: hop parameters use rank 0's local gradients, while gradients for non-hop parameters are averaged.

parameters: null
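One way to read the selective strategy is as a per-parameter rule for combining gradients: hop parameters keep rank 0's gradient, everything else is averaged. The sketch below assumes gradients from several ranks are available as plain lists (which the mention of rank 0 suggests) and that hop parameters can be identified by a name prefix. Both the `hop.` prefix and the function name are hypothetical.

```python
def combine_gradients(per_rank_grads, is_hop_param):
    """per_rank_grads: list indexed by rank, each a {param_name: gradient list}."""
    world = len(per_rank_grads)
    combined = {}
    for name, rank0_grad in per_rank_grads[0].items():
        if is_hop_param(name):
            # Hop parameters: use rank 0's local gradient unchanged.
            combined[name] = list(rank0_grad)
        else:
            # Non-hop parameters: element-wise mean over all ranks.
            combined[name] = [
                sum(rank[name][i] for rank in per_rank_grads) / world
                for i in range(len(rank0_grad))
            ]
    return combined

grads = [
    {"hop.0.w": [1.0, 2.0], "embed.w": [2.0]},  # rank 0
    {"hop.0.w": [9.0, 9.0], "embed.w": [4.0]},  # rank 1
]
combined = combine_gradients(grads, lambda name: name.startswith("hop."))
```

Skipping the average for hop parameters keeps each hop's update tied to one data shard in this sketch, which is one plausible way to avoid washing out per-hop differences.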

Novel Contributions

Frozen precomputed token manifold used as graph topology for the model
Physics-simulated manifold construction from corpus co-occurrence statistics
Sparsemax routing along manifold geodesics
Spectral-coordinate-conditioned attention and gated hop updates
EMA snapshot at best loss for improved quantization
Adaptive per-row int8 quantization with percentile clipping
Deterministic physics simulation and deterministic compilation for reproducibility
Selective gradient strategy to preserve hop specialization in single-GPU training