PR #1081

open

Non-record: Swarm-Guided KG-Conditioned Training (val_bpb=1.1220)

by michaelwinczuk
val_bpb
1.1220
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,955,969 bytes

Training Techniques

Quantization
int6
bits: 6
scope: all
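For readers unfamiliar with 6-bit weight quantization, the following is a minimal sketch of one common scheme: symmetric per-tensor scaling into the signed range [-31, 31]. The PR summary states only the bit width and scope, so the scaling scheme and the function names `quantize_int6` and `dequantize_int6` are assumptions made for illustration.

```python
def quantize_int6(values, bits=6):
    """Symmetric quantization of floats to signed integer codes.

    Assumption: one scale per tensor, chosen so the largest magnitude
    maps to the largest code (31 for 6 bits).
    """
    qmax = 2 ** (bits - 1) - 1  # 31 when bits == 6
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int6(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


codes, scale = quantize_int6([0.5, -1.0, 0.25])
restored = dequantize_int6(codes, scale)
```

Under this scheme the reconstruction error of each weight is bounded by half the scale.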
QAT
bits: null
scope: all
Architecture
BigramHash
Uses bigram hashing as part of the base stack.
parameters: {"size":2048}
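As a rough illustration of bigram hashing, the sketch below maps each (previous token, current token) pair to one of 2048 buckets, which could then index a small embedding table. It assumes that `size` is the bucket count. The mixing function, the sentinel id for the first position, and the names `bigram_bucket` and `bigram_buckets` are hypothetical and not taken from the PR.

```python
BUCKETS = 2048  # "size" from the parameters above, read here as the bucket count


def bigram_bucket(prev_id, cur_id, buckets=BUCKETS):
    """Hash a pair of non-negative token ids into a bucket index."""
    # Hypothetical mixing step: multiply by an arbitrary odd constant, then XOR.
    mixed = (prev_id * 1000003) ^ cur_id
    return mixed % buckets


def bigram_buckets(token_ids, buckets=BUCKETS):
    """Return one bucket index per position in the sequence."""
    prev = 0  # hypothetical sentinel id for "no previous token"
    out = []
    for tok in token_ids:
        out.append(bigram_bucket(prev, tok, buckets))
        prev = tok
    return out


indices = bigram_buckets([5, 17, 5, 17])
```

Identical bigrams always land in the same bucket, while distinct bigrams may collide, which is the trade-off that keeps the table small.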
SmearGate
Applies SmearGate in the model stack.
parameters: null
XSA
Uses XSA in the last layers.
parameters: {"layers":4}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: null
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"squared":true,"slope":0.75}
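One plausible reading of "LeakyReLU squared" with the listed slope is to apply LeakyReLU and then square the result elementwise. The scalar sketch below follows that reading. The order of operations and the function name `leaky_relu_squared` are assumptions, since the summary does not show the implementation.

```python
def leaky_relu_squared(x, slope=0.75):
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0 else slope * x
    return y * y


positive = leaky_relu_squared(2.0)   # 2.0 squared
negative = leaky_relu_squared(-2.0)  # (0.75 * -2.0) squared
```

Under this reading the output is always non-negative, and negative inputs are attenuated by the slope before squaring rather than zeroed out.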
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
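The standard exponential moving average update with the listed decay can be sketched as follows. The function name `ema_update` and the flat-list representation of weights are illustrative assumptions. The PR's actual code presumably operates on model tensors.

```python
def ema_update(ema, weights, decay=0.997):
    """Blend the current weights into the running average, in one step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = [1.0, 2.0]
ema = ema_update(ema, [0.0, 2.0])
```

With a decay of 0.997, each step moves the average 0.3% of the way toward the current weights.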
Evaluation
sliding window eval
parameters: {"stride":64}
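Sliding window evaluation is commonly implemented by advancing a fixed-length context by the stride and scoring only the tokens not already scored by an earlier window. The sketch below computes those window boundaries. The context length of 128 in the usage line and the function name `sliding_windows` are hypothetical, and only the stride of 64 comes from the summary.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Return (begin, end, score_start) triples covering the sequence.

    Each window feeds tokens [begin, end) to the model and scores only
    tokens [score_start, end), so every token is scored exactly once.
    """
    if context_len < stride:
        raise ValueError("context_len must be at least stride")
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows


windows = sliding_windows(n_tokens=300, context_len=128, stride=64)
```

After the first window, every scored token has at least `context_len - stride` tokens of preceding context, which is why a smaller stride tends to give lower measured loss at higher evaluation cost.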
Test-Time Training
score-first TTT
parameters: null
Regularization
LN scale
parameters: null
Other
other
A swarm of 4 rule-based agents steers training decisions by consensus voting.
parameters: {"agents":4,"decision_interval_steps":800}
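To make the consensus mechanism concrete, here is a minimal sketch of four rule-based agents voting at a fixed step interval. Only the agent count and the 800-step interval come from the summary. The quorum of 3, the agent rules, the statistics they read, and the action names are all hypothetical.

```python
from collections import Counter

DECISION_INTERVAL = 800  # steps, from the parameters above


def consensus(votes, quorum=3):
    """Return the most common vote if at least `quorum` agents cast it."""
    action, count = Counter(votes).most_common(1)[0]
    return action if count >= quorum else None


# Hypothetical rule-based agents: each maps training statistics to a vote.
AGENTS = [
    lambda s: "enable_qat" if s["progress"] >= 0.5 else "hold",
    lambda s: "enable_qat" if s["grad_norm"] < 2.0 else "hold",
    lambda s: "enable_qat" if s["loss_delta"] > -0.01 else "hold",
    lambda s: "hold",
]


def maybe_decide(step, stats, agents=AGENTS):
    """Run a vote only on decision steps; return the agreed action or None."""
    if step == 0 or step % DECISION_INTERVAL != 0:
        return None
    return consensus([agent(stats) for agent in agents])


stats = {"progress": 0.6, "grad_norm": 1.0, "loss_delta": 0.0}
decision = maybe_decide(800, stats)
```

Because the agents are plain rules rather than learned models, a vote costs a handful of comparisons per decision step.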
other
Knowledge graph-conditioned embedding initialization using PageRank-derived token importance scores from a 500K-node typed-edge knowledge graph.
parameters: {"nodes":500292,"edges":121084,"token_importance_scores":358}
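The PageRank step can be illustrated with a small power-iteration sketch over a directed edge list. It ignores edge types and uses a conventional damping factor of 0.85, both of which are assumptions. How the resulting scores enter embedding initialization is not described in the summary and is not modeled here.

```python
def pagerank(edges, n_nodes, damping=0.85, iters=50):
    """Power-iteration PageRank on a directed graph given as (src, dst) pairs."""
    out_links = [[] for _ in range(n_nodes)]
    for src, dst in edges:
        out_links[src].append(dst)

    rank = [1.0 / n_nodes] * n_nodes
    for _ in range(iters):
        new_rank = [(1.0 - damping) / n_nodes] * n_nodes
        for src, dsts in enumerate(out_links):
            if dsts:
                # Split this node's rank evenly among its out-neighbors.
                share = damping * rank[src] / len(dsts)
                for dst in dsts:
                    new_rank[dst] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[src] / n_nodes
                for dst in range(n_nodes):
                    new_rank[dst] += share
        rank = new_rank
    return rank


scores = pagerank([(0, 2), (1, 2), (2, 0)], n_nodes=3)
```

In this toy graph, node 2 receives links from both other nodes and ends up with the highest score, while node 1 has no incoming links and keeps only the baseline share.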
LR Schedule
warmdown
parameters: {"qat_safety_deadline":0.65}

Novel Contributions

  • Multi-agent swarm that makes training-time decisions via consensus voting
  • Knowledge graph-conditioned embedding initialization using a large typed-edge knowledge graph
  • Rule-based agent roles for QAT timing, KG weighting, gradient health, and MTP weighting
  • Very low-overhead swarm control integrated into the training loop
  • Transparent decision log for training interventions