PR #1081

open

Non-record: Swarm-Guided KG-Conditioned Training (val_bpb=1.1220)

by michaelwinczuk

val_bpb

1.1220

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

15,955,969 bytes

Training Techniques

Quantization

int6

bits: 6

scope: all
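
To make the int6 entry concrete, here is a minimal sketch of symmetric per-tensor quantization to 6-bit signed codes. The PR summary only states "bits: 6, scope: all"; the per-tensor scale, the rounding rule, and the function names `quantize_int6` and `dequantize_int6` are assumptions for illustration, not the PR's actual scheme.

```python
def quantize_int6(weights):
    """Symmetric per-tensor quantization to signed 6-bit codes in [-31, 31].

    Assumption: one scale per tensor; the PR may use per-row or per-group scales.
    """
    qmax = 2 ** (6 - 1) - 1  # 31, the largest magnitude a signed 6-bit code uses here
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int6(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With this scheme each weight is reconstructed to within half a quantization step (`scale / 2`), which is the error that quantization-aware training would be asked to absorb.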

QAT

bits: null

scope: all

Architecture

BigramHash

Uses bigram hashing as part of the base stack.

parameters: {"size":2048}
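
As a rough illustration of bigram hashing, the sketch below maps each (previous token, current token) pair to one of 2048 buckets, which could then index an extra embedding table. Reading "size" as the bucket count is an assumption, and the hash function, the `bos_id` padding, and the function names are hypothetical; the PR summary does not specify them.

```python
def bigram_bucket(prev_id, cur_id, num_buckets=2048):
    """Hash a token pair into a bucket index (hypothetical multiplicative hash)."""
    return ((prev_id * 1000003) ^ cur_id) % num_buckets


def bigram_bucket_ids(token_ids, num_buckets=2048, bos_id=0):
    """Return one bucket index per position, pairing each token with its predecessor.

    The first position has no predecessor, so it is paired with `bos_id` (an assumption).
    """
    prev = [bos_id] + list(token_ids[:-1])
    return [bigram_bucket(p, c, num_buckets) for p, c in zip(prev, token_ids)]
```

Identical bigrams always land in the same bucket, while distinct bigrams may collide; the small table size trades collisions for a compact artifact.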

SmearGate

Applies SmearGate in the model stack.

parameters: null

XSA

Uses XSA in the last layers.

parameters: {"layers":4}

Partial RoPE

Applies partial rotary positional embeddings.

parameters: null

LeakyReLU

Uses LeakyReLU squared activation.

parameters: {"squared":true,"slope":0.75}
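
One plausible reading of "LeakyReLU squared" with slope 0.75 is to apply a leaky ReLU and then square the result. The scalar sketch below assumes that order of operations; the function name is illustrative and the PR's exact definition is not given in this summary.

```python
def leaky_relu_squared(x, slope=0.75):
    """Leaky ReLU followed by squaring (assumed order of operations)."""
    y = x if x >= 0 else slope * x
    return y * y
```

Under this reading the output is non-negative everywhere: `leaky_relu_squared(-2.0)` evaluates `(-1.5) ** 2`, so negative inputs still contribute a smaller positive activation.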

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: null

Weight Averaging

EMA

parameters: {"decay":0.997}
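
For reference, an exponential moving average with decay 0.997 keeps a shadow copy of the weights that moves 0.3% of the way toward the live weights on each update. A minimal sketch over flat lists of floats follows; the function name and the list representation are illustrative, not the PR's code.

```python
def ema_update(ema, params, decay=0.997):
    """Return the new EMA weights after one update toward the live `params`."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

Calling this once per optimizer step and evaluating with the averaged weights is the usual pattern; how often the PR updates its average is not stated in this summary.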

Evaluation

sliding window eval

parameters: {"stride":64}
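
Sliding-window evaluation with stride 64 can be sketched as follows: each window advances by 64 tokens, and only the tokens not yet scored by an earlier window count toward the loss, so every token is scored exactly once with as much left context as the window allows. The `window` length and the function name are assumptions; the summary gives only the stride.

```python
def sliding_windows(num_tokens, window, stride=64):
    """Return (start, end, score_from) spans for sliding-window evaluation.

    The model sees tokens[start:end]; only positions score_from..end-1
    contribute to the loss for that window.
    """
    if stride <= 0 or stride > window:
        raise ValueError("stride must be in 1..window so no token is skipped")
    spans = []
    scored_up_to = 0  # first position not yet scored by any window
    start = 0
    while scored_up_to < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_up_to))
        scored_up_to = end
        start += stride
    return spans
```

For example, `sliding_windows(256, window=128)` yields three spans: the first scores all 128 of its tokens, and each later window scores only its final 64.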

Test-Time Training

score-first TTT

parameters: null

Regularization

LN scale

parameters: null

Other

other

A multi-agent swarm of 4 rule-based agents steers training decisions by consensus voting, with a decision point every 800 steps.

parameters: {"agents":4,"decision_interval_steps":800}
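
The voting loop could look roughly like the sketch below: every 800 steps each agent inspects the current metrics and casts a vote, and an action is taken only if a strict majority agrees. Only the agent count and the interval come from the summary; the majority rule, the action names, and the functions `consensus` and `maybe_decide` are hypothetical.

```python
from collections import Counter

DECISION_INTERVAL_STEPS = 800  # from the PR's listed parameters


def consensus(votes, quorum=0.5):
    """Return the most common vote if it exceeds `quorum` of all votes, else None."""
    if not votes:
        return None
    action, count = Counter(votes).most_common(1)[0]
    return action if count / len(votes) > quorum else None


def maybe_decide(step, metrics, agents):
    """Poll the agents at decision steps only; return the agreed action or None.

    Each agent is a callable taking the metrics dict and returning an action string.
    """
    if step == 0 or step % DECISION_INTERVAL_STEPS != 0:
        return None
    return consensus([agent(metrics) for agent in agents])
```

Because the agents are plain rules evaluated once every 800 steps, a loop like this adds negligible cost per training step, and appending each `(step, votes, action)` tuple to a list would give a simple decision log.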

other

Knowledge graph-conditioned embedding initialization using PageRank-derived token importance scores from a 500K-node typed-edge knowledge graph.

parameters: {"nodes":500292,"edges":121084,"token_importance_scores":358}
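
To illustrate how PageRank scores might feed embedding initialization, the sketch below runs power-iteration PageRank on a small directed graph and rescales the scores into per-token multipliers. The damping factor, the iteration count, the `[0.5, 1.5]` multiplier range, and the idea of scaling initial embedding rows by these multipliers are all assumptions; the summary does not say how the 358 token importance scores are applied.

```python
def pagerank(edges, num_nodes, damping=0.85, iters=50):
    """Power-iteration PageRank over directed (src, dst) edges."""
    out = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out[src].append(dst)
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iters):
        nxt = [(1.0 - damping) / num_nodes] * num_nodes
        for src, dsts in enumerate(out):
            if dsts:
                share = damping * rank[src] / len(dsts)
                for dst in dsts:
                    nxt[dst] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[src] / num_nodes
                for i in range(num_nodes):
                    nxt[i] += share
        rank = nxt
    return rank


def importance_scale(rank, low=0.5, high=1.5):
    """Linearly map ranks to multipliers in [low, high] (hypothetical range)."""
    lo, hi = min(rank), max(rank)
    if hi == lo:
        return [1.0] * len(rank)
    return [low + (high - low) * (r - lo) / (hi - lo) for r in rank]
```

A token whose graph node ranks highest would get the largest multiplier, so its initial embedding row would start with a larger norm than rows for low-importance tokens.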

LR Schedule

warmdown

parameters: {"qat_safety_deadline":0.65}

Novel Contributions

Multi-agent swarm that makes training decisions by consensus voting
Knowledge graph-conditioned embedding initialization using a large typed-edge knowledge graph
Rule-based agent roles for QAT timing, KG weighting, gradient health, and MTP weighting
Very low-overhead swarm control integrated into the training loop
Transparent decision log for training interventions