PR #645 (open)

Non-record: Skill Forge — Autonomous ML Experimentation System (Local RTX 4070)

by FlynnCruse

val_bpb: 1.8990
Architecture: Transformer
Optimizer: Muon variants
Artifact Size: (not listed)

Training Techniques

Architecture
  • XSA: Cross-Shaped Attention
  • Partial RoPE: partial rotary positional embeddings
  • SmearGate: SmearGate gating mechanism
  • MLP3x: 3x MLP scaling
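Partial RoPE, as the name suggests, applies rotary embeddings to only a fraction of each attention head's dimensions, leaving the rest position-agnostic. A minimal NumPy sketch; the `rope_frac=0.5` split and the function signature are illustrative assumptions, not values from this submission:

```python
import numpy as np

def partial_rope(x, rope_frac=0.5, base=10000.0):
    """Rotary embeddings on the first `rope_frac` of head dims only;
    the remaining dims pass through unrotated.
    x: (batch, heads, seq, head_dim) query or key tensor."""
    *_, t, d = x.shape
    d_rot = int(d * rope_frac) // 2 * 2            # rotated dims, kept even
    x_rot, x_pass = x[..., :d_rot], x[..., d_rot:]

    # standard RoPE angle table for the rotated slice
    inv_freq = base ** (-np.arange(0, d_rot, 2) / d_rot)
    angles = np.outer(np.arange(t), inv_freq)      # (seq, d_rot // 2)
    cos, sin = np.cos(angles), np.sin(angles)

    # rotate interleaved (even, odd) pairs by the per-position angles
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = np.stack((x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos), axis=-1).reshape(x_rot.shape)
    return np.concatenate((rotated, x_pass), axis=-1)
```

Keeping part of each head free of positional rotation is the usual motivation: those channels can carry purely content-based features.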
Optimizer
  • Muon (weight_decay: 0.04, momentum: 0.99)
  • Variants tried: Muon+, NorMuon, MUD, RMNP, Mousse, AdEMAMix
  • With EMA and warmdown
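The Muon family orthogonalizes the momentum buffer with a Newton-Schulz iteration before applying it to 2-D weights. A minimal sketch using the submission's stated momentum (0.99) and weight decay (0.04); the learning rate, iteration count, and exact variant details are assumptions:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push singular values toward 1)
    via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # scale so singular values <= 1
    tall = G.shape[0] > G.shape[1]
    if tall:
        X = X.T                             # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update on a 2-D weight: momentum accumulation, then an
    orthogonalized update, with decoupled weight decay."""
    buf = momentum * buf + grad
    update = newton_schulz_orth(buf)
    w = w * (1 - lr * weight_decay) - lr * update
    return w, buf
```

The listed variants (NorMuon, AdEMAMix, etc.) modify the momentum and normalization details around this same orthogonalized-update core.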
Quantization
  • int6 QAT (bits: 6, scope: all)
  • GPTQ-lite (bits and scope not specified)
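int6 QAT typically means fake-quantizing weights to a 6-bit signed grid during training so the model adapts to the quantization error. A forward-pass sketch; the symmetric per-channel scaling is an assumption, and the straight-through estimator used for backprop is omitted:

```python
import numpy as np

def fake_quant_int6(w, per_channel_axis=0):
    """Simulate int6 weight quantization for QAT: symmetric per-channel
    max-abs scale, round to the signed 6-bit grid [-32, 31], dequantize.
    In training, round() would pass gradients straight through."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for int6
    reduce_axes = tuple(i for i in range(w.ndim) if i != per_channel_axis)
    amax = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = amax / qmax + 1e-12
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q
```

GPTQ-lite, by contrast, is a post-training method: it quantizes a trained checkpoint using second-order error compensation rather than simulating quantization during training.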
Evaluation
  • Sliding-window evaluation
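Sliding-window evaluation scores a long token stream in overlapping windows, so every scored token keeps substantial left context even though the model only sees fixed-length inputs. The window/stride values and the NLL-to-bpb conversion below are illustrative, not the submission's settings:

```python
import numpy as np

def sliding_window_nll(token_nll_fn, tokens, window=512, stride=256):
    """Average per-token NLL over a long stream: each window covers up
    to `window` tokens, but only the final `stride` tokens are scored,
    so each scored token gets >= window - stride tokens of context.
    token_nll_fn(chunk) -> per-token NLL array (hypothetical stand-in
    for a model forward pass)."""
    total, count, pos = 0.0, 0, 0
    while pos < len(tokens):
        start = max(0, pos + stride - window)
        nll = token_nll_fn(tokens[start:pos + stride])
        scored = nll[pos - start:]          # score only the new tokens
        total += float(np.sum(scored))
        count += len(scored)
        pos += stride
    return total / count

def nll_to_bpb(nll_per_token, tokens_per_byte=0.25):
    """Convert mean NLL (nats/token) to bits per byte, the val_bpb
    metric; tokens_per_byte is an assumed tokenizer statistic."""
    return nll_per_token / np.log(2) * tokens_per_byte
```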
Test-Time Training
  • TTT
Initialization
  • OrthoInit: orthogonal initialization
  • muP scaling: μ-parameterization scaling for initialization
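Orthogonal initialization gives weight matrices orthonormal rows or columns so layers preserve activation norms; μP additionally rescales certain layers as the model widens so hyperparameters transfer across widths. A sketch, where the readout-gain rule and `base_width` are assumptions rather than the submission's recipe:

```python
import numpy as np

def ortho_init(rows, cols, gain=1.0, rng=None):
    """Orthogonal initialization: QR of a Gaussian matrix yields
    orthonormal rows/columns, scaled by `gain`."""
    rng = rng or np.random.default_rng()
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))     # fix QR sign ambiguity
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]

def mup_readout_gain(width, base_width=256):
    """μP-style readout scaling sketch: shrink the output layer's
    initialization by base_width / width as the model widens, keeping
    logit scale width-independent. Full μP also rescales per-layer
    learning rates, which this sketch omits."""
    return base_width / width
```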
Sequence Length
  • train_length: 512
  • eval_length: 512
Weight Averaging
  • EMA
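Weight EMA maintains a smoothed shadow copy of the parameters, updated after each optimizer step and used for evaluation while training continues on the raw weights. A minimal sketch; the decay value is an assumed default, not one stated in the submission:

```python
import numpy as np

class EMAWeights:
    """Exponential moving average of model weights: evaluate with the
    smoothed copy while the optimizer updates the raw parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # shadow copy, initialized to the current weights
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```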

Novel Contributions

  • Skill Forge: an autonomous ML experimentation system that runs autoresearch-style loops to propose, test, and evolve optimization strategies automatically.
  • Use of Claude Code skills to encode domain knowledge, with heuristics evolved into specific playbooks based on experiment results.
  • Integration of deep research from 13+ recent arXiv papers and analysis of all 21 leaderboard submissions to seed domain skills.
  • Meta-layer that evaluates skill effectiveness every 5 experiments and crystallizes heuristics into playbooks.
  • Demonstration of technique transferability from local RTX 4070 scaled-down experiments to full competition scale on 8×H100.
  • Automated outer loop researcher system that modifies training scripts, runs compliant 10-minute experiments, and learns from results.
  • Use of multiple Muon optimizer variants and advanced compression techniques (int6 QAT, GPTQ-lite) validated locally.
  • Handling compute constraints by scaling down model size and sequence length while preserving the relative rankings of techniques.
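The outer loop described above (propose a change, run a compliant 10-minute experiment, learn from the result, crystallize heuristics every 5 experiments) might look roughly like this; every function name, parameter, and the playbook format here is hypothetical, not the submission's actual API:

```python
import json
import time

def outer_loop(propose, run_experiment, n_experiments=20,
               playbook_path="playbook.json", meta_every=5, budget_s=600):
    """Sketch of a Skill Forge-style researcher loop.

    propose(history) -> a proposed training-script change (e.g. a
        one-line hyperparameter edit), informed by past results.
    run_experiment(change, budget_s) -> val_bpb of a time-boxed run.
    """
    history = []
    for step in range(1, n_experiments + 1):
        change = propose(history)
        t0 = time.time()
        val_bpb = run_experiment(change, budget_s=budget_s)
        history.append({"step": step, "change": change,
                        "val_bpb": val_bpb, "wall_s": time.time() - t0})
        if step % meta_every == 0:
            # "crystallize": persist the best-performing changes so far
            best = sorted(history, key=lambda h: h["val_bpb"])[:3]
            with open(playbook_path, "w") as f:
                json.dump({"heuristics": best}, f, indent=2)
    return history
```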