PR #645 (open)

Non-record: Skill Forge — Autonomous ML Experimentation System (Local RTX 4070)

by FlynnCruse

val_bpb: 1.8990
Architecture: Transformer
Optimizer: Muon variants
Artifact Size: (not listed)

Training Techniques

Architecture
  • XSA: Cross-Shaped Attention
  • Partial RoPE: partial rotary positional embeddings
  • SmearGate: SmearGate gating mechanism
  • MLP3x: 3x MLP scaling
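Partial RoPE, as the name suggests, applies rotary embeddings to only a fraction of each attention head's dimensions, leaving the rest position-agnostic. A minimal NumPy sketch; the `rope_frac=0.5` split and the function signature are illustrative assumptions, not values from this submission:

```python
import numpy as np

def partial_rope(x, rope_frac=0.5, base=10000.0):
    """Rotary embeddings on the first `rope_frac` of head dims only;
    the remaining dims pass through unrotated.
    x: (batch, heads, seq, head_dim) query or key tensor."""
    *_, t, d = x.shape
    d_rot = int(d * rope_frac) // 2 * 2            # rotated dims, kept even
    x_rot, x_pass = x[..., :d_rot], x[..., d_rot:]

    # standard RoPE angle table for the rotated slice
    inv_freq = base ** (-np.arange(0, d_rot, 2) / d_rot)
    angles = np.outer(np.arange(t), inv_freq)      # (seq, d_rot // 2)
    cos, sin = np.cos(angles), np.sin(angles)

    # rotate interleaved (even, odd) pairs by the per-position angles
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = np.stack((x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos), axis=-1).reshape(x_rot.shape)
    return np.concatenate((rotated, x_pass), axis=-1)
```

Keeping part of each head free of positional rotation is the usual motivation: those channels can carry purely content-based features.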
Optimizer
  • Muon (weight_decay: 0.04, momentum: 0.99)
  • Variants tried: Muon+, NorMuon, MUD, RMNP, Mousse, AdEMAMix
  • With EMA and warmdown
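The Muon family orthogonalizes the momentum buffer with a Newton-Schulz iteration before applying it to 2-D weights. A minimal sketch using the submission's stated momentum (0.99) and weight decay (0.04); the learning rate, iteration count, and exact variant details are assumptions:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push singular values toward 1)
    via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # scale so singular values <= 1
    tall = G.shape[0] > G.shape[1]
    if tall:
        X = X.T                             # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update on a 2-D weight: momentum accumulation, then an
    orthogonalized update, with decoupled weight decay."""
    buf = momentum * buf + grad
    update = newton_schulz_orth(buf)
    w = w * (1 - lr * weight_decay) - lr * update
    return w, buf
```

The listed variants (NorMuon, AdEMAMix, etc.) modify the momentum and normalization details around this same orthogonalized-update core.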
Quantization
  • int6 QAT (bits: 6, scope: all)
  • GPTQ-lite (bits and scope not specified)
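int6 QAT typically means fake-quantizing weights to a 6-bit signed grid during training so the model adapts to the quantization error. A forward-pass sketch; the symmetric per-channel scaling is an assumption, and the straight-through estimator used for backprop is omitted:

```python
import numpy as np

def fake_quant_int6(w, per_channel_axis=0):
    """Simulate int6 weight quantization for QAT: symmetric per-channel
    max-abs scale, round to the signed 6-bit grid [-32, 31], dequantize.
    In training, round() would pass gradients straight through."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for int6
    reduce_axes = tuple(i for i in range(w.ndim) if i != per_channel_axis)
    amax = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = amax / qmax + 1e-12
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q
```

GPTQ-lite, by contrast, is a post-training method: it quantizes a trained checkpoint using second-order error compensation rather than simulating quantization during training.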
Evaluation
  • Sliding-window evaluation
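Sliding-window evaluation scores a long token stream in overlapping windows, so every scored token keeps substantial left context even though the model only sees fixed-length inputs. The window/stride values and the NLL-to-bpb conversion below are illustrative, not the submission's settings:

```python
import numpy as np

def sliding_window_nll(token_nll_fn, tokens, window=512, stride=256):
    """Average per-token NLL over a long stream: each window covers up
    to `window` tokens, but only the final `stride` tokens are scored,
    so each scored token gets >= window - stride tokens of context.
    token_nll_fn(chunk) -> per-token NLL array (hypothetical stand-in
    for a model forward pass)."""
    total, count, pos = 0.0, 0, 0
    while pos < len(tokens):
        start = max(0, pos + stride - window)
        nll = token_nll_fn(tokens[start:pos + stride])
        scored = nll[pos - start:]          # score only the new tokens
        total += float(np.sum(scored))
        count += len(scored)
        pos += stride
    return total / count

def nll_to_bpb(nll_per_token, tokens_per_byte=0.25):
    """Convert mean NLL (nats/token) to bits per byte, the val_bpb
    metric; tokens_per_byte is an assumed tokenizer statistic."""
    return nll_per_token / np.log(2) * tokens_per_byte
```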
Test-Time Training
  • TTT
Initialization
  • OrthoInit: orthogonal initialization
  • muP scaling: μ-parameterization scaling for initialization
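Orthogonal initialization gives weight matrices orthonormal rows or columns so layers preserve activation norms; μP additionally rescales certain layers as the model widens so hyperparameters transfer across widths. A sketch, where the readout-gain rule and `base_width` are assumptions rather than the submission's recipe:

```python
import numpy as np

def ortho_init(rows, cols, gain=1.0, rng=None):
    """Orthogonal initialization: QR of a Gaussian matrix yields
    orthonormal rows/columns, scaled by `gain`."""
    rng = rng or np.random.default_rng()
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))     # fix QR sign ambiguity
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]

def mup_readout_gain(width, base_width=256):
    """μP-style readout scaling sketch: shrink the output layer's
    initialization by base_width / width as the model widens, keeping
    logit scale width-independent. Full μP also rescales per-layer
    learning rates, which this sketch omits."""
    return base_width / width
```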
Sequence Length
  • train_length: 512
  • eval_length: 512
Weight Averaging
  • EMA
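Weight EMA maintains a smoothed shadow copy of the parameters, updated after each optimizer step and used for evaluation while training continues on the raw weights. A minimal sketch; the decay value is an assumed default, not one stated in the submission:

```python
import numpy as np

class EMAWeights:
    """Exponential moving average of model weights: evaluate with the
    smoothed copy while the optimizer updates the raw parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # shadow copy, initialized to the current weights
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```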

Novel Contributions

  • Skill Forge: an autonomous ML experimentation system that runs autoresearch-style loops to propose, test, and evolve optimization strategies automatically.
  • Use of Claude Code skills to encode domain knowledge, with heuristics evolved into specific playbooks based on experiment results.
  • Integration of deep research from 13+ recent arXiv papers and analysis of all 21 leaderboard submissions to seed domain skills.
  • Meta-layer that evaluates skill effectiveness every 5 experiments and crystallizes heuristics into playbooks.
  • Demonstration of technique transferability from local RTX 4070 scaled-down experiments to full competition scale on 8×H100.
  • Automated outer loop researcher system that modifies training scripts, runs compliant 10-minute experiments, and learns from results.
  • Use of multiple Muon optimizer variants and advanced compression techniques (int6 QAT, GPTQ-lite) validated locally.
  • Handling compute constraints by scaling down model size and sequence length while preserving the relative rankings of techniques.
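The outer loop described above (propose a change, run a compliant 10-minute experiment, learn from the result, crystallize heuristics every 5 experiments) might look roughly like this; every function name, parameter, and the playbook format here is hypothetical, not the submission's actual API:

```python
import json
import time

def outer_loop(propose, run_experiment, n_experiments=20,
               playbook_path="playbook.json", meta_every=5, budget_s=600):
    """Sketch of a Skill Forge-style researcher loop.

    propose(history) -> a proposed training-script change (e.g. a
        one-line hyperparameter edit), informed by past results.
    run_experiment(change, budget_s) -> val_bpb of a time-boxed run.
    """
    history = []
    for step in range(1, n_experiments + 1):
        change = propose(history)
        t0 = time.time()
        val_bpb = run_experiment(change, budget_s=budget_s)
        history.append({"step": step, "change": change,
                        "val_bpb": val_bpb, "wall_s": time.time() - t0})
        if step % meta_every == 0:
            # "crystallize": persist the best-performing changes so far
            best = sorted(history, key=lambda h: h["val_bpb"])[:3]
            with open(playbook_path, "w") as f:
                json.dump({"heuristics": best}, f, indent=2)
    return history
```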