PR #1629

open

Notable Non-Record: Switched Deep Supervision (first DS submission)

by channyzf6
val_bpb
1.0829
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,997,104 bytes

Training Techniques

Architecture
weight tying
Shared LM head / tied embedding reused for auxiliary deep supervision losses.
parameters: null
depth recurrence
Loops layers 3-5 three times, activating at 35% of the way through training.
parameters: {"layers":[3,4,5],"repeats":3,"activate_at_frac":0.35}
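Read literally, the parameters say: once 35% of training has elapsed, the block of layers 3-5 is traversed three times per forward pass instead of once. A minimal sketch of that control flow (the layer representation and step bookkeeping are illustrative, not from the submission):

```python
def forward_with_recurrence(x, layers, step, total_steps,
                            loop_layers=(3, 4, 5), repeats=3,
                            activate_at_frac=0.35):
    """Apply `layers` in order; after the activation point in training,
    run the looped block `repeats` times instead of once."""
    active = step / total_steps >= activate_at_frac
    i = 0
    while i < len(layers):
        if active and i == loop_layers[0]:
            block = layers[loop_layers[0]:loop_layers[-1] + 1]
            for _ in range(repeats):
                for layer in block:
                    x = layer(x)
            i = loop_layers[-1] + 1  # skip past the looped block
        else:
            x = layers[i](x)
            i += 1
    return x
```

With six increment-by-one "layers", the recurrence turns 6 layer applications into 3 + 3*3 = 12 once active.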
XSA
Uses XSA attention on all layers.
parameters: {"layers":11}
LeakyReLU
MLP activation uses LeakyReLU squared.
parameters: {"slope":0.5}
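One plausible reading of "LeakyReLU squared" with slope 0.5 (by analogy with the ReLU^2 activation common in speedrun baselines) is to square the LeakyReLU output; note this maps negative inputs to positive values. A scalar sketch:

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU: identity for x >= 0, slope * x otherwise."""
    return x if x >= 0 else slope * x

def leaky_relu_squared(x, slope=0.5):
    # Assumed reading: y = LeakyReLU(x)**2, so a negative input x
    # contributes (slope * x)**2, e.g. x=-2 -> (-1)**2 = 1.
    y = leaky_relu(x, slope)
    return y * y
```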
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
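The EMA entry is the standard exponential moving average over weights with the listed decay; the flat-list parameter representation below is illustrative:

```python
def ema_update(avg_params, new_params, decay=0.9965):
    """One EMA step per training iteration:
    avg <- decay * avg + (1 - decay) * new."""
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(avg_params, new_params)]
```

The averaged copy, not the raw weights, is what gets evaluated (and here, quantized into the artifact).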
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"epochs":3}
Quantization
GPTQ
bits: 6
scope: MLP and attention weights
GPTQ
bits: 7
scope: embeddings
Compression
brotli
level: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
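Assuming warmdown_frac follows the usual speedrun convention (the fraction of total steps spent linearly decaying the learning rate to zero at the end of training), the schedule multiplier would look like:

```python
def lr_scale(step, total_steps, warmdown_frac=0.72):
    """Constant LR, then linear warmdown to zero over the final
    `warmdown_frac` of training (assumed convention)."""
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

So with 1000 total steps, the LR is flat for the first 280 steps and hits zero at step 1000.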
Other
other
Switched Deep Supervision: randomly selects one intermediate layer per step for auxiliary cross-entropy supervision through the shared LM head.
parameters: {"layers":[6,7,9],"alpha":0.01,"warmup_steps":200,"decay_start_frac":0.7,"decay_end_frac":0.85}
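Putting the listed parameters together: each step, one layer from {6, 7, 9} is drawn at random, its hidden state is projected through the shared (tied) LM head, and the resulting cross-entropy is added to the main loss with weight alpha, which ramps up over warmup_steps and decays to zero between 70% and 85% of training. A sketch of the layer switch and a linear-warmup/linear-decay reading of the alpha schedule (the exact ramp shapes are assumptions):

```python
import random

def sds_alpha(step, total_steps, alpha=0.01, warmup_steps=200,
              decay_start_frac=0.7, decay_end_frac=0.85):
    """Auxiliary-loss weight: linear warmup, flat plateau, then linear
    decay to zero between the two training fractions (assumed shapes)."""
    if step < warmup_steps:
        return alpha * step / warmup_steps
    frac = step / total_steps
    if frac < decay_start_frac:
        return alpha
    if frac >= decay_end_frac:
        return 0.0
    return alpha * (decay_end_frac - frac) / (decay_end_frac - decay_start_frac)

def pick_supervised_layer(layers=(6, 7, 9), rng=random):
    """The 'switched' part: exactly one intermediate layer per step
    receives the auxiliary cross-entropy loss."""
    return rng.choice(layers)

# Per step (pseudocode): loss = main_ce
#   + sds_alpha(step, total_steps) * ce(lm_head(h[pick_supervised_layer()]), targets)
# where lm_head is the tied embedding, so no new parameters are added.
```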

Novel Contributions

  • Switched Deep Supervision with randomly selected single-layer auxiliary supervision each step
  • Deep supervision via shared LM head with zero new parameters
  • Fraction-based DS alpha decay schedule
  • Per-layer adaptive GPTQ with int7 embeddings to fit the 16 MB limit
  • Documented negative results for predictive coding and multi-token prediction variants