PR #1629

open

Notable Non-Record: Switched Deep Supervision (first DS submission)

by channyzf6
val_bpb
1.0829
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,997,104 bytes

Training Techniques

Architecture
weight tying
Shared LM head / tied embedding reused for auxiliary deep supervision losses.
parameters: null
depth recurrence
Loops layers 3-5 three times, activating at 35% of the way through training.
parameters: {"layers":[3,4,5],"repeats":3,"activate_at_frac":0.35}
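Read literally, the parameters say: once 35% of training has elapsed, the block of layers 3-5 is traversed three times per forward pass instead of once. A minimal sketch of that control flow (the layer representation and step bookkeeping are illustrative, not from the submission):

```python
def forward_with_recurrence(x, layers, step, total_steps,
                            loop_layers=(3, 4, 5), repeats=3,
                            activate_at_frac=0.35):
    """Apply `layers` in order; after the activation point in training,
    run the looped block `repeats` times instead of once."""
    active = step / total_steps >= activate_at_frac
    i = 0
    while i < len(layers):
        if active and i == loop_layers[0]:
            block = layers[loop_layers[0]:loop_layers[-1] + 1]
            for _ in range(repeats):
                for layer in block:
                    x = layer(x)
            i = loop_layers[-1] + 1  # skip past the looped block
        else:
            x = layers[i](x)
            i += 1
    return x
```

With six increment-by-one "layers", the recurrence turns 6 layer applications into 3 + 3*3 = 12 once active.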
XSA
Uses XSA attention on all layers.
parameters: {"layers":11}
LeakyReLU
MLP activation uses LeakyReLU squared.
parameters: {"slope":0.5}
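One plausible reading of "LeakyReLU squared" with slope 0.5 (by analogy with the ReLU^2 activation common in speedrun baselines) is to square the LeakyReLU output; note this maps negative inputs to positive values. A scalar sketch:

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU: identity for x >= 0, slope * x otherwise."""
    return x if x >= 0 else slope * x

def leaky_relu_squared(x, slope=0.5):
    # Assumed reading: y = LeakyReLU(x)**2, so a negative input x
    # contributes (slope * x)**2, e.g. x=-2 -> (-1)**2 = 1.
    y = leaky_relu(x, slope)
    return y * y
```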
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
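The EMA entry is the standard exponential moving average over weights with the listed decay; the flat-list parameter representation below is illustrative:

```python
def ema_update(avg_params, new_params, decay=0.9965):
    """One EMA step per training iteration:
    avg <- decay * avg + (1 - decay) * new."""
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(avg_params, new_params)]
```

The averaged copy, not the raw weights, is what gets evaluated (and here, quantized into the artifact).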
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"epochs":3}
Quantization
GPTQ
bits: 6
scope: MLP and attention weights
GPTQ
bits: 7
scope: embeddings
Compression
brotli
level: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
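Assuming warmdown_frac follows the usual speedrun convention (the fraction of total steps spent linearly decaying the learning rate to zero at the end of training), the schedule multiplier would look like:

```python
def lr_scale(step, total_steps, warmdown_frac=0.72):
    """Constant LR, then linear warmdown to zero over the final
    `warmdown_frac` of training (assumed convention)."""
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

So with 1000 total steps, the LR is flat for the first 280 steps and hits zero at step 1000.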
Other
other
Switched Deep Supervision: randomly selects one intermediate layer per step for auxiliary cross-entropy supervision through the shared LM head.
parameters: {"layers":[6,7,9],"alpha":0.01,"warmup_steps":200,"decay_start_frac":0.7,"decay_end_frac":0.85}
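Putting the listed parameters together: each step, one layer from {6, 7, 9} is drawn at random, its hidden state is projected through the shared (tied) LM head, and the resulting cross-entropy is added to the main loss with weight alpha, which ramps up over warmup_steps and decays to zero between 70% and 85% of training. A sketch of the layer switch and a linear-warmup/linear-decay reading of the alpha schedule (the exact ramp shapes are assumptions):

```python
import random

def sds_alpha(step, total_steps, alpha=0.01, warmup_steps=200,
              decay_start_frac=0.7, decay_end_frac=0.85):
    """Auxiliary-loss weight: linear warmup, flat plateau, then linear
    decay to zero between the two training fractions (assumed shapes)."""
    if step < warmup_steps:
        return alpha * step / warmup_steps
    frac = step / total_steps
    if frac < decay_start_frac:
        return alpha
    if frac >= decay_end_frac:
        return 0.0
    return alpha * (decay_end_frac - frac) / (decay_end_frac - decay_start_frac)

def pick_supervised_layer(layers=(6, 7, 9), rng=random):
    """The 'switched' part: exactly one intermediate layer per step
    receives the auxiliary cross-entropy loss."""
    return rng.choice(layers)

# Per step (pseudocode): loss = main_ce
#   + sds_alpha(step, total_steps) * ce(lm_head(h[pick_supervised_layer()]), targets)
# where lm_head is the tied embedding, so no new parameters are added.
```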

Novel Contributions

  • Switched Deep Supervision with randomly selected single-layer auxiliary supervision each step
  • Deep supervision via shared LM head with zero new parameters
  • Fraction-based DS alpha decay schedule
  • Per-layer adaptive GPTQ with int7 embeddings to fit the 16 MB limit
  • Documented negative results for predictive coding and multi-token prediction variants