PR #1047

open

(0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal 10mb. .77bpb single seed.

by newjordan
val_bpb: 0.8822
Architecture: Hybrid
Optimizer: (not listed)
Artifact Size: ~9.9MB

Training Techniques

Architecture
depth recurrence
4 flat layers plus 1 crawler layer repeated across 4 loops (Frugendorff compression).
parameters: {"layers":4,"crawler_layers":1,"loops":4}
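
The layer counts above can be illustrated with a minimal sketch of depth recurrence. It assumes the flat layers run once and the crawler layer is then re-applied with shared weights on each loop; the ordering, the toy scalar "layers", and the names `make_layer` and `forward` are hypothetical and not taken from the PR.

```python
from typing import Callable, List

Layer = Callable[[float], float]


def make_layer(scale: float, shift: float) -> Layer:
    """Toy stand-in for a network block: an affine map on a scalar."""
    def layer(x: float) -> float:
        return scale * x + shift
    return layer


def forward(x: float, flat: List[Layer], crawler: List[Layer], loops: int) -> float:
    # Flat layers run once each, in order.
    for layer in flat:
        x = layer(x)
    # Crawler layers reuse the same weights on every loop.
    for _ in range(loops):
        for layer in crawler:
            x = layer(x)
    return x


flat_layers = [make_layer(1.0, 1.0) for _ in range(4)]  # "layers": 4
crawler_layers = [make_layer(2.0, 0.0)]                 # "crawler_layers": 1
LOOPS = 4                                               # "loops": 4

# Stored blocks versus blocks applied per forward pass.
unique_blocks = len(flat_layers) + len(crawler_layers)
effective_depth = len(flat_layers) + len(crawler_layers) * LOOPS
result = forward(0.0, flat_layers, crawler_layers, LOOPS)
```

Under these assumptions the model stores 5 blocks but applies 8 per forward pass, which is where the parameter savings come from.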
DeltaNet
Uses DeltaNet heads backed by the canonical chunk_delta_rule kernel from fla.ops.delta_rule.
parameters: {"heads":4}
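
For reference, the delta rule that DeltaNet heads are built on can be written as a naive sequential recurrence. This is a sketch for a single head in pure Python, not the chunked kernel from fla.ops.delta_rule; details such as query scaling and key normalization are omitted, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def delta_rule_step(S: Matrix, k: Vector, v: Vector, beta: float) -> Matrix:
    """One update S <- S + beta * (v - S k) k^T, with S of shape (d_v, d_k)."""
    # Value currently stored under key k.
    pred = [sum(S[i][j] * k[j] for j in range(len(k))) for i in range(len(S))]
    # Move the stored value toward v by a fraction beta.
    return [
        [S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(len(k))]
        for i in range(len(S))
    ]


def read(S: Matrix, q: Vector) -> Vector:
    """Output o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


def delta_rule_recurrent(qs, ks, vs, betas, d_v: int, d_k: int) -> List[Vector]:
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, beta in zip(qs, ks, vs, betas):
        S = delta_rule_step(S, k, v, beta)
        outputs.append(read(S, q))
    return outputs


# Writing twice to the same unit-norm key with beta = 1 overwrites the value.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[3.0, 4.0], [5.0, 6.0]]
outs = delta_rule_recurrent(keys, keys, values, [1.0, 1.0], d_v=2, d_k=2)
```

The overwrite behaviour in the example is what distinguishes the delta rule from plain additive linear attention, which would accumulate both values.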
Quantization
GPTQ
bits: 6
scope: 41 layers
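
To give a sense of what 6-bit storage means, here is a plain round-to-nearest quantizer with one scale per row. This is deliberately not GPTQ: it leaves out GPTQ's calibration-based error compensation, and the symmetric per-row scaling and function names are assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_row(weights: List[float], bits: int = 6) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight row."""
    qmax = 2 ** (bits - 1) - 1   # 31 for 6 bits
    qmin = -(2 ** (bits - 1))    # -32 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_row(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


row = [0.31, -0.12, 0.05, -0.31]
codes, scale = quantize_row(row)
restored = dequantize_row(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each code fits in 6 bits, and the reconstruction error per weight is bounded by half the scale.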
Compression
zstd
level: null
Weight Averaging
EMA
parameters: {"start_step":4400,"decay":0.99}
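
A minimal sketch of how these EMA parameters might be applied, assuming the average is initialised from the live weights at start_step and updated once per step afterwards. The function name `update_ema` and the list-of-floats weight representation are hypothetical.

```python
from typing import List, Optional


def update_ema(
    ema: Optional[List[float]],
    weights: List[float],
    step: int,
    start_step: int = 4400,
    decay: float = 0.99,
) -> Optional[List[float]]:
    """Return the updated EMA weights, or None before averaging starts."""
    if step < start_step:
        return None
    if ema is None:
        # First averaged step: copy the current weights.
        return list(weights)
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = update_ema(None, [0.5], step=4399)   # averaging not started yet
ema = update_ema(ema, [1.0], step=4400)    # initialised from live weights
ema = update_ema(ema, [2.0], step=4401)    # 0.99 * 1.0 + 0.01 * 2.0
```

With decay 0.99 the average has an effective horizon of roughly 100 steps, so it mostly smooths the tail of training.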
Evaluation
sliding window eval
parameters: null
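
Since no parameters are listed, the following is only a generic sketch of sliding window evaluation: overlapping windows are fed to the model, but each token's loss is counted once, so most tokens are scored with earlier context available. The window and stride values and the function names are hypothetical; the bits-per-byte conversion is the standard one from summed negative log-likelihood in nats.

```python
import math
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (begin, end, score_from) spans.

    The model sees tokens[begin:end]; only tokens[score_from:end] count
    toward the loss. Requires stride <= window.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


spans = sliding_windows(n_tokens=10, window=4, stride=2)
scored = [t for _, end, score_from in spans for t in range(score_from, end)]
```

Compared with non-overlapping chunks, this costs more forward passes but avoids scoring tokens at the start of a chunk with almost no context.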
LR Schedule
warmdown
parameters: {"warmdown_iters":2000}
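
One common reading of a warmdown schedule is a constant learning rate followed by a linear decay to zero over the final warmdown_iters steps. The sketch below assumes that shape and a step-based schedule; the total step count is a made-up value and the PR may instead drive the warmdown from elapsed time.

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int = 2000) -> float:
    """Multiplier on the base learning rate at a given step."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / warmdown_iters)


TOTAL_STEPS = 6000  # hypothetical; not stated in the PR
scales = [lr_scale(s, TOTAL_STEPS) for s in (0, 4000, 5000, 6000)]
```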
Other
other
Training stops early by a reserved interval (gptq_reserve_ms) so that GPTQ calibration completes within the 600s wallclock budget.
parameters: {"gptq_reserve_ms":30000}
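
The timing arithmetic can be sketched as follows, assuming the 600s budget covers both training and calibration and that the training loop checks elapsed time each step. The function names are hypothetical.

```python
def training_deadline_ms(budget_ms: int = 600_000, gptq_reserve_ms: int = 30_000) -> int:
    """Latest elapsed time at which training may still be running."""
    return budget_ms - gptq_reserve_ms


def should_stop_training(
    elapsed_ms: float, budget_ms: int = 600_000, gptq_reserve_ms: int = 30_000
) -> bool:
    """True once the remaining budget is needed for GPTQ calibration."""
    return elapsed_ms >= training_deadline_ms(budget_ms, gptq_reserve_ms)


deadline = training_deadline_ms()  # 570_000 ms of training, 30_000 ms reserved
```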

Novel Contributions

  • Legal resubmission fixing GPTQ calibration timing to stay within the 600s wallclock budget
  • DeltaNet crawler architecture with 4 flat layers plus 1 crawler layer repeated over 4 loops
  • Loop-aware two-phase GPTQ calibration for 41 layers
  • EMA-based post-training improvement with reported 3-seed mean BPB of 0.8822
  • Use of canonical DeltaNet kernel chunk_delta_rule