PR #1028

open

Medusa: Unstable — DeltaNet Crawler, 0.8104 BPB (best seed), mean 0.9984, 10MB file size, Frugendorff continuation

by newjordan
val_bpb
0.8104
Architecture
Transformer
Optimizer
Artifact Size
10MB

Training Techniques

Architecture
DeltaNet
Uses canonical chunk_delta_rule DeltaNet heads inside a Frugendorff crawler topology.
parameters: {"heads":4,"short_conv":true,"loops":4,"flat_layers":4,"crawler_layers":1}
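A chunked kernel like chunk_delta_rule evaluates the same recurrence the sequential delta rule defines, just in parallel blocks. A minimal unchunked sketch of that per-head recurrence (head sizes and the numpy framing are illustrative, not from the PR):

```python
import numpy as np

def delta_rule(q, k, v, beta):
    """Sequential form of the delta-rule recurrence one DeltaNet head
    computes. q, k: (T, d_k); v: (T, d_v); beta: (T,) per-token write
    strengths in [0, 1]. The fast-weight state S maps keys to values."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.empty((len(q), d_v))
    for t in range(len(q)):
        kt = k[t]
        # erase what S currently stores at key kt, then write beta[t]*v[t]
        S = S - beta[t] * np.outer(S @ kt, kt) + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

With beta = 1 and orthonormal keys this behaves like exact key-value storage: querying a previously written key returns its value.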
RoPE
Uses RoPE dimensions as part of the model configuration.
parameters: {"dimensions":16}
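With "dimensions": 16, only the first 16 dims of each head are rotated and the rest pass through unrotated. A hedged sketch using the half-split (GPT-NeoX-style) pairing; the base frequency and pairing convention are common defaults, not confirmed by the PR:

```python
import numpy as np

def apply_partial_rope(x, rope_dims=16, base=10000.0):
    """Rotary embedding on the first `rope_dims` dims of each head.

    x: (T, d) per-head activations. Dims i and i+rope_dims/2 form a
    pair rotated by a position-dependent angle; dims beyond rope_dims
    are returned unchanged."""
    half = rope_dims // 2
    pos = np.arange(x.shape[0])[:, None]          # (T, 1)
    inv_freq = base ** (-np.arange(half) / half)  # (half,)
    ang = pos * inv_freq                          # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=1)
```

Because each pair is a plane rotation, the norm of the rotated slice is preserved and position 0 is left unchanged.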
BigramHash
Uses a bigram vocabulary/hash-style component in the architecture.
parameters: {"vocab_size":2048}
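One common reading of a bigram hash component: each (previous, current) token pair is hashed into a fixed table of 2048 buckets whose embeddings are added alongside the normal token embedding, with collisions accepted. The mixing constant and start-of-sequence sentinel below are illustrative guesses, not from the PR:

```python
def bigram_hash_ids(token_ids, vocab_size=2048, mult=0x9E3779B1):
    """Map each (prev, cur) token pair to a bucket in [0, vocab_size).

    Same bigram -> same bucket, so a learned embedding table of
    `vocab_size` rows can be indexed by the returned ids."""
    ids = []
    prev = 0  # sentinel id for the first position (assumption)
    for cur in token_ids:
        h = (prev * mult + cur) & 0xFFFFFFFF  # cheap 32-bit mix
        ids.append(h % vocab_size)
        prev = cur
    return ids
```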
Quantization
int6
bits: 6
scope: model weights
GPTQ
bits: null
scope: 41 layers
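Int6 gives integer codes in [-32, 31]. A plain round-to-nearest baseline sketch; the PR actually applies GPTQ on 41 layers, which instead picks codes to minimize layer output error under a Hessian, and the per-tensor symmetric scaling here is an assumption:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 round-to-nearest.

    Returns integer codes in [-32, 31] and a float scale;
    reconstruct weights as codes * scale."""
    scale = np.abs(w).max() / 31.0
    codes = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale
```

Round-to-nearest bounds the per-weight reconstruction error by half a quantization step (scale / 2).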
Compression
zstd
level: null
Weight Averaging
EMA
parameters: {"start_step":4400,"decay":0.99}
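The late-start EMA (start_step 4400, decay 0.99) simply tracks the raw weights until the start step, then applies the standard exponential moving average. A dict-of-floats sketch:

```python
def ema_update(ema, params, step, start_step=4400, decay=0.99):
    """Late-start EMA: a no-op copy of the raw weights before
    start_step, then ema <- decay * ema + (1 - decay) * params.
    `ema` and `params` are dicts of floats for clarity."""
    if step < start_step:
        return dict(params)  # EMA not active yet; track raw weights
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}
```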
Evaluation
sliding window eval
parameters: null
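The parameters are unspecified, but sliding-window eval typically scores each token exactly once while giving it up to a full window of left context. A sketch with illustrative window/stride values, where nll_fn(ctx, targets) returns summed nats for the targets given the context:

```python
import math

def sliding_window_bpb(nll_fn, token_ids, n_bytes, window=512, stride=256):
    """Bits-per-byte with a sliding window.

    Each step scores tokens [start, start + stride) with context
    reaching back up to `window` positions, so every token is scored
    once with as much context as the window allows."""
    total_nats = 0.0
    for start in range(0, len(token_ids), stride):
        end = min(start + stride, len(token_ids))
        ctx_lo = max(0, end - window)
        total_nats += nll_fn(token_ids[ctx_lo:start], token_ids[start:end])
    return total_nats / math.log(2) / n_bytes
```

As a sanity check, a uniform byte-level model (1/256 per token, one byte per token) should score exactly 8 bits per byte.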
LR Schedule
warmdown
parameters: {"iters":2000}
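Warmdown with iters=2000 reads as a constant LR followed by a linear decay to zero over the final 2000 steps. The total step count and base LR below are placeholders, not from the PR:

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=2000):
    """Constant LR, then a linear warmdown to zero over the last
    `warmdown_iters` steps."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_iters  # 1 -> 0 over warmdown
    return base_lr * frac
```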
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Loop-aware GPTQ with quantized-flat activations and crawler Hessians.
parameters: {"enabled":true}
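The loop-aware pass described above can be read as: quantize the flat layers first, then rebuild the crawler layers' GPTQ Hessians from the already-quantized flat activations, so crawler quantization accounts for the error introduced upstream. The statistic shared by both phases is the per-layer Hessian proxy H = (2/N) XᵀX; a sketch of just that step (the normalization is a conventional choice, not from the PR):

```python
import numpy as np

def layer_hessian(acts):
    """GPTQ second-order statistic for one linear layer.

    acts: (N, d_in) calibration activations feeding the layer.
    Returns H = (2/N) * acts.T @ acts, the quadratic form GPTQ uses
    to weight per-column quantization error."""
    n = acts.shape[0]
    return 2.0 / n * acts.T @ acts
```

In the loop-aware variant, `acts` for a crawler layer would be re-collected after the flat layers' weights have been replaced by their int6 codes.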

Novel Contributions

  • DeltaNet crawler topology with canonical chunk_delta_rule heads
  • Loop-aware GPTQ using flat Hessians first, then crawler Hessians with quantized-flat activations
  • Late-start EMA re-initialized at warmdown onset
  • High-variance multi-seed submission with best seed 0.8104 BPB
  • Int6 + zstd artifact compression