PR #1849

open

Draft Base PR #1493 + pruning + q-symbol entropy coding compute-grant checkpoint

by VedantKmr0
val_bpb
1.0850
Architecture
Transformer
Optimizer
Artifact Size
16,034,750 bytes

Training Techniques

Weight Averaging
EMA
parameters: null
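The EMA entry above lists no parameters; a minimal sketch of exponential moving averaging of model weights, with a decay value chosen purely for illustration (not taken from the PR):

```python
def ema_update(avg_weights, new_weights, decay=0.999):
    """Blend the running average toward the latest weights.

    avg <- decay * avg + (1 - decay) * new, element-wise.
    decay=0.999 is a common default, not a value from the PR.
    """
    return [decay * a + (1.0 - decay) * w
            for a, w in zip(avg_weights, new_weights)]

avg = [0.0, 0.0]
for step in range(3):
    current = [1.0, 2.0]  # stand-in for post-optimizer-step weights
    avg = ema_update(avg, current, decay=0.5)  # large (1 - decay) so the effect is visible
# after 3 updates with decay=0.5: avg = [0.875, 1.75]
```

At eval time the averaged weights are swapped in place of the live training weights, which typically smooths out late-training noise.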
Evaluation
sliding window eval
parameters: null
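Sliding window eval likewise carries no parameters; a hedged sketch of the usual scheme, where a long token stream is scored in overlapping windows so every token gets extra left context, and each token is counted exactly once. The window/stride sizes and the `nll_fn` interface are illustrative assumptions:

```python
def sliding_window_eval(tokens, nll_fn, window=8, stride=4):
    """Mean bits per token over `tokens`.

    Windows of length `window` advance by `stride`; only tokens not
    already scored by an earlier window are counted, each with up to
    `window - stride` tokens of extra left context.
    nll_fn(context, target) -> negative log2-likelihood of `target`.
    """
    total_bits, counted = 0.0, 0
    prev_end = 0
    for end in range(stride, len(tokens) + stride, stride):
        end = min(end, len(tokens))
        start = max(0, end - window)
        for i in range(prev_end, end):
            total_bits += nll_fn(tokens[start:i], tokens[i])
            counted += 1
        prev_end = end
    return total_bits / counted
```

With a real model, `nll_fn` would run one forward pass per window rather than per token; the loop above only shows the accounting.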
Test-Time Training
TTT
parameters: null
Quantization
GPTQ
bits: 6
scope: attention Q/K/V/O and MLP fc/proj weights
GPTQ
bits: 8
scope: token embedding
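For orientation, the sketch below shows only the round-to-nearest symmetric grid that 6-bit quantization targets. GPTQ proper quantizes columns sequentially and compensates rounding error using second-order (Hessian) information, which this sketch omits entirely; per-row scaling is also an assumption, not a detail stated in the PR:

```python
def quantize_rtn(row, bits=6):
    """Round-to-nearest symmetric quantization of one weight row.

    Returns (q_symbols, scale) with integer symbols in
    [-(2**(bits-1)), 2**(bits-1) - 1]. This is only the target grid;
    GPTQ's error-compensation step is not shown.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax or 1.0  # guard all-zero rows
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```

The integer symbols produced here are also what the q-symbol entropy coding under Compression would operate on.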
Regularization
structured pruning
parameters: {"target":"MLP hidden channels","strategy":"per-block capped pruning","ablation":"zero selected fc rows and matching proj columns"}
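The zero-ablation described in the parameters above (zero selected fc rows and matching proj columns) can be sketched as a paired mask; the weight layouts assumed here (fc as [hidden, d_model], proj as [d_model, hidden]) are my reading, not code from the PR:

```python
def zero_ablate(fc, proj, channels):
    """Zero MLP hidden channels in place for mask evaluation.

    fc:   [hidden, d_model] list-of-lists; row c produces channel c.
    proj: [d_model, hidden] list-of-lists; column c consumes channel c.
    Zeroing row c of fc and column c of proj makes channel c inert,
    so a candidate pruning mask can be scored without resizing weights.
    """
    for c in channels:
        fc[c] = [0.0] * len(fc[c])
        for row in proj:
            row[c] = 0.0
    return fc, proj
```

Because the channel is zeroed rather than removed, the same checkpoint can be re-evaluated under different masks before committing to a structural resize.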
structured pruning
parameters: {"target":"MLP hidden channels","strategy":"soft-cap pruning","score_weights":{"activation_weighted_score":0.7,"norm_score":0.3},"local_rank_weight":0.75,"cap_multiplier":1.75,"floor_multiplier":0}
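One way to read the soft-cap parameters above: each channel's score blends an activation-weighted term (0.7) with a weight-norm term (0.3); its pruning priority blends within-block rank with global rank (local weight 0.75); and no block may lose more than cap_multiplier (1.75) times its proportional share. The sketch below follows that reading, which is my interpretation of the JSON, not code from the PR:

```python
def softcap_select(act_scores, norm_scores, blocks, n_prune,
                   w_act=0.7, w_norm=0.3,
                   local_rank_weight=0.75, cap_multiplier=1.75):
    """Pick `n_prune` channel indices to prune under per-block soft caps.

    act_scores/norm_scores: per-channel importance (lower = prune first).
    blocks: block id for each channel.
    """
    n = len(act_scores)
    score = [w_act * a + w_norm * b for a, b in zip(act_scores, norm_scores)]

    # Rank channels globally and within their own block (0 = least important).
    global_rank = {i: r for r, i in
                   enumerate(sorted(range(n), key=lambda i: score[i]))}
    local_rank = {}
    for blk in set(blocks):
        members = sorted((i for i in range(n) if blocks[i] == blk),
                         key=lambda i: score[i])
        for r, i in enumerate(members):
            local_rank[i] = r

    # Blend ranks; lower priority value = pruned earlier.
    priority = {i: local_rank_weight * local_rank[i]
                   + (1 - local_rank_weight) * global_rank[i]
                for i in range(n)}

    # Soft cap: at most cap_multiplier x the proportional per-block share.
    cap = int(cap_multiplier * n_prune / len(set(blocks)))
    taken = {blk: 0 for blk in set(blocks)}
    chosen = []
    for i in sorted(range(n), key=lambda i: priority[i]):
        if len(chosen) == n_prune:
            break
        if taken[blocks[i]] < cap:
            chosen.append(i)
            taken[blocks[i]] += 1
    return sorted(chosen)
```

Blending local rank into the priority is what prevents all pruning from collapsing into the single block with the globally weakest channels, which is the failure mode the per-block cap guards against.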
Compression
Brotli
level: null
rANS
level: null
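Read together with the q-symbol contribution listed below, the rANS entry suggests a static-model size estimate for the quantized symbol stream. A minimal sketch of that estimate follows; the actual symbol model used in the PR is not specified, and coder table/header overhead is ignored here:

```python
from collections import Counter
from math import log2

def entropy_coded_bytes(symbols):
    """Estimate the coded size of `symbols` under a static entropy model.

    An ideal arithmetic/rANS coder with a fixed symbol distribution
    spends about -log2 p(s) bits per symbol, so the total is the
    empirical cross-entropy of the stream.
    """
    counts = Counter(symbols)
    total = len(symbols)
    bits = -sum(c * log2(c / total) for c in counts.values())
    return bits / 8.0

# Uniform over 4 symbols -> 2 bits/symbol -> 1 byte for 4 symbols.
```

Quantized weights are heavily peaked around zero, so this estimate typically lands well below the raw 6 bits per symbol, which is what makes entropy coding worthwhile after quantization.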

Novel Contributions

  • Per-block capped structured pruning of MLP hidden channels to avoid pruning collapse into a single late block
  • Soft-cap pruning that blends within-block and global channel ranking with relaxed per-block caps
  • Static arithmetic/rANS-style q-symbol entropy coding estimates for quantized weights
  • Combination of pruning and entropy coding as a path toward smaller non-record checkpoints
  • Zero-ablation evaluation workflow for structured pruning masks