PR #1849 (Open, Draft)
Base PR #1493 + pruning + q-symbol entropy coding compute-grant checkpoint
by VedantKmr0
val_bpb: 1.0850
Architecture: Transformer
Optimizer: —
Artifact Size: 16,034,750 bytes
Training Techniques
- Weight Averaging: EMA (no parameters recorded; sketched below)
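
The PR records no EMA parameters, so the following is only a minimal sketch of EMA weight averaging over model parameters; the decay value is an assumed placeholder, and buffers are omitted for brevity.

```python
# Minimal EMA weight-averaging sketch. The decay value is an assumption;
# the PR records no EMA parameters. Buffers are omitted for brevity.
import copy
import torch


class EMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy holds the running average of the weights.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * model
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)
```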
Evaluation
- Sliding-window evaluation (no parameters recorded; sketched below)
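
No window or stride is recorded; below is a minimal sketch of sliding-window bits-per-byte scoring, assuming a 1024-token window, a 512-token stride, and a model that maps token ids to logits. Only tokens not already scored by the previous window contribute to the loss.

```python
# Sliding-window val_bpb sketch. Window/stride values and the model
# interface are assumptions; the PR records no evaluation parameters.
import math
import torch
import torch.nn.functional as F


@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, n_bytes: int,
                       window: int = 1024, stride: int = 512) -> float:
    assert len(tokens) >= 2
    total_nll = 0.0
    prev_end = 0
    for start in range(0, len(tokens) - 1, stride):
        end = min(start + window, len(tokens))
        x = tokens[start:end - 1].unsqueeze(0)
        y = tokens[start + 1:end]
        logits = model(x).squeeze(0)                  # (T, vocab)
        nll = F.cross_entropy(logits, y, reduction="none")
        # Only tokens not already scored by the previous window count.
        new = max(prev_end - (start + 1), 0)
        total_nll += nll[new:].sum().item()
        prev_end = end
        if end == len(tokens):
            break
    # Convert summed nats over the token stream to bits per raw byte.
    return total_nll / math.log(2) / n_bytes
```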
Test-Time Training
- TTT (no parameters recorded; sketched below)
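
TTT parameters are likewise unrecorded. One common pattern, sketched here under an assumed step count and learning rate, is to briefly fine-tune on the evaluation context before scoring it and restore the original weights afterward; whether this PR uses that variant is not stated.

```python
# Test-time training sketch: a few gradient steps on the evaluation
# context, then the caller restores the snapshot. Step count, optimizer,
# and learning rate are assumptions; the PR records no TTT parameters.
import copy
import torch
import torch.nn.functional as F


def ttt_adapt(model, context: torch.Tensor, steps: int = 4, lr: float = 1e-4):
    snapshot = copy.deepcopy(model.state_dict())
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        logits = model(context[:-1].unsqueeze(0)).squeeze(0)
        loss = F.cross_entropy(logits, context[1:])
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()
    return snapshot  # restore via model.load_state_dict(snapshot)
```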
Quantization
- GPTQ, 6-bit: attention Q/K/V/O and MLP fc/proj weights
- GPTQ, 8-bit: token embedding (bit-width/scope mapping sketched below)
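
GPTQ itself corrects rounding error column by column using second-order input statistics and is not reproduced here; the sketch below uses plain symmetric round-to-nearest as a stand-in, only to show the 6-bit weight / 8-bit embedding split. The module-name patterns (`wte`, `attn`, `fc`, `proj`) are assumptions.

```python
# Bit-width/scope mapping sketch. Round-to-nearest stands in for GPTQ;
# module name patterns are assumptions about the checkpoint layout.
import torch


def quantize_symmetric(w: torch.Tensor, bits: int):
    qmax = 2 ** (bits - 1) - 1                      # e.g. 31 for 6-bit
    scale = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-12)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                  # int8 holds 6/8-bit codes


def quantize_checkpoint(state_dict: dict):
    out = {}
    for name, w in state_dict.items():
        if "wte" in name:                            # token embedding -> 8-bit
            out[name] = quantize_symmetric(w, bits=8)
        elif any(k in name for k in ("attn", "fc", "proj")):
            out[name] = quantize_symmetric(w, bits=6)  # attn + MLP weights
        else:
            out[name] = (w, None)                    # leave the rest as-is
    return out
```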
Regularization
- Structured pruning of MLP hidden channels, per-block capped strategy; ablation zeroes the selected fc rows and the matching proj columns
- Structured pruning of MLP hidden channels, soft-cap strategy: score = 0.7 × activation-weighted score + 0.3 × norm score, local_rank_weight 0.75, cap_multiplier 1.75, floor_multiplier 0 (selection sketched below)
Compression
- Brotli (level not recorded)
- rANS (level not recorded; size-estimate sketch below)
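
The rANS entry corresponds to the static q-symbol entropy-coding estimate listed under Novel Contributions: with a single static frequency table, an arithmetic or rANS coder approaches the Shannon entropy of the quantized-symbol histogram. A minimal estimator follows; the table-cost accounting is an assumption.

```python
# Static entropy-coding size estimate: Shannon entropy of the quantized
# symbol histogram plus an assumed per-symbol table cost.
import numpy as np


def rans_size_estimate(q_symbols: np.ndarray, bits_per_count: int = 16) -> int:
    # q_symbols: flat array of integer quantization codes (e.g. 6-bit).
    values, counts = np.unique(q_symbols, return_counts=True)
    p = counts / counts.sum()
    entropy_bits = -(p * np.log2(p)).sum() * counts.sum()
    table_bits = len(values) * bits_per_count       # static frequency table
    return int(np.ceil((entropy_bits + table_bits) / 8))  # bytes
```

Such an estimate can be compared against the actual Brotli output size when deciding which coder to use for a given artifact.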
Novel Contributions
- Per-block capped structured pruning of MLP hidden channels, preventing pruning from collapsing into a single late block
- Soft-cap pruning that blends within-block and global channel ranking with relaxed per-block caps
- Static arithmetic/rANS-style q-symbol entropy coding estimates for quantized weights
- Combination of pruning and entropy coding as a path toward smaller non-record checkpoints
- Zero-ablation evaluation workflow for structured pruning masks (a minimal sketch follows this list)
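
A minimal sketch of the zero-ablation step for an MLP channel mask, assuming GPT-style module names (`c_fc`, `c_proj`): the selected fc rows and matching proj columns are zeroed in place, so the pruned network can be re-scored without reshaping any tensors.

```python
# Zero-ablation sketch: zero the fc rows that produce the pruned hidden
# channels and the proj columns that consume them, then re-run evaluation
# to measure the val_bpb impact. Module names are assumptions.
import torch


@torch.no_grad()
def apply_zero_ablation(block, channel_mask: torch.Tensor):
    # channel_mask: bool tensor over MLP hidden channels, True = keep.
    dead = ~channel_mask
    block.mlp.c_fc.weight[dead, :] = 0.0     # fc rows produce the channel
    if block.mlp.c_fc.bias is not None:
        block.mlp.c_fc.bias[dead] = 0.0
    block.mlp.c_proj.weight[:, dead] = 0.0   # proj columns consume it
```

Because shapes are unchanged, the same checkpoint loader and evaluation harness apply as-is, and the zeroed rows and columns are exactly the kind of low-entropy content the entropy-coding stage above exploits on the path toward smaller checkpoints.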