PR #1849 (Open, Draft)
Base PR #1493 + pruning + q-symbol entropy coding compute-grant checkpoint
by VedantKmr0
val_bpb: 1.0850
Architecture: Transformer
Optimizer: —
Artifact Size: 16,034,750 bytes
Training Techniques
- Weight Averaging: EMA (no parameters recorded; sketched below)
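
The PR records no EMA parameters, so the following is only a minimal sketch of EMA weight averaging over model parameters; the decay value is an assumed placeholder, and buffers are omitted for brevity.

```python
# Minimal EMA weight-averaging sketch. The decay value is an assumption;
# the PR records no EMA parameters. Buffers are omitted for brevity.
import copy
import torch


class EMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy holds the running average of the weights.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * model
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)
```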
Evaluation
- Sliding-window evaluation (no parameters recorded; sketched below)
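
No window or stride is recorded; below is a minimal sketch of sliding-window bits-per-byte scoring, assuming a 1024-token window, a 512-token stride, and a model that maps token ids to logits. Only tokens not already scored by the previous window contribute to the loss.

```python
# Sliding-window val_bpb sketch. Window/stride values and the model
# interface are assumptions; the PR records no evaluation parameters.
import math
import torch
import torch.nn.functional as F


@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, n_bytes: int,
                       window: int = 1024, stride: int = 512) -> float:
    assert len(tokens) >= 2
    total_nll = 0.0
    prev_end = 0
    for start in range(0, len(tokens) - 1, stride):
        end = min(start + window, len(tokens))
        x = tokens[start:end - 1].unsqueeze(0)
        y = tokens[start + 1:end]
        logits = model(x).squeeze(0)                  # (T, vocab)
        nll = F.cross_entropy(logits, y, reduction="none")
        # Only tokens not already scored by the previous window count.
        new = max(prev_end - (start + 1), 0)
        total_nll += nll[new:].sum().item()
        prev_end = end
        if end == len(tokens):
            break
    # Convert summed nats over the token stream to bits per raw byte.
    return total_nll / math.log(2) / n_bytes
```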
Test-Time Training
- TTT (no parameters recorded; sketched below)
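
TTT parameters are likewise unrecorded. One common pattern, sketched here under an assumed step count and learning rate, is to briefly fine-tune on the evaluation context before scoring it and restore the original weights afterward; whether this PR uses that variant is not stated.

```python
# Test-time training sketch: a few gradient steps on the evaluation
# context, then the caller restores the snapshot. Step count, optimizer,
# and learning rate are assumptions; the PR records no TTT parameters.
import copy
import torch
import torch.nn.functional as F


def ttt_adapt(model, context: torch.Tensor, steps: int = 4, lr: float = 1e-4):
    snapshot = copy.deepcopy(model.state_dict())
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        logits = model(context[:-1].unsqueeze(0)).squeeze(0)
        loss = F.cross_entropy(logits, context[1:])
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()
    return snapshot  # restore via model.load_state_dict(snapshot)
```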
Quantization
- GPTQ, 6-bit: attention Q/K/V/O and MLP fc/proj weights
- GPTQ, 8-bit: token embedding (bit-width/scope mapping sketched below)
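
GPTQ itself corrects rounding error column by column using second-order input statistics and is not reproduced here; the sketch below uses plain symmetric round-to-nearest as a stand-in, only to show the 6-bit weight / 8-bit embedding split. The module-name patterns (`wte`, `attn`, `fc`, `proj`) are assumptions.

```python
# Bit-width/scope mapping sketch. Round-to-nearest stands in for GPTQ;
# module name patterns are assumptions about the checkpoint layout.
import torch


def quantize_symmetric(w: torch.Tensor, bits: int):
    qmax = 2 ** (bits - 1) - 1                      # e.g. 31 for 6-bit
    scale = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-12)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                  # int8 holds 6/8-bit codes


def quantize_checkpoint(state_dict: dict):
    out = {}
    for name, w in state_dict.items():
        if "wte" in name:                            # token embedding -> 8-bit
            out[name] = quantize_symmetric(w, bits=8)
        elif any(k in name for k in ("attn", "fc", "proj")):
            out[name] = quantize_symmetric(w, bits=6)  # attn + MLP weights
        else:
            out[name] = (w, None)                    # leave the rest as-is
    return out
```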
Regularization
- Structured pruning of MLP hidden channels, per-block capped strategy; ablation zeroes the selected fc rows and the matching proj columns
- Structured pruning of MLP hidden channels, soft-cap strategy: score = 0.7 × activation-weighted score + 0.3 × norm score, local_rank_weight 0.75, cap_multiplier 1.75, floor_multiplier 0 (selection sketched below)
Compression
- Brotli (level not recorded)
- rANS (level not recorded; size-estimate sketch below)
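
The rANS entry corresponds to the static q-symbol entropy-coding estimate listed under Novel Contributions: with a single static frequency table, an arithmetic or rANS coder approaches the Shannon entropy of the quantized-symbol histogram. A minimal estimator follows; the table-cost accounting is an assumption.

```python
# Static entropy-coding size estimate: Shannon entropy of the quantized
# symbol histogram plus an assumed per-symbol table cost.
import numpy as np


def rans_size_estimate(q_symbols: np.ndarray, bits_per_count: int = 16) -> int:
    # q_symbols: flat array of integer quantization codes (e.g. 6-bit).
    values, counts = np.unique(q_symbols, return_counts=True)
    p = counts / counts.sum()
    entropy_bits = -(p * np.log2(p)).sum() * counts.sum()
    table_bits = len(values) * bits_per_count       # static frequency table
    return int(np.ceil((entropy_bits + table_bits) / 8))  # bytes
```

Such an estimate can be compared against the actual Brotli output size when deciding which coder to use for a given artifact.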
Novel Contributions
- Per-block capped structured pruning of MLP hidden channels, preventing pruning from collapsing into a single late block
- Soft-cap pruning that blends within-block and global channel ranking with relaxed per-block caps
- Static arithmetic/rANS-style q-symbol entropy coding estimates for quantized weights
- Combination of pruning and entropy coding as a path toward smaller non-record checkpoints
- Zero-ablation evaluation workflow for structured pruning masks (a minimal sketch follows this list)
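
A minimal sketch of the zero-ablation step for an MLP channel mask, assuming GPT-style module names (`c_fc`, `c_proj`): the selected fc rows and matching proj columns are zeroed in place, so the pruned network can be re-scored without reshaping any tensors.

```python
# Zero-ablation sketch: zero the fc rows that produce the pruned hidden
# channels and the proj columns that consume them, then re-run evaluation
# to measure the val_bpb impact. Module names are assumptions.
import torch


@torch.no_grad()
def apply_zero_ablation(block, channel_mask: torch.Tensor):
    # channel_mask: bool tensor over MLP hidden channels, True = keep.
    dead = ~channel_mask
    block.mlp.c_fc.weight[dead, :] = 0.0     # fc rows produce the channel
    if block.mlp.c_fc.bias is not None:
        block.mlp.c_fc.bias[dead] = 0.0
    block.mlp.c_proj.weight[:, dead] = 0.0   # proj columns consume it
```

Because shapes are unchanged, the same checkpoint loader and evaluation harness apply as-is, and the zeroed rows and columns are exactly the kind of low-entropy content the entropy-coding stage above exploits on the path toward smaller checkpoints.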