PR #1833

open

WIP Record: SP8192 + CaseOps + Depth Curriculum + FreqGPTQ + PPM adaptive-λ mixture — val_bpb 0.90687688 (1-seed)

by pragnyanramtha
val_bpb: 0.9069
Architecture: Transformer
Optimizer:
Artifact Size: ~24.5 MB

Training Techniques

Architecture
depth recurrence
Depth curriculum stack with CaseOps and depth progression 1→3→4.
parameters: {"layers":[1,3,4]}
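A minimal sketch of what a depth-progression schedule matching `{"layers":[1,3,4]}` could look like. All names here (`active_depth`, `forward`, the equal-length phase split) are assumptions for illustration, not the PR's actual code; the record does not specify how phases are timed or how CaseOps interacts with the schedule.

```python
# Hypothetical depth-curriculum sketch: the number of active transformer
# layers grows 1 -> 3 -> 4 across training phases, per the record's
# parameters {"layers": [1, 3, 4]}. Phase boundaries are assumed equal-length.

DEPTH_SCHEDULE = [1, 3, 4]

def active_depth(step: int, total_steps: int, schedule=DEPTH_SCHEDULE) -> int:
    """Return how many layers are active at `step`, splitting training
    into one equal-length phase per schedule entry."""
    phase_len = total_steps / len(schedule)
    phase = min(int(step / phase_len), len(schedule) - 1)
    return schedule[phase]

def forward(x, layers, depth):
    """Run only the first `depth` layers; the rest are skipped this phase."""
    for layer in layers[:depth]:
        x = layer(x)
    return x
```

For example, with 300 total steps the model would run 1 layer for steps 0–99, 3 layers for steps 100–199, and all 4 layers thereafter.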
Quantization
GPTQ
bits: 6
scope: all
Other
other
FreqGPTQ: upweights the top-100 most frequent calibration tokens by 2× during Hessian collection to improve int6 quantization quality on frequent vocabulary items.
parameters: {"top_k":100,"weight_multiplier":2}
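The FreqGPTQ idea above can be sketched as a weighted Hessian accumulation. GPTQ builds a per-layer Hessian from calibration activations, roughly H = Σᵢ xᵢxᵢᵀ; here each sample's contribution is scaled by `weight_multiplier` when its token is among the `top_k` most frequent calibration tokens. The function name and signature are assumptions, not the PR's code, and plain nested lists stand in for the real tensor math.

```python
from collections import Counter

def freq_weighted_hessian(acts, token_ids, calib_tokens, top_k=100, mult=2.0):
    """Accumulate H[i][j] = sum_s w_s * x_s[i] * x_s[j] over calibration
    samples, with w_s = mult when sample s's token is among the top_k most
    frequent calibration tokens, else 1.0. (Illustrative sketch only.)"""
    frequent = {tok for tok, _ in Counter(calib_tokens).most_common(top_k)}
    d = len(acts[0])
    H = [[0.0] * d for _ in range(d)]
    for x, tok in zip(acts, token_ids):
        w = mult if tok in frequent else 1.0
        for i in range(d):
            for j in range(d):
                H[i][j] += w * x[i] * x[j]
    return H
```

Upweighting frequent tokens biases the quantization error minimization toward directions that matter for the vocabulary items seen most often at int6 precision.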
other
PPM-D adaptive-λ mixture: byte-level PPM order-5 predictor mixed with NN log-probs at evaluation time using an adaptive gate.
parameters: {"ppm_order":5,"lambda_high_confidence":0.05,"lambda_low_confidence":0.9,"confidence_threshold":0.9}
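A possible reading of the adaptive gate, using the parameter values from the record: when the NN's top probability clears `confidence_threshold`, the PPM predictor gets weight `lambda_high_confidence`; otherwise `lambda_low_confidence`. Mixing in probability space before taking logs is an assumption here (the record only says the gate mixes with NN log-probs), as is every function name.

```python
import math

# Parameter values from the record; the gating logic itself is a guess.
LAMBDA_HI_CONF = 0.05   # PPM weight when the NN is confident
LAMBDA_LO_CONF = 0.9    # PPM weight when the NN is uncertain
CONF_THRESHOLD = 0.9

def mix_logprobs(nn_probs, ppm_probs):
    """Per-byte mixture p = (1 - lam) * p_nn + lam * p_ppm, with lam chosen
    by whether the NN's max probability clears the confidence threshold."""
    lam = LAMBDA_HI_CONF if max(nn_probs) >= CONF_THRESHOLD else LAMBDA_LO_CONF
    return [math.log((1 - lam) * pn + lam * pp)
            for pn, pp in zip(nn_probs, ppm_probs)]
```

Under this reading, a confident NN distribution is barely perturbed (λ = 0.05), while an uncertain one is mostly replaced by the order-5 PPM prediction (λ = 0.9).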
Test-Time Training
full TTT
parameters: null
Weight Averaging
EMA
parameters: null

Novel Contributions

  • FreqGPTQ frequency-weighted calibration for GPTQ
  • PPM-D adaptive-λ mixture at evaluation time
  • Depth curriculum stack with CaseOps
  • Single-seed screening run with reported val_bpb improvement