PR #1833

open

WIP Record: SP8192 + CaseOps + Depth Curriculum + FreqGPTQ + PPM adaptive-λ mixture — val_bpb 0.90687688 (1-seed)

by pragnyanramtha
val_bpb: 0.9069
Architecture: Transformer
Optimizer:
Artifact Size: ~24.5 MB

Training Techniques

Architecture
depth recurrence
Depth curriculum stack with CaseOps and depth progression 1→3→4.
parameters: {"layers":[1,3,4]}
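A minimal sketch of what a depth-progression schedule matching `{"layers":[1,3,4]}` could look like. All names here (`active_depth`, `forward`, the equal-length phase split) are assumptions for illustration, not the PR's actual code; the record does not specify how phases are timed or how CaseOps interacts with the schedule.

```python
# Hypothetical depth-curriculum sketch: the number of active transformer
# layers grows 1 -> 3 -> 4 across training phases, per the record's
# parameters {"layers": [1, 3, 4]}. Phase boundaries are assumed equal-length.

DEPTH_SCHEDULE = [1, 3, 4]

def active_depth(step: int, total_steps: int, schedule=DEPTH_SCHEDULE) -> int:
    """Return how many layers are active at `step`, splitting training
    into one equal-length phase per schedule entry."""
    phase_len = total_steps / len(schedule)
    phase = min(int(step / phase_len), len(schedule) - 1)
    return schedule[phase]

def forward(x, layers, depth):
    """Run only the first `depth` layers; the rest are skipped this phase."""
    for layer in layers[:depth]:
        x = layer(x)
    return x
```

For example, with 300 total steps the model would run 1 layer for steps 0–99, 3 layers for steps 100–199, and all 4 layers thereafter.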
Quantization
GPTQ
bits: 6
scope: all
Other
other
FreqGPTQ: upweights the top-100 most frequent calibration tokens by 2× during Hessian collection to improve int6 quantization quality on frequent vocabulary items.
parameters: {"top_k":100,"weight_multiplier":2}
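The FreqGPTQ idea above can be sketched as a weighted Hessian accumulation. GPTQ builds a per-layer Hessian from calibration activations, roughly H = Σᵢ xᵢxᵢᵀ; here each sample's contribution is scaled by `weight_multiplier` when its token is among the `top_k` most frequent calibration tokens. The function name and signature are assumptions, not the PR's code, and plain nested lists stand in for the real tensor math.

```python
from collections import Counter

def freq_weighted_hessian(acts, token_ids, calib_tokens, top_k=100, mult=2.0):
    """Accumulate H[i][j] = sum_s w_s * x_s[i] * x_s[j] over calibration
    samples, with w_s = mult when sample s's token is among the top_k most
    frequent calibration tokens, else 1.0. (Illustrative sketch only.)"""
    frequent = {tok for tok, _ in Counter(calib_tokens).most_common(top_k)}
    d = len(acts[0])
    H = [[0.0] * d for _ in range(d)]
    for x, tok in zip(acts, token_ids):
        w = mult if tok in frequent else 1.0
        for i in range(d):
            for j in range(d):
                H[i][j] += w * x[i] * x[j]
    return H
```

Upweighting frequent tokens biases the quantization error minimization toward directions that matter for the vocabulary items seen most often at int6 precision.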
other
PPM-D adaptive-λ mixture: byte-level PPM order-5 predictor mixed with NN log-probs at evaluation time using an adaptive gate.
parameters: {"ppm_order":5,"lambda_high_confidence":0.05,"lambda_low_confidence":0.9,"confidence_threshold":0.9}
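A possible reading of the adaptive gate, using the parameter values from the record: when the NN's top probability clears `confidence_threshold`, the PPM predictor gets weight `lambda_high_confidence`; otherwise `lambda_low_confidence`. Mixing in probability space before taking logs is an assumption here (the record only says the gate mixes with NN log-probs), as is every function name.

```python
import math

# Parameter values from the record; the gating logic itself is a guess.
LAMBDA_HI_CONF = 0.05   # PPM weight when the NN is confident
LAMBDA_LO_CONF = 0.9    # PPM weight when the NN is uncertain
CONF_THRESHOLD = 0.9

def mix_logprobs(nn_probs, ppm_probs):
    """Per-byte mixture p = (1 - lam) * p_nn + lam * p_ppm, with lam chosen
    by whether the NN's max probability clears the confidence threshold."""
    lam = LAMBDA_HI_CONF if max(nn_probs) >= CONF_THRESHOLD else LAMBDA_LO_CONF
    return [math.log((1 - lam) * pn + lam * pp)
            for pn, pp in zip(nn_probs, ppm_probs)]
```

Under this reading, a confident NN distribution is barely perturbed (λ = 0.05), while an uncertain one is mostly replaced by the order-5 PPM prediction (λ = 0.9).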
Test-Time Training
full TTT
parameters: null
Weight Averaging
EMA
parameters: null

Novel Contributions

  • FreqGPTQ frequency-weighted calibration for GPTQ
  • PPM-D adaptive-λ mixture at evaluation time
  • Depth curriculum stack with CaseOps
  • Single-seed screening run with reported val_bpb improvement