PR #309 (open)

Record: CLASE-Quant adaptive layer quantization (val_bpb=1.1914)

by NewyorkDevView on GitHub
val_bpb: 1.1914
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.5 MB

Training Techniques

Quantization
mixed int6/int8
scope: boundary layers int8, middle layers int6, tied embeddings fp16, control tensors fp32
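The allocation above can be sketched as follows; the two-block boundary cutoff and the helper names are illustrative assumptions, not taken from the PR:

```python
def layer_bits(layer_idx, n_layers, n_boundary=2):
    # Hypothetical rule: the first/last n_boundary transformer blocks stay
    # int8, the middle blocks drop to int6 (the exact cutoff is an assumption).
    if layer_idx < n_boundary or layer_idx >= n_layers - n_boundary:
        return 8
    return 6

def quantize(values, bits):
    # Symmetric per-tensor quantization to signed `bits`-wide integers.
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]
```

Per this scheme, layers nearest the input and output keep the extra two bits of precision, while interior layers absorb the 6-bit rounding error.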
Architecture
tied embeddings
FP16 passthrough for the tied input/output embeddings, kept unquantized due to their dual role and quantization sensitivity.
KV head count
GQA architecture with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
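With 8 query heads over 4 KV heads, each KV head is shared by two query heads; a minimal sketch of the GQA mapping and the resulting KV-cache saving (function names are illustrative):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    # Under GQA, consecutive query heads share one KV head:
    # heads 0-1 -> KV head 0, heads 2-3 -> KV head 1, and so on.
    assert n_heads % n_kv_heads == 0
    return q_head // (n_heads // n_kv_heads)

def kv_cache_ratio(n_heads=8, n_kv_heads=4):
    # The KV cache shrinks by the group size relative to full
    # multi-head attention (2x for this 8/4 configuration).
    return n_heads // n_kv_heads
```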
weight tying
Input and output embeddings share one weight matrix.
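Weight tying reuses one matrix for both the input token lookup and the output projection; a minimal stdlib sketch (class name and placeholder values are illustrative):

```python
class TiedEmbedding:
    def __init__(self, vocab_size, dim):
        # One weight matrix serves both roles; values here are placeholders.
        self.w = [[0.1 * (i + 1) * (j + 1) for j in range(dim)]
                  for i in range(vocab_size)]

    def embed(self, token_id):
        # Input side: row lookup.
        return self.w[token_id]

    def logits(self, hidden):
        # Output side: dot product against every row (i.e. W @ h).
        return [sum(h * x for h, x in zip(hidden, row)) for row in self.w]
```

Because the same tensor appears on both sides, quantizing it would inject error into the logits directly, which is consistent with the record's choice to keep it in FP16.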
Optimizer
Muon
momentum: 0.97
learning_rate: 0.03
batch_tokens: 393000
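Muon accumulates momentum and then orthogonalizes the 2D update with a Newton-Schulz iteration. A stdlib sketch using the quintic coefficients from the public Muon reference implementation; this run's exact variant is not shown in the record:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def ns_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration (coefficients from the public Muon
    # reference); drives singular values of the update toward ~1.
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = math.sqrt(sum(x * x for row in G for x in row)) or 1.0
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        # B = b*A + c*A@A, so the update is a*X + B@X.
        B = [[b * x + c * y for x, y in zip(ra, rb)]
             for ra, rb in zip(A, matmul(A, A))]
        X = [[a * x + y for x, y in zip(rx, rbx)]
             for rx, rbx in zip(X, matmul(B, X))]
    return X

def muon_step(w, grad, buf, lr=0.03, momentum=0.97):
    # One Muon update: momentum accumulation, then an orthogonalized step.
    buf = [[momentum * m + g for m, g in zip(rm, rg)]
           for rm, rg in zip(buf, grad)]
    upd = ns_orthogonalize(buf)
    w = [[wi - lr * u for wi, u in zip(rw, ru)] for rw, ru in zip(w, upd)]
    return w, buf
```

The iteration only approximately orthogonalizes (singular values land near 1, not exactly on it), which the Muon reference accepts as sufficient.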
Regularization
weight decay
parameters: {"start":0.02,"end":0.08,"schedule":"cosine warmdown"}
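One plausible reading of these parameters is a half-cosine ramp of weight decay from 0.02 up to 0.08 across the warmdown phase; the exact curve shape is an assumption:

```python
import math

def wd_at(step, total_steps, wd_start=0.02, wd_end=0.08):
    # Half-cosine ramp: equals wd_start at step 0 and wd_end at the final
    # step, increasing monotonically in between (shape is an assumption
    # about what "cosine warmdown" means here).
    t = min(max(step / total_steps, 0.0), 1.0)
    return wd_end + (wd_start - wd_end) * 0.5 * (1.0 + math.cos(math.pi * t))
```

Raising weight decay late in training pulls weight magnitudes in, which narrows the value range the int6/int8 quantizers must cover.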
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
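A common way to implement stride-64 sliding-window evaluation is to advance a 2048-token window by 64 tokens and score only the newly exposed positions, so every scored token sees (near-)full left context; a sketch with illustrative names:

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    # Yields (start, end, score_from): score positions [score_from, end)
    # of window [start, end). The first window scores everything; later
    # windows score only their last `stride` tokens. Tokens past the last
    # full window are left unscored in this simple sketch.
    start = 0
    while start + context <= n_tokens:
        end = start + context
        score_from = start if start == 0 else end - stride
        yield start, end, score_from
        start += stride
```

This costs one forward pass per 64 scored tokens instead of one per 2048, trading compute for a tighter (lower) bpb estimate than non-overlapping chunking.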
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown cosine schedule
Weight decay ramps from 0.02 to 0.08 over the cosine warmdown (see Regularization).
Initialization
spectral init
FP16 tied embeddings with overtone spectral initialization.

Novel Contributions

  • CLASE-inspired adaptive per-layer quantization
  • Non-uniform quantization allocation with int8 boundary layers and int6 middle layers
  • FP16 passthrough for tied embeddings and FP32 passthrough for control tensors
  • Ramping weight decay during warmdown to tighten weight distributions for quantization
  • Extended context training at 2048 sequence length
  • Sliding window evaluation with stride 64