PR #309: openRecord: CLASE-Quant adaptive layer quantization (val_bpb=1.1914)
by NewyorkDevView on GitHub

val_bpb: 1.1914
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.5 MB
Training Techniques

Quantization: mixed int6/int8
- scope: boundary layers int8, middle layers int6, tied embeddings fp16, control tensors fp32
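A minimal sketch of the non-uniform allocation described above, assuming symmetric per-tensor quantization and treating only the first and last transformer layers as "boundary" layers; the function names and numpy implementation are illustrative, not the PR's code (the fp16 embedding and fp32 control-tensor passthrough would simply skip these functions):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from integer codes."""
    return q.astype(np.float32) * scale

def bits_for_layer(idx: int, n_layers: int) -> int:
    """Allocation rule from the record: boundary layers (first and last)
    keep int8; middle layers drop to int6."""
    return 8 if idx in (0, n_layers - 1) else 6
```

The boundary/middle split reflects the usual observation that the layers nearest the embedding and the output head are the most quantization-sensitive, so they retain the extra two bits.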
Architecture

- tied embeddings: FP16 tied input/output embedding passthrough due to their dual role and quantization sensitivity.
- KV head count: GQA with 8 attention heads and 4 KV heads (heads=8, kv_heads=4).
- weight tying: input and output embeddings are tied.
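With 8 query heads and 4 KV heads, each KV head serves 8 // 4 = 2 query heads. A hedged numpy sketch of that sharing (the repeat-based broadcast is the standard GQA formulation, not code from this PR; causal masking omitted for brevity):

```python
import numpy as np

def gqa_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Grouped-query attention: q has shape (n_q_heads, seq, d); k and v
    have shape (n_kv_heads, seq, d). Each KV head is shared by
    n_q_heads // n_kv_heads query heads."""
    n_q, n_kv = q.shape[0], k.shape[0]
    assert n_q % n_kv == 0
    group = n_q // n_kv                 # 8 // 4 = 2 in this record
    k = np.repeat(k, group, axis=0)     # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

Halving the KV heads halves the KV-cache size, which compounds with the int6/int8 weight quantization toward the small 11.5 MB artifact.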
Optimizer

Muon
- momentum: 0.97
- learning_rate: 0.03
- batch_tokens: 393,000
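For context, Muon maintains a momentum buffer and then approximately orthogonalizes the 2-D update with a Newton-Schulz iteration. A sketch with the record's momentum=0.97 and learning_rate=0.03; the quintic coefficients are from the public Muon reference implementation, and none of this is the PR's own code:

```python
import numpy as np

def newton_schulz_orth(g: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    """Push the singular values of g toward 1 via the quintic
    Newton-Schulz iteration used by the Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)   # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # keep the Gram matrix small
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x.T if transposed else x

def muon_step(param, grad, buf, lr=0.03, momentum=0.97):
    """One Muon update: momentum accumulation, then an orthogonalized step."""
    buf = momentum * buf + grad
    return param - lr * newton_schulz_orth(buf), buf
```

The orthogonalized update equalizes the scale of the step across directions, which is one reason Muon tolerates the relatively large 0.03 learning rate.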
Regularization

- weight decay: cosine warmdown schedule ramping from 0.02 to 0.08.
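The 0.02 → 0.08 endpoints are from the record; the exact cosine shape below is an assumed half-cosine ramp over the warmdown phase, written as an illustrative helper rather than the PR's code:

```python
import math

def wd_schedule(step: int, warmdown_steps: int,
                start: float = 0.02, end: float = 0.08) -> float:
    """Cosine ramp of weight decay from `start` to `end` across the
    warmdown phase; clamped outside [0, warmdown_steps]."""
    t = min(max(step / warmdown_steps, 0.0), 1.0)
    return start + (end - start) * 0.5 * (1.0 - math.cos(math.pi * t))
```

Ramping decay upward late in training squeezes the weight distribution into a narrower range, so the int6/int8 quantization grid covers it with less clipping error.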
Evaluation

- sliding window eval: stride 64, context length 2048.
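A sketch of the window bookkeeping this implies, assuming the common convention that each 2048-token window scores only its final 64 tokens so every token is evaluated exactly once with near-full left context (the convention is an assumption; the record gives only stride and context length):

```python
def sliding_window_spans(n_tokens: int, context: int = 2048, stride: int = 64):
    """Return (window_start, window_end, score_start) triples: each window
    covers at most `context` tokens and scores only [score_start, window_end),
    so every token is scored exactly once."""
    spans = []
    pos = 0  # index of the next token to score
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)   # fill the rest of the window with context
        spans.append((start, end, pos))
        pos = end
    return spans
```

A small stride like 64 gives each scored token close to the full 2048 tokens of context, at the cost of roughly context/stride = 32 forward passes per scored span of text.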
Sequence Length

- train_length: 2048
- eval_length: 2048
LR Schedule

- warmdown cosine schedule (weight decay ramps from 0.02 to 0.08 over the warmdown).
Initialization

- spectral init: FP16 tied embeddings with overtone spectral initialization.
Novel Contributions
- CLASE-inspired adaptive per-layer quantization
- Non-uniform quantization allocation with int8 boundary layers and int6 middle layers
- FP16 passthrough for tied embeddings and FP32 passthrough for control tensors
- Ramping weight decay during warmdown to tighten weight distributions for quantization
- Extended context training at 2048 sequence length
- Sliding window evaluation with stride 64