PR #309 (open)

Record: CLASE-Quant adaptive layer quantization (val_bpb=1.1914)

by NewyorkDevView on GitHub
val_bpb: 1.1914
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.5 MB

Training Techniques

Quantization
mixed int6/int8
scope: boundary layers int8, middle layers int6, tied embeddings fp16, control tensors fp32
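The allocation above can be sketched as follows; the two-block boundary cutoff and the helper names are illustrative assumptions, not taken from the PR:

```python
def layer_bits(layer_idx, n_layers, n_boundary=2):
    # Hypothetical rule: the first/last n_boundary transformer blocks stay
    # int8, the middle blocks drop to int6 (the exact cutoff is an assumption).
    if layer_idx < n_boundary or layer_idx >= n_layers - n_boundary:
        return 8
    return 6

def quantize(values, bits):
    # Symmetric per-tensor quantization to signed `bits`-wide integers.
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]
```

Per this scheme, layers nearest the input and output keep the extra two bits of precision, while interior layers absorb the 6-bit rounding error.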
Architecture
tied embeddings
FP16 passthrough for the tied input/output embeddings, kept unquantized due to their dual role and quantization sensitivity.
KV head count
GQA architecture with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
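With 8 query heads over 4 KV heads, each KV head is shared by two query heads; a minimal sketch of the GQA mapping and the resulting KV-cache saving (function names are illustrative):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    # Under GQA, consecutive query heads share one KV head:
    # heads 0-1 -> KV head 0, heads 2-3 -> KV head 1, and so on.
    assert n_heads % n_kv_heads == 0
    return q_head // (n_heads // n_kv_heads)

def kv_cache_ratio(n_heads=8, n_kv_heads=4):
    # The KV cache shrinks by the group size relative to full
    # multi-head attention (2x for this 8/4 configuration).
    return n_heads // n_kv_heads
```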
weight tying
Input and output embeddings share one weight matrix.
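Weight tying reuses one matrix for both the input token lookup and the output projection; a minimal stdlib sketch (class name and placeholder values are illustrative):

```python
class TiedEmbedding:
    def __init__(self, vocab_size, dim):
        # One weight matrix serves both roles; values here are placeholders.
        self.w = [[0.1 * (i + 1) * (j + 1) for j in range(dim)]
                  for i in range(vocab_size)]

    def embed(self, token_id):
        # Input side: row lookup.
        return self.w[token_id]

    def logits(self, hidden):
        # Output side: dot product against every row (i.e. W @ h).
        return [sum(h * x for h, x in zip(hidden, row)) for row in self.w]
```

Because the same tensor appears on both sides, quantizing it would inject error into the logits directly, which is consistent with the record's choice to keep it in FP16.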
Optimizer
Muon
momentum: 0.97
learning_rate: 0.03
batch_tokens: 393000
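Muon accumulates momentum and then orthogonalizes the 2D update with a Newton-Schulz iteration. A stdlib sketch using the quintic coefficients from the public Muon reference implementation; this run's exact variant is not shown in the record:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def ns_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration (coefficients from the public Muon
    # reference); drives singular values of the update toward ~1.
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = math.sqrt(sum(x * x for row in G for x in row)) or 1.0
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        # B = b*A + c*A@A, so the update is a*X + B@X.
        B = [[b * x + c * y for x, y in zip(ra, rb)]
             for ra, rb in zip(A, matmul(A, A))]
        X = [[a * x + y for x, y in zip(rx, rbx)]
             for rx, rbx in zip(X, matmul(B, X))]
    return X

def muon_step(w, grad, buf, lr=0.03, momentum=0.97):
    # One Muon update: momentum accumulation, then an orthogonalized step.
    buf = [[momentum * m + g for m, g in zip(rm, rg)]
           for rm, rg in zip(buf, grad)]
    upd = ns_orthogonalize(buf)
    w = [[wi - lr * u for wi, u in zip(rw, ru)] for rw, ru in zip(w, upd)]
    return w, buf
```

The iteration only approximately orthogonalizes (singular values land near 1, not exactly on it), which the Muon reference accepts as sufficient.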
Regularization
weight decay
parameters: {"start":0.02,"end":0.08,"schedule":"cosine warmdown"}
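One plausible reading of these parameters is a half-cosine ramp of weight decay from 0.02 up to 0.08 across the warmdown phase; the exact curve shape is an assumption:

```python
import math

def wd_at(step, total_steps, wd_start=0.02, wd_end=0.08):
    # Half-cosine ramp: equals wd_start at step 0 and wd_end at the final
    # step, increasing monotonically in between (shape is an assumption
    # about what "cosine warmdown" means here).
    t = min(max(step / total_steps, 0.0), 1.0)
    return wd_end + (wd_start - wd_end) * 0.5 * (1.0 + math.cos(math.pi * t))
```

Raising weight decay late in training pulls weight magnitudes in, which narrows the value range the int6/int8 quantizers must cover.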
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
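A common way to implement stride-64 sliding-window evaluation is to advance a 2048-token window by 64 tokens and score only the newly exposed positions, so every scored token sees (near-)full left context; a sketch with illustrative names:

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    # Yields (start, end, score_from): score positions [score_from, end)
    # of window [start, end). The first window scores everything; later
    # windows score only their last `stride` tokens. Tokens past the last
    # full window are left unscored in this simple sketch.
    start = 0
    while start + context <= n_tokens:
        end = start + context
        score_from = start if start == 0 else end - stride
        yield start, end, score_from
        start += stride
```

This costs one forward pass per 64 scored tokens instead of one per 2048, trading compute for a tighter (lower) bpb estimate than non-overlapping chunking.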
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown cosine schedule
Weight decay ramps from 0.02 to 0.08 over the cosine warmdown (see Regularization).
Initialization
spectral init
FP16 tied embeddings with overtone spectral initialization.

Novel Contributions

  • CLASE-inspired adaptive per-layer quantization
  • Non-uniform quantization allocation with int8 boundary layers and int6 middle layers
  • FP16 passthrough for tied embeddings and FP32 passthrough for control tensors
  • Ramping weight decay during warmdown to tighten weight distributions for quantization
  • Extended context training at 2048 sequence length
  • Sliding window evaluation with stride 64