val_bpb: 1.1779
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,981,108 bytes
Training Techniques
Architecture
- tied embeddings: input and output embeddings share one weight matrix.
- KV head count: grouped-query attention with fewer KV heads than query heads (num_heads: 8, num_kv_heads: 4).
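With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. A minimal NumPy sketch of that head sharing (the function name and shapes are illustrative, not from the run's code):

```python
import numpy as np

def repeat_kv(kv, num_heads, num_kv_heads):
    """Expand KV heads so each group of query heads shares one KV head.

    kv: array of shape (num_kv_heads, seq_len, head_dim)
    returns: array of shape (num_heads, seq_len, head_dim)
    """
    group = num_heads // num_kv_heads  # query heads per KV head
    return np.repeat(kv, group, axis=0)

# 4 KV heads expanded to serve 8 query heads (2 queries per KV head).
kv = np.arange(4 * 3 * 2, dtype=np.float32).reshape(4, 3, 2)
expanded = repeat_kv(kv, num_heads=8, num_kv_heads=4)
assert expanded.shape == (8, 3, 2)
assert np.array_equal(expanded[0], expanded[1])  # heads 0 and 1 share KV head 0
```

Halving the KV head count shrinks the KV projection weights and cache while leaving the query side at full width.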
Quantization
- int8 (bits: 8), applied to the final model with hybrid fp16/int8 token embeddings.
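A sketch of symmetric per-tensor int8 quantization with an optional clipping fraction, matching the clip-search idea listed under Other below. The exact quantization scheme (symmetric, per-tensor) is an assumption; the run's export may differ:

```python
import numpy as np

def quantize_int8(w, clip=1.0):
    """Symmetric int8 quantization; clip < 1.0 shrinks the max-abs range
    before computing the scale (clip=1.0 means no clipping)."""
    max_abs = float(np.abs(w).max()) * clip
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(16, 16)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
assert err <= scale  # round-trip error bounded by one quantization step
```

Storing int8 weights plus one fp32 scale per tensor is what brings the artifact under the 16 MB cap.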
Evaluation
- sliding window eval (context_length: 2048, stride_tokens: 512).
Sequence Length
- train_length: 2048, eval_length: 2048.
Optimizer
- Muon (momentum: 0.99, weight_decay: null; matrix_lr: 0.02, scalar_lr: 0.02, tied_embed_lr: 0.03).
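Muon applies an approximate orthogonalization to each weight matrix's momentum buffer via a Newton-Schulz iteration before taking the update step. A minimal NumPy sketch of that iteration, using the quintic coefficients from the public Muon reference implementation (whether this run used those exact coefficients is an assumption):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Push the singular values of G toward 1 with a quintic Newton-Schulz
    iteration, as Muon does to each matrix's momentum before the update."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize by Frobenius norm
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

# Singular values 3, 2, 1, 0.5 are all driven toward 1.
O = newton_schulz_orth(np.diag([3.0, 2.0, 1.0, 0.5]))
s = np.linalg.svd(O, compute_uv=False)
assert s.max() < 1.5 and s.min() > 0.4
```

The separate matrix/scalar/tied-embedding learning rates reflect that Muon only handles 2-D weight matrices; scalars and embeddings typically go through a different optimizer such as Adam.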
LR Schedule
- warmup + warmdown (warmup_steps: 20, warmdown_iters: 3000).
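A sketch of a trapezoidal schedule matching the listed parameters: 20 steps of linear warmup, a constant plateau, then a 3000-iteration linear warmdown to zero. The exact shape (linear ramps, decay to exactly zero) is an assumption:

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR multiplier: linear warmup, flat middle, linear warmdown."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0

assert lr_scale(0, 5000) == 1 / 20        # first warmup step
assert lr_scale(100, 5000) == 1.0         # flat phase
assert lr_scale(3500, 5000) == 0.5        # halfway through warmdown
assert lr_scale(5000, 5000) == 0.0        # fully decayed
```

The very short warmup (20 steps) fits a speedrun setting where nearly the entire budget should be spent at full learning rate.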
Regularization
- gradient clipping (grad_clip_norm: 0.3).
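Clipping at a global norm of 0.3 rescales all gradients together whenever their combined L2 norm exceeds the threshold. A minimal NumPy sketch (whether clipping here is global or per-tensor is an assumption; global-norm is the common default):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=0.3):
    """Scale all gradients down together if their global L2 norm exceeds max_norm."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

grads = [np.ones((2, 2)), np.ones(3)]  # global norm = sqrt(4 + 3) ~ 2.65
clipped, norm = clip_by_global_norm(grads, max_norm=0.3)
new_norm = np.sqrt(sum(np.sum(g * g) for g in clipped))
assert abs(new_norm - 0.3) < 1e-6
```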
Compression
- zlib (level: null).
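The artifact size is presumably measured after zlib compression of the serialized weights. A sketch using Python's standard zlib module (the level is listed as null, so level 9 below is an assumption):

```python
import zlib

def compressed_size(payload, level=9):
    """Return the zlib-compressed size in bytes of a serialized artifact."""
    return len(zlib.compress(payload, level))

payload = b"\x00" * 10_000  # highly compressible dummy "weights"
assert compressed_size(payload) < len(payload)
```

Quantizing to int8 first helps here too: low-entropy integer weight bytes compress better than raw fp16.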
Other
- Clip-search post-training quantization over candidate clipping thresholds (candidates: [1, 0.95, 0.9, 0.85]).
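Clip-search PTQ tries each candidate clipping threshold per tensor and keeps the one with the best round-trip quality. A sketch assuming symmetric int8 quantization and reconstruction MSE as the selection metric (the actual metric used by the run is an assumption):

```python
import numpy as np

def clip_search(w, candidates=(1.0, 0.95, 0.9, 0.85)):
    """Pick the clipping fraction that minimizes int8 round-trip MSE.

    Each candidate scales the max-abs range before quantizing; clipping a few
    outliers can shrink the step size enough to reduce overall error.
    """
    best = None
    for clip in candidates:
        scale = (float(np.abs(w).max()) * clip) / 127.0
        q = np.clip(np.round(w / scale), -127, 127)
        mse = float(np.mean((q * scale - w) ** 2))
        if best is None or mse < best[1]:
            best = (clip, mse)
    return best[0]

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
assert clip_search(w) in (1.0, 0.95, 0.9, 0.85)
```

Because the search runs after training on fixed weights, it costs only a few quantization passes per tensor and needs no retraining.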
Novel Contributions
- Clean under-cap 8xH100 snapshot for the 10-minute / 16,000,000-byte track
- Clip-search PTQ
- Hybrid fp16/int8 export for token embeddings with top rows kept in fp16
- Sliding-window validation at 2048 context with 512-token stride
- Tied-embedding dense transformer baseline with grouped KV heads