PR #256 (open)

DenseContextQuantTrim 8xH100: 1.1779 val_bpb

by IvGolovach
val_bpb: 1.1779
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,981,108 bytes

Training Techniques

Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
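A minimal sketch of what weight tying means here (toy values and names, not taken from the PR): the logits projection reuses the token-embedding table, so the matrix is stored only once.

```python
# Hypothetical illustration of tied embeddings: the output (logits)
# projection reuses the token-embedding matrix E instead of a second matrix.
E = [[0.1, 0.2],
     [0.3, 0.4]]  # vocab_size x d_model embedding table (toy values)

def embed(token_id):
    # input side: look up the token's embedding row
    return E[token_id]

def logits(hidden):
    # output side: project the hidden state against the SAME table E
    return [sum(h * e for h, e in zip(hidden, row)) for row in E]
```

With tying, the exported artifact only has to store E once for both ends of the model, which matters under a 16,000,000-byte cap.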
KV head count
Uses grouped-query style attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
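The head layout recorded above can be sketched as a simple mapping (illustrative only): 8 query heads share 4 KV heads, so each KV head serves a group of 2 query heads.

```python
# Hypothetical sketch of the grouped-query head mapping:
# num_heads=8 query heads, num_kv_heads=4 shared KV heads.
def kv_head_for_query(q_head, num_heads=8, num_kv_heads=4):
    group_size = num_heads // num_kv_heads  # queries per KV head (here 2)
    return q_head // group_size

mapping = [kv_head_for_query(h) for h in range(8)]
# → [0, 0, 1, 1, 2, 2, 3, 3]: adjacent query-head pairs share one KV head,
#   halving the KV cache relative to full multi-head attention
```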
Quantization
int8
bits: 8
scope: final model with hybrid fp16/int8 token embeddings
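A hedged sketch of what a hybrid fp16/int8 embedding export could look like (the row-selection rule and helper names are assumptions, not from the PR): selected rows stay at full precision, the rest are stored as int8 with a per-row scale.

```python
def quantize_row_int8(row):
    # symmetric per-row int8: scale maps the row's absolute max to 127
    amax = max(abs(x) for x in row) or 1.0  # guard for all-zero rows
    scale = amax / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale

def hybrid_export(embeddings, fp16_rows):
    # fp16_rows: indices kept at full precision (e.g. the "top" token rows)
    out = {}
    for i, row in enumerate(embeddings):
        if i in fp16_rows:
            out[i] = ("fp16", row)
        else:
            out[i] = ("int8",) + quantize_row_int8(row)
    return out
```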
Evaluation
sliding window eval
parameters: {"context_length":2048,"stride_tokens":512}
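A sketch of the sliding-window evaluation loop those parameters imply (`nll_fn` is a stand-in for the model, not an API from the PR): each window advances by `stride` tokens, and only the tokens not covered by an earlier window are scored, so every token is evaluated exactly once with long context.

```python
import math

def sliding_window_mean_nll(tokens, nll_fn, context_length=2048, stride=512):
    total, scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context_length, len(tokens))
        nlls = nll_fn(tokens[begin:end])  # one NLL (nats) per window token
        new = end - prev_end              # tokens not scored by earlier windows
        total += sum(nlls[-new:])
        scored += new
        prev_end = end
        if end == len(tokens):
            break
    return total / scored                 # mean nats per token

def to_bpb(mean_nll_nats, bytes_per_token):
    # bits per byte: convert nats -> bits, then per-token -> per-byte
    return mean_nll_nats / math.log(2) / bytes_per_token
```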
Sequence Length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03}
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
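One common shape for a warmup + warmdown schedule with those parameters: linear warmup over 20 steps, a flat middle, and linear decay over the final 3000 iterations. The exact shape, and whether the warmdown ends at zero, are assumptions; the PR only records the two lengths.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    # returns a multiplier on the peak LR (e.g. matrix_lr=0.02)
    if step < warmup_steps:
        return (step + 1) / warmup_steps              # linear warmup
    if step > total_steps - warmdown_iters:
        return (total_steps - step) / warmdown_iters  # linear warmdown to 0
    return 1.0                                        # flat middle
```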
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
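For reference, global-norm clipping at 0.3 rescales the whole gradient vector whenever its L2 norm exceeds the cap (frameworks like PyTorch provide this as `torch.nn.utils.clip_grad_norm_`; the pure-Python version below is just illustrative).

```python
import math

def clip_by_global_norm(grads, max_norm=0.3):
    # rescale all gradients together if their joint L2 norm exceeds max_norm
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads
```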
Compression
zlib
level: null
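Since `level` is null, a sketch using zlib's default compression level for the exported artifact bytes (the variable names are illustrative):

```python
import zlib

blob = b"\x00\x01" * 50_000   # stand-in for the serialized model bytes
packed = zlib.compress(blob)  # level omitted -> zlib's default (-1)
restored = zlib.decompress(packed)  # lossless round trip
```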
Other
clip-search PTQ
Clip-search post-training quantization with candidate clipping thresholds.
parameters: {"candidates":[1,0.95,0.9,0.85]}
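A hedged sketch of clip-search PTQ over those candidates: for each fraction of the tensor's absolute max, quantize-dequantize at int8 and keep the clip that minimizes reconstruction error. The MSE objective is an assumption; the PR only lists the candidate thresholds.

```python
def clip_search_int8(weights, candidates=(1.0, 0.95, 0.9, 0.85)):
    # pick the clipping fraction of absmax with the lowest int8 round-trip MSE
    def round_trip_mse(frac):
        amax = max(abs(w) for w in weights) * frac or 1.0  # all-zero guard
        scale = amax / 127.0
        err = 0.0
        for w in weights:
            q = max(-127, min(127, round(w / scale)))  # clip + quantize
            err += (w - q * scale) ** 2                # dequantization error
        return err / len(weights)
    return min(candidates, key=round_trip_mse)
```

Tighter clips trade outlier accuracy for finer resolution on the bulk of the weights, which is why searching a few thresholds can beat plain absmax quantization.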

Novel Contributions

  • Clean under-cap 8xH100 snapshot for the 10-minute / 16,000,000-byte track
  • Clip-search PTQ
  • Hybrid fp16/int8 export for token embeddings with top rows kept in fp16
  • Sliding-window validation at 2048 context with 512-token stride
  • Tied-embedding dense transformer baseline with grouped KV heads