PR #1394


Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5 seed mean)

by clarkkev
val_bpb: 1.0856
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,985,678 bytes

Training Techniques

Quantization: GPTQ (bits: 6, scope: matrix parameters)
Quantization: GPTQ (bits: 8, scope: embeddings)
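For reference, a heavily simplified sketch of what GPTQ does differently from plain round-to-nearest: it quantizes a weight matrix column by column and feeds each column's rounding error back into the not-yet-quantized columns through the inverse Hessian of the layer inputs. The function below is illustrative only (single symmetric scale, dense matrix inverse); the real algorithm uses per-group scales, a Cholesky factorization, and block updates.

```python
import numpy as np

def gptq_quantize(W, X, bits=8, damp=0.01):
    """Simplified GPTQ sketch: quantize W (out_features x in_features)
    column by column, propagating each column's rounding error into the
    remaining columns via the inverse Hessian H = X^T X of the inputs."""
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(cols)  # dampening for stability
    Hinv = np.linalg.inv(H)

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax                  # one symmetric scale (simplified)
    Q = np.zeros_like(W)
    for j in range(cols):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        if j + 1 < cols:
            # error feedback: adjust unquantized columns to compensate
            W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```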
Architecture: depth recurrence (loop layers 4-5 twice while sharing parameters; parameters: {"layers":[4,5],"loops":2})
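The depth-recurrence schedule amounts to expanding the layer execution order so the looped span runs back to back, reusing the same weights each pass. A minimal sketch (function name and interface are ours, not from the record):

```python
def layer_schedule(n_layers, loop_layers=(4, 5), loops=2):
    """Execution order for depth recurrence: the looped span of layers
    runs `loops` times consecutively, with parameters shared across passes."""
    lo, hi = loop_layers
    return (list(range(lo))
            + list(range(lo, hi + 1)) * loops
            + list(range(hi + 1, n_layers)))

# e.g. an 8-layer model looping layers 4-5 twice:
# layer_schedule(8) -> [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]
```

The forward pass then simply applies `blocks[i]` for each `i` in the schedule, so extra depth is gained at zero parameter cost.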
Optimizer: Muon (weight_decay: null, momentum: null, other_params: {"row_normalized":true})
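Muon's core step orthogonalizes the (momentum-smoothed) gradient with a Newton-Schulz iteration. The sketch below uses the commonly published quintic coefficients and omits momentum; our reading of `row_normalized: true` (rescaling each row of the orthogonalized update to unit L2 norm) is an assumption, not something the record specifies.

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (the core of the Muon update) with the
    quintic Newton-Schulz iteration and its commonly used coefficients."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    flip = X.shape[0] > X.shape[1]   # iterate on the smaller Gram matrix
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X

def muon_row_normalized_update(grad, eps=1e-7):
    """Assumed meaning of `row_normalized: true`: rescale each row of the
    orthogonalized update to unit L2 norm before applying it."""
    U = newton_schulz_orth(grad)
    return U / (np.linalg.norm(U, axis=1, keepdims=True) + eps)
```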
Evaluation: sliding window eval (parameters: null)
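Sliding-window evaluation scores a long token stream in overlapping windows so that every token (after the first window) is predicted with substantial left context, while each token is scored exactly once. The record does not give the window or stride, so the helper below just computes the spans; `window` and `stride` values are placeholders.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (context_start, score_start, score_end) triples for
    sliding-window evaluation. Each token outside the first window is
    scored with at least `window - stride` tokens of left context, and
    the scored spans tile [0, n_tokens) exactly once."""
    spans = [(0, 0, min(window, n_tokens))]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), pos, end))
        pos = end
    return spans
```

At eval time, the model runs on `tokens[context_start:score_end]` for each span but only the losses over `[score_start, score_end)` contribute to val_bpb.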
Sequence Length: train_length: 8192, eval_length: null
Other: increase vocabulary size to 8192 using the SP8192 tokenizer/data (parameters: {"vocab_size":8192})
Other: standard-deviation-based clipping for quantization thresholds, SDClip (parameters: {"clip_scale_k":12.85,"embedding_clip_scale_k":20})
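SDClip, as described, sets the quantization clipping threshold to a multiple k of the tensor's standard deviation rather than using the raw absolute maximum, which makes the quantization grid robust to a few extreme outliers. A minimal sketch, assuming a symmetric round-to-nearest grid after clipping (the record does not specify the grid):

```python
import numpy as np

def sdclip_quantize(w, bits, clip_scale_k):
    """SDClip sketch: clip the tensor at k standard deviations, then
    quantize uniformly on a symmetric grid sized to the clip threshold.
    (The round-to-nearest mapping after clipping is our assumption.)"""
    t = clip_scale_k * w.std()           # std-based clipping threshold
    qmax = 2 ** (bits - 1) - 1
    scale = t / qmax
    clipped = np.clip(w, -t, t)
    return np.round(clipped / scale) * scale

# per the record's parameters:
#   matrix parameters: sdclip_quantize(w, bits=6, clip_scale_k=12.85)
#   embeddings:        sdclip_quantize(e, bits=8, clip_scale_k=20)
```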
Other: replace the coprime-stride loader with a simpler ShuffledSequenceLoader (parameters: null)
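The record gives only the loader's name, so the sketch below is a guess at the obvious implementation: chop the token stream into fixed-length sequences and yield them in a fresh random order each epoch, with no coprime-stride bookkeeping. The interface is hypothetical.

```python
import numpy as np

class ShuffledSequenceLoader:
    """Minimal sketch: partition the token stream into fixed-length
    sequences and yield them in a new random order on each iteration."""
    def __init__(self, tokens, seq_len, seed=0):
        self.tokens = np.asarray(tokens)
        self.seq_len = seq_len
        self.n_seqs = len(self.tokens) // seq_len
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        for i in self.rng.permutation(self.n_seqs):
            yield self.tokens[i * self.seq_len:(i + 1) * self.seq_len]
```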
Other: remove value embeddings (parameters: null)

Novel Contributions

  • Increased vocabulary size from 4096 to 8192 (SP8192).
  • Quantized the embedding matrix with GPTQ instead of round-to-nearest quantization.
  • Removed value embeddings.
  • Replaced the coprime-stride loader with a simpler ShuffledSequenceLoader.
  • Applied depth recurrence by looping layers 4-5 twice with shared parameters.
  • Used row-normalized Muon.
  • Introduced standard-deviation-based clipping for quantization thresholds (SDClip).