PR #1394


Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5 seed mean)

by clarkkev
val_bpb: 1.0856
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,985,678 bytes

Training Techniques

Quantization: GPTQ (bits: 6, scope: matrix parameters)
Quantization: GPTQ (bits: 8, scope: embeddings)
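For reference, a heavily simplified sketch of what GPTQ does differently from plain round-to-nearest: it quantizes a weight matrix column by column and feeds each column's rounding error back into the not-yet-quantized columns through the inverse Hessian of the layer inputs. The function below is illustrative only (single symmetric scale, dense matrix inverse); the real algorithm uses per-group scales, a Cholesky factorization, and block updates.

```python
import numpy as np

def gptq_quantize(W, X, bits=8, damp=0.01):
    """Simplified GPTQ sketch: quantize W (out_features x in_features)
    column by column, propagating each column's rounding error into the
    remaining columns via the inverse Hessian H = X^T X of the inputs."""
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(cols)  # dampening for stability
    Hinv = np.linalg.inv(H)

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax                  # one symmetric scale (simplified)
    Q = np.zeros_like(W)
    for j in range(cols):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        if j + 1 < cols:
            # error feedback: adjust unquantized columns to compensate
            W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```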
Architecture: depth recurrence (loop layers 4-5 twice while sharing parameters; parameters: {"layers":[4,5],"loops":2})
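The depth-recurrence schedule amounts to expanding the layer execution order so the looped span runs back to back, reusing the same weights each pass. A minimal sketch (function name and interface are ours, not from the record):

```python
def layer_schedule(n_layers, loop_layers=(4, 5), loops=2):
    """Execution order for depth recurrence: the looped span of layers
    runs `loops` times consecutively, with parameters shared across passes."""
    lo, hi = loop_layers
    return (list(range(lo))
            + list(range(lo, hi + 1)) * loops
            + list(range(hi + 1, n_layers)))

# e.g. an 8-layer model looping layers 4-5 twice:
# layer_schedule(8) -> [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]
```

The forward pass then simply applies `blocks[i]` for each `i` in the schedule, so extra depth is gained at zero parameter cost.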
Optimizer: Muon (weight_decay: null, momentum: null, other_params: {"row_normalized":true})
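Muon's core step orthogonalizes the (momentum-smoothed) gradient with a Newton-Schulz iteration. The sketch below uses the commonly published quintic coefficients and omits momentum; our reading of `row_normalized: true` (rescaling each row of the orthogonalized update to unit L2 norm) is an assumption, not something the record specifies.

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (the core of the Muon update) with the
    quintic Newton-Schulz iteration and its commonly used coefficients."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    flip = X.shape[0] > X.shape[1]   # iterate on the smaller Gram matrix
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X

def muon_row_normalized_update(grad, eps=1e-7):
    """Assumed meaning of `row_normalized: true`: rescale each row of the
    orthogonalized update to unit L2 norm before applying it."""
    U = newton_schulz_orth(grad)
    return U / (np.linalg.norm(U, axis=1, keepdims=True) + eps)
```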
Evaluation: sliding window eval (parameters: null)
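Sliding-window evaluation scores a long token stream in overlapping windows so that every token (after the first window) is predicted with substantial left context, while each token is scored exactly once. The record does not give the window or stride, so the helper below just computes the spans; `window` and `stride` values are placeholders.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (context_start, score_start, score_end) triples for
    sliding-window evaluation. Each token outside the first window is
    scored with at least `window - stride` tokens of left context, and
    the scored spans tile [0, n_tokens) exactly once."""
    spans = [(0, 0, min(window, n_tokens))]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), pos, end))
        pos = end
    return spans
```

At eval time, the model runs on `tokens[context_start:score_end]` for each span but only the losses over `[score_start, score_end)` contribute to val_bpb.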
Sequence Length: train_length: 8192, eval_length: null
Other: increase vocabulary size to 8192 using the SP8192 tokenizer/data (parameters: {"vocab_size":8192})
Other: standard-deviation-based clipping for quantization thresholds, SDClip (parameters: {"clip_scale_k":12.85,"embedding_clip_scale_k":20})
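SDClip, as described, sets the quantization clipping threshold to a multiple k of the tensor's standard deviation rather than using the raw absolute maximum, which makes the quantization grid robust to a few extreme outliers. A minimal sketch, assuming a symmetric round-to-nearest grid after clipping (the record does not specify the grid):

```python
import numpy as np

def sdclip_quantize(w, bits, clip_scale_k):
    """SDClip sketch: clip the tensor at k standard deviations, then
    quantize uniformly on a symmetric grid sized to the clip threshold.
    (The round-to-nearest mapping after clipping is our assumption.)"""
    t = clip_scale_k * w.std()           # std-based clipping threshold
    qmax = 2 ** (bits - 1) - 1
    scale = t / qmax
    clipped = np.clip(w, -t, t)
    return np.round(clipped / scale) * scale

# per the record's parameters:
#   matrix parameters: sdclip_quantize(w, bits=6, clip_scale_k=12.85)
#   embeddings:        sdclip_quantize(e, bits=8, clip_scale_k=20)
```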
Other: replace the coprime-stride loader with a simpler ShuffledSequenceLoader (parameters: null)
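The record gives only the loader's name, so the sketch below is a guess at the obvious implementation: chop the token stream into fixed-length sequences and yield them in a fresh random order each epoch, with no coprime-stride bookkeeping. The interface is hypothetical.

```python
import numpy as np

class ShuffledSequenceLoader:
    """Minimal sketch: partition the token stream into fixed-length
    sequences and yield them in a new random order on each iteration."""
    def __init__(self, tokens, seq_len, seed=0):
        self.tokens = np.asarray(tokens)
        self.seq_len = seq_len
        self.n_seqs = len(self.tokens) // seq_len
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        for i in self.rng.permutation(self.n_seqs):
            yield self.tokens[i * self.seq_len:(i + 1) * self.seq_len]
```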
Other: remove value embeddings (parameters: null)

Novel Contributions

  • Increased vocabulary size from 4096 to 8192 (SP8192).
  • Quantized the embedding matrix with GPTQ instead of round-to-nearest quantization.
  • Removed value embeddings.
  • Replaced the coprime-stride loader with a simpler ShuffledSequenceLoader.
  • Applied depth recurrence by looping layers 4-5 twice with shared parameters.
  • Used row-normalized Muon.
  • Introduced standard-deviation-based clipping for quantization thresholds (SDClip).