PR #99
Open submission: Int6 MLP3x + Late-K Passthrough + SlidingWindow (val_bpb: 1.1605)
by takhir-iota
val_bpb
1.1605
Architecture
GPT
Optimizer
Muon
Artifact Size
15,844,924 bytes
Training Techniques
Quantization
mixed int6/int8
bits: 6
scope: .mlp., .attn.c_q., .attn.c_v., .attn.proj. in int6; .attn.c_k. mostly grouped int8; selected late-layer c_k and tok_emb in fp16
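The scope list above describes which tensors get 6-bit codes. A minimal sketch of the symmetric low-bit quantizer this implies (the packing of 6-bit codes and the exact rounding scheme are assumptions; codes are stored in an int8 array here for simplicity):

```python
import numpy as np

def quantize_symmetric(w, bits=6):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    Codes are held in an int8 array; a real artifact would pack 6-bit
    codes more tightly to hit the byte budget."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)  # e.g. one .mlp. matrix
q, scale = quantize_symmetric(w, bits=6)
w_hat = dequantize(q, scale)
```

With a single per-tensor scale, every reconstructed weight is within half a quantization step of the original, which is why the scheme reserves fp16 only for the most sensitive tensors (late-layer c_k, tok_emb).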
Architecture
MLP3x
Uses a 3x MLP expansion to widen the hidden layer within the byte budget.
parameters: {"mlp_mult":3,"num_layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"tie_embeddings":1}
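A rough matrix-parameter accounting under the config above (a sketch only: it assumes no-bias linear layers, GQA-style K/V projections sized by num_kv_heads, and a two-matrix MLP; the submission's exact block layout may differ):

```python
# Config from the parameters line above.
model_dim, num_layers, mlp_mult = 512, 9, 3
num_heads, num_kv_heads = 8, 4
head_dim = model_dim // num_heads                      # 64

q_proj    = model_dim * model_dim                      # attn.c_q
kv_proj   = 2 * model_dim * (num_kv_heads * head_dim)  # attn.c_k + attn.c_v (GQA)
attn_out  = model_dim * model_dim                      # attn.proj
mlp       = 2 * model_dim * (mlp_mult * model_dim)     # c_fc + c_proj at 3x width
per_layer = q_proj + kv_proj + attn_out + mlp
total     = per_layer * num_layers
print(total)
```

The 3x MLP accounts for two thirds of each block's matrix parameters, which is why pushing those matrices to int6 buys the most bytes.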
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03}
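The muon_momentum_warmup_* parameters suggest a momentum ramp at the start of training. A minimal sketch, assuming linear interpolation (the actual curve shape is not stated in the submission):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`,
    then hold it constant. Linear interpolation is an assumption."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

A lower momentum early on keeps the orthogonalized updates from amplifying noisy initial gradients; the terminal 0.99 matches the momentum value reported above.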
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
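One way to realize stride-64 sliding-window evaluation (a sketch; function and variable names are assumptions): score each token exactly once, giving every token after the first window at least window − stride tokens of left context.

```python
def eval_spans(n_tokens, window=1024, stride=64):
    """Plan a sliding-window evaluation pass. Returns
    (ctx_start, score_start, score_end) triples: the model is run on
    tokens [ctx_start, score_end) but only positions
    [score_start, score_end) contribute to the bpb sum, so every token
    is scored once with near-full context."""
    spans, score_start = [], 0
    while score_start < n_tokens:
        step = window if score_start == 0 else stride
        score_end = min(score_start + step, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans

spans = eval_spans(1200)
```

The small stride trades roughly window/stride forward passes per token for context: with stride 64 and window 1024, every scored token sees at least 960 tokens of history.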
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
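A warmdown schedule typically holds the LR constant and then decays it linearly to zero over the final steps. A minimal sketch (linear shape and the total step count are assumptions; only warmdown_steps=3000 and the 0.02 base LR come from the submission):

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then a linear 'warmdown' to zero over the last
    `warmdown_steps` steps of training."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return base_lr
    return base_lr * remaining / warmdown_steps
```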
Initialization
QK gain init
Uses QK_GAIN_INIT=1.7 for attention initialization scaling.
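A sketch of what a gain-scaled Q/K init could look like: a plain 1/sqrt(fan_in) normal init multiplied by the gain. Only the 1.7 value comes from the submission; the base initializer is an assumption.

```python
import numpy as np

QK_GAIN_INIT = 1.7  # from the submission

def init_qk_weight(fan_in, fan_out, rng, gain=QK_GAIN_INIT):
    """Normal init with std = gain / sqrt(fan_in) for the c_q / c_k
    projections (base init scheme is assumed, not confirmed)."""
    std = gain / np.sqrt(fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
w_q = init_qk_weight(512, 512, rng)
```

A gain above 1 sharpens the initial attention logits relative to a standard init, which can speed up early attention specialization.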
Other
other
Selective late-layer K preservation keeps blocks.7.attn.c_k.weight and blocks.8.attn.c_k.weight in fp16 while other c_k matrices use grouped int8.
parameters: {"group_size":64}
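The grouped int8 path for the remaining c_k matrices can be sketched as follows (a sketch of grouped quantization with group_size=64; scale storage and packing details are assumptions):

```python
import numpy as np

def quantize_grouped_int8(w, group_size=64):
    """Symmetric int8 quantization with one scale per contiguous group
    of `group_size` weights, so outliers only inflate the step size of
    their own group rather than the whole tensor."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero groups
    q = np.round(g / scale).astype(np.int8)
    return q, scale

def dequantize_grouped(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 512)).astype(np.float32)  # e.g. an early c_k
q, scale = quantize_grouped_int8(w)
w_hat = dequantize_grouped(q, scale, w.shape)
```

Per-group scales keep the int8 K matrices accurate enough that only the two late-layer c_k tensors named above need full fp16.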
Novel Contributions
- Int6 mixed quantization of MLP and attention projections
- 3x MLP expansion to improve val_bpb under the byte budget
- Selective preservation of late-layer attention K weights in fp16
- Grouped int8 quantization for remaining K matrices with group size 64
- Sliding-window evaluation with stride 64 for near-full context