PR #546 (closed)
Int5/Int6+Zstd+MLP3x: mean val_bpb=1.1752 (10L, seq4096, sliding window)
by shajalahamedcse
val_bpb: 1.1752
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,708,798 B
Training Techniques

Quantization: mixed int5/int6
- bits: 5, scope: MLP matrices
- bits: 6, scope: attention matrices
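A minimal sketch of the kind of symmetric integer quantization described above; per-tensor scaling, the rounding rule, and int8 storage are assumptions here, not the PR's exact code.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Quantize to signed `bits`-bit integers with one per-tensor scale.

    Sketch only: the PR may use per-channel scales or a different rounding rule.
    """
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = float(np.abs(w).max()) / qmax or 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# int5 for MLP matrices, int6 for attention matrices, per the PR description
# (matrix shapes below are illustrative).
q_mlp, s_mlp = quantize_symmetric(np.random.randn(1536, 512).astype(np.float32), bits=5)
q_attn, s_attn = quantize_symmetric(np.random.randn(512, 512).astype(np.float32), bits=6)
```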
Architecture: MLP3x
Expanded the MLP hidden size from 1024 to 1536 using the artifact-size savings from quantization.
parameters: {"hidden":1536,"baseline_hidden":1024}
Compression: zstd
level: null
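A sketch of the compression step using the python-zstandard bindings; the level here is a placeholder, since the PR leaves it unspecified (null).

```python
import numpy as np
import zstandard  # pip install zstandard

def compress_weights(q: np.ndarray, level: int = 19) -> bytes:
    """Zstd-compress a quantized integer array (level=19 is a placeholder)."""
    return zstandard.ZstdCompressor(level=level).compress(
        np.ascontiguousarray(q).tobytes()
    )

def decompress_weights(blob: bytes, dtype, shape) -> np.ndarray:
    raw = zstandard.ZstdDecompressor().decompress(blob)
    return np.frombuffer(raw, dtype=dtype).reshape(shape)
```

Storing int5/int6 values in int8 leaves the high bits unused; zstd's entropy coding can recover much of that headroom, so explicit bit-packing may not be required.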
Evaluation: sliding window eval
parameters: {"stride":64}
Sequence Length
train_length: 4096
eval_length: 4096
LR Schedule: warmdown
parameters: {"warmdown_iters":3600}
Optimizer: Muon
weight_decay: null
momentum: 0.95
other_params: {"matrix_lr":0.04}
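Muon takes an SGD-momentum step and orthogonalizes each 2-D update with a Newton-Schulz iteration before applying the matrix learning rate. A condensed sketch: the quintic coefficients follow the public Muon reference implementation; other details (Nesterov option, shape-dependent scaling) are omitted.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    tall = X.size(0) > X.size(1)
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

@torch.no_grad()
def muon_step(w, grad, momentum_buf, lr=0.04, momentum=0.95):
    """One Muon update for a 2-D weight; lr and momentum match the PR's
    matrix_lr=0.04 and momentum=0.95, everything else is a sketch."""
    momentum_buf.mul_(momentum).add_(grad)
    w.add_(newton_schulz5(momentum_buf), alpha=-lr)
```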
Initialization: Overtone init
Regularization: weight decay
parameters: null
Novel Contributions
- Int5 quantization for MLP matrices to free artifact space
- Int6 quantization for attention matrices
- Zstd compression of quantized integer arrays
- MLP3x expansion (hidden size 1024 → 1536) enabled by quantization savings
- Training on 4096-token sequences
- Sliding window evaluation with stride 64