## Summary

| Metric | Value |
|---|---|
| val_bpb | 1.1704 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact size | 13.5 MB |
## Training Techniques
### Quantization: int6
- bits: 6
- scope: all
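The source does not spell out the quantization scheme, so here is a minimal sketch of symmetric per-tensor 6-bit quantization (values mapped to integers in [-31, 31]); the function names, per-tensor scale, and rounding choice are assumptions, and the real artifact likely also bit-packs the values to realize the 6-bit storage savings.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric quantization to 6 bits: integers in [-31, 31] plus one scale.

    Per-tensor scale is an assumption; the real code may use per-channel scales.
    """
    m = float(np.abs(w).max())
    scale = m / 31.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

At 6 bits per weight instead of 8, the quantized payload shrinks by 25%, which is the artifact budget the MLP widening below spends.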
### Architecture: MLP3x
Widened the MLP from 2x to 3x, using artifact space freed by int6 quantization.
- parameters: `{"hidden": 1536}`
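A minimal sketch of the widened MLP block. The model width of 512 is an inference from `hidden: 1536` being 3x; the ReLU activation and the zero-initialized output projection are assumptions, not confirmed by the source.

```python
import numpy as np

d_model = 512           # assumed model width (1536 = 3 * 512)
d_hidden = 3 * d_model  # widened from the usual 2x to 3x

rng = np.random.default_rng(0)
W_in = (rng.standard_normal((d_model, d_hidden)) * 0.02).astype(np.float32)
W_out = np.zeros((d_hidden, d_model), dtype=np.float32)  # assumed zero-init projection

def mlp(x: np.ndarray) -> np.ndarray:
    # ReLU is a placeholder; the actual activation is unspecified in the source.
    h = np.maximum(x @ W_in, 0.0)
    return h @ W_out

x = rng.standard_normal((4, d_model)).astype(np.float32)
y = mlp(x)
```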
### LR Schedule: cosine warmdown
- formula: `0.5 * (1 + cos(pi * progress))`
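The listed formula as a learning-rate multiplier, where `progress` runs from 0 at the start of training to 1 at the end:

```python
import math

def cosine_warmdown(progress: float) -> float:
    """LR multiplier per the listed formula: 1.0 at progress=0, 0.0 at progress=1."""
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

Unlike linear decay, the cosine curve holds the rate near its peak early on and flattens again near zero at the end.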
### Initialization: OrthoInit
Orthogonal weight initialization for all linear layers except those that are zero-initialized.
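A sketch of orthogonal initialization via QR decomposition of a Gaussian matrix, the same idea as `torch.nn.init.orthogonal_`; the function name and gain handling here are illustrative, not the author's code.

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Semi-orthogonal matrix of the given shape via QR of a Gaussian draw."""
    rows, cols = shape
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix signs so the distribution is uniform
    if rows < cols:
        q = q.T
    return (gain * q[:rows, :cols]).astype(np.float32)

W = orthogonal_init((256, 512))
```

For a wide matrix like this, the rows come out orthonormal, so `W @ W.T` is (numerically) the identity.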
### Compression: zstd
- level: 22
### Evaluation: sliding window eval
- stride: 64
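A sketch of how stride-64 sliding-window evaluation typically partitions a token sequence: windows advance by the stride, and only tokens not already scored by a previous window contribute to the loss, so each token is counted exactly once. The window size of 128 is illustrative; the source only specifies the stride.

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int):
    """Return (begin, end, n_scored) spans covering the sequence once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only the new tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(200, 128, 64)
```

Each window still conditions on up to `window - stride` tokens of overlap context, which is why this tends to report a lower (more favorable) bpb than disjoint chunking.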
### Other: RoPE base
RoPE base set to 50,000 via environment variable.
- parameters: `{"rope_base": 50000}`
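The RoPE base only enters the model through the per-pair rotation frequencies. A minimal sketch of that computation with the listed base; the head dimension of 64 is an assumption.

```python
def rope_inv_freq(head_dim: int, base: float = 50000.0):
    """Inverse frequencies base**(-2i/d) for each of the head_dim//2 rotation pairs."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

inv = rope_inv_freq(64, base=50000.0)
```

Raising the base from a smaller default slows the rotation of the low-frequency pairs, which is the usual lever for stretching the usable context length.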
### Other: GQA compatibility
Manual KV-head repeat used for GQA compatibility instead of the `enable_gqa` flag.
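A sketch of the manual repeat: each KV head is duplicated along the head axis until the K/V tensors match the query head count, reproducing what SDPA's `enable_gqa` flag does internally. The head counts here (4 KV heads, 16 query heads) are illustrative; in PyTorch this would be `torch.repeat_interleave` on the head dimension.

```python
import numpy as np

def repeat_kv(kv: np.ndarray, n_rep: int) -> np.ndarray:
    """Expand (batch, n_kv_heads, seq, head_dim) to (batch, n_kv_heads * n_rep, seq, head_dim).

    np.repeat duplicates each head consecutively, matching repeat_interleave semantics.
    """
    if n_rep == 1:
        return kv
    return np.repeat(kv, n_rep, axis=1)

k = np.zeros((2, 4, 16, 64), dtype=np.float32)  # 4 KV heads
k_full = repeat_kv(k, 4)                         # match 16 query heads
```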
## Novel Contributions
- Switched from int8 to int6 quantization to free artifact budget
- Used freed space to widen the MLP from 2x to 3x
- Replaced linear LR decay with cosine warmdown
- Applied orthogonal initialization to linear layers
- Used zstd level 22 for artifact compression
- Set RoPE base to 50k
- Implemented a GQA compatibility fix via manual KV head repeat