PR #131

open

[WIP] add combined optimization, waiting for 8 gpu train

by Billy1900
val_bpb: 1.2701
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4 MB

Training Techniques

Architecture
SwiGLU MLP
Replaces the baseline ReLU-square MLP with a gated SwiGLU feedforward block.
parameters: {"hidden":1024}
tied embeddings
The input token embedding and the output projection (LM head) share the same weight matrix.
parameters: null
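
Weight tying is typically a one-line assignment between the token embedding and the LM head; a sketch assuming standard nn.Embedding / nn.Linear modules (sizes are illustrative, not the PR's config):

    import torch.nn as nn

    vocab_size, dim = 50304, 768               # illustrative sizes
    embed = nn.Embedding(vocab_size, dim)
    lm_head = nn.Linear(dim, vocab_size, bias=False)
    lm_head.weight = embed.weight              # output projection reuses the input embedding matrix
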
Quantization
mixed int6/int8 post-training quantization
bits: 6
scope: transformer block weights; embeddings use int8
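
A hedged sketch of symmetric per-tensor round-to-nearest quantization at the stated bit widths (int6 for transformer block weights, int8 for embeddings); the actual scaling granularity and packing in the PR may differ:

    import torch

    def quantize_dequantize(w: torch.Tensor, bits: int) -> torch.Tensor:
        """Symmetric per-tensor quantization to `bits` and back to float."""
        qmax = 2 ** (bits - 1) - 1                     # 31 for int6, 127 for int8
        scale = w.abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        return q * scale

    block_weight_q = quantize_dequantize(torch.randn(1024, 768), bits=6)   # transformer block weights
    embedding_q = quantize_dequantize(torch.randn(50304, 768), bits=8)     # embeddings
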
STE QAT (straight-through-estimator quantization-aware training)
bits: 6
scope: transformer block weights
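
Straight-through-estimator QAT usually means fake-quantizing weights in the forward pass while letting gradients flow as if the quantizer were the identity; a minimal sketch, not necessarily the PR's exact formulation:

    import torch

    def ste_fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
        """Forward: quantize-dequantize; backward: gradient passes straight through."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
        return w + (w_q - w).detach()   # value of w_q, gradient of w
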
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.025,"tied_embedding_lr":0.035,"scalar_lr":0.025}
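
The momentum_warmup_* parameters read as a ramp of Muon's momentum from 0.92 up to 0.97 over the first 1500 steps; one plausible (linear) interpretation of that schedule:

    def muon_momentum(step: int,
                      warmup_start: float = 0.92,
                      final: float = 0.97,
                      warmup_steps: int = 1500) -> float:
        """Linearly ramp momentum from warmup_start to final over warmup_steps."""
        frac = min(step / warmup_steps, 1.0)
        return warmup_start + frac * (final - warmup_start)
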
Evaluation
sliding window eval
parameters: {"stride":256}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
cosine decay with linear warmup
parameters: {"warmup_steps":200,"min_lr_ratio":0.05}
Other
other
Adaptive training configuration that selects the sequence length, gradient-accumulation steps, batch tokens, and evaluation stride based on the number of GPUs.
parameters: {"train_seq_len_1_gpu":1024,"train_seq_len_8_gpu":2048,"grad_accum_steps_1_gpu":2,"grad_accum_steps_8_gpu":1}

Novel Contributions

  • SwiGLU MLP replacing the baseline ReLU-square MLP
  • Mixed int6/int8 post-training quantization
  • Optional STE fake-int6 quantization-aware training
  • Cosine learning rate schedule with warmup
  • Sliding window evaluation with configurable stride
  • Adaptive training configuration for 1 GPU vs 8 GPU runs
  • Tuned Muon optimizer hyperparameters
  • Tied input/output embeddings