val_bpb: 1.1732
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15846677 bytes (~15.1 MiB)
Training Techniques
Architecture: tied embeddings
Uses tied input/output embeddings, with fp16 passthrough for the shared embedding/output-head tensor.
parameters: null
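Tying means a single matrix serves both as the token-embedding table and, transposed, as the output head, so that parameter block is stored once (here in fp16). A minimal NumPy sketch; the class and method names are illustrative, not the submission's code:

```python
import numpy as np

class TinyTiedLM:
    """Sketch of tied input/output embeddings (illustrative only)."""

    def __init__(self, vocab, dim, seed=0):
        rng = np.random.default_rng(seed)
        # fp16 passthrough: the shared tensor is kept in half precision
        self.W = rng.standard_normal((vocab, dim)).astype(np.float16)

    def embed(self, token_ids):
        # input side: row lookup into the shared matrix -> (T, dim)
        return self.W[token_ids]

    def logits(self, hidden):
        # output head reuses the same tensor, transposed -> (T, vocab)
        return hidden @ self.W.T

model = TinyTiedLM(vocab=256, dim=32)
h = model.embed([1, 2, 3])
print(h.shape, model.logits(h).shape)
```

Because both directions share `self.W`, any gradient update to the embedding also moves the output head, which is the point of tying.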
Quantization: mixed int6/int8
bits: 8 (default; overridden to 6 for the forced blocks)
scope: all weights by default, with middle blocks 3, 4, 5, 6 forced to int6; embeddings and the LM head are kept in fp16
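The mixed policy can be sketched as symmetric per-tensor quantization plus a per-block bit-width rule. The rounding scheme and helper names below are a plausible reconstruction, not the submission's actual exporter:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization sketch (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8, 31 for int6
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                            # int6 values fit in int8 storage

def bits_for_block(i, forced_int6=(3, 4, 5, 6)):
    # mixed export policy: middle blocks forced to int6, others int8;
    # embeddings and LM head would be skipped entirely (kept fp16)
    return 6 if i in forced_int6 else 8

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize(w, bits_for_block(4))
print(q.dtype, float(np.abs(w - q * s).max()))
```

Dequantization is just `q * s`; the round-trip error is bounded by half the scale, which is why int6 on only the middle blocks trades a little fidelity for a smaller artifact.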
Optimizer: Muon
weight_decay: 0.02
momentum: 0.99 (warmed up from 0.92 over the first 1500 steps)
other_params: matrix_lr 0.02, scalar_lr 0.02
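The momentum settings imply a warmup from 0.92 to the final 0.99 over the first 1500 steps. A linear ramp is assumed below, since the card does not specify the interpolation:

```python
def momentum_at(step, start=0.92, final=0.99, warmup_steps=1500):
    """Assumed linear momentum warmup matching the reported parameters."""
    if step >= warmup_steps:
        return final
    # linear interpolation from start to final over warmup_steps
    return start + (step / warmup_steps) * (final - start)

print(momentum_at(0), momentum_at(750), momentum_at(1500))
```

Warming momentum up rather than starting at 0.99 keeps early updates from being dominated by noisy initial gradients.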
Initialization: spectral init, resid mix
spectral init: spectral embedding initialization
resid mix: phase residual mixing initialization
Evaluation: sliding window eval
parameters: context_length 1024, stride 64
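With context_length 1024 and stride 64, each evaluation window after the first scores only its 64 newest tokens, so almost every token is predicted with near-full left context. A sketch of the window bookkeeping (the model's scoring itself is omitted, and the function name is illustrative):

```python
def sliding_windows(n_tokens, context_length=1024, stride=64):
    """Enumerate (start, end, n_scored) spans for strided evaluation."""
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + context_length, n_tokens)
        # only tokens not covered by a previous window are scored here
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(1200)
print(len(spans), sum(n for _, _, n in spans))
```

Every token is scored exactly once: the first window scores its full 1024 tokens, later windows only their final stride-sized slice.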
Sequence Length
sequence_length
train_length: null
eval_length: 1024
LR Schedule: warmdown
parameters: warmdown_iters 3000
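A warmdown with warmdown_iters 3000 suggests a trapezoidal schedule: hold the learning rate flat, then decay to zero over the final 3000 iterations. The linear decay shape is an assumption:

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    """Assumed trapezoidal schedule: flat, then linear warmdown to 0."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return 1.0                                  # constant phase
    # linear decay over the last warmdown_iters steps
    return max(0.0, (total_steps - step) / warmdown_iters)

print(lr_scale(0, 10000), lr_scale(8500, 10000), lr_scale(10000, 10000))
```

The returned value is a multiplier on the base learning rates (matrix_lr/scalar_lr above).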
Compression: zlib
level: null
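zlib is a lossless byte-level compressor, so the exported weights round-trip exactly; only the artifact size changes. A minimal sketch, where reading the null level as zlib's library default (-1, equivalent to level 6) is an assumption:

```python
import zlib

def compress_artifact(raw: bytes, level: int = -1) -> bytes:
    # level -1 selects zlib's default; the card's "level: null" is
    # assumed to mean this default rather than a specific setting
    return zlib.compress(raw, level)

payload = bytes(range(256)) * 1000          # stand-in for serialized weights
packed = compress_artifact(payload)
print(len(payload), "->", len(packed))
```

Since the quantized int6/int8 tensors have lower entropy than raw fp16, this is where the export recovers extra room under the size cap.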
Regularization: weight decay
parameters: value 0.02
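In the common decoupled (AdamW-style) form, a weight decay of 0.02 shrinks each weight toward zero by lr * wd per step, separately from the gradient update; whether Muon applies it in exactly this form is an assumption:

```python
def apply_weight_decay(weights, lr, wd=0.02):
    # decoupled decay sketch: scale by (1 - lr * wd) each step,
    # independent of the gradient term (assumed AdamW-style form)
    return [w * (1.0 - lr * wd) for w in weights]

print(apply_weight_decay([1.0, -2.0], lr=0.02))
```

With lr 0.02 and wd 0.02 the per-step shrink factor is 0.9996, i.e. a gentle pull toward zero.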
Novel Contributions
- Improved on the prior valid mid6 run by lowering TIED_EMBED_LR from 0.10 to 0.08.
- Kept the 10-layer sliding-window family recipe with 1024/64 sliding evaluation.
- Used a mixed export policy: only middle blocks 3, 4, 5, 6 are forced to int6, while embeddings and the LM head stay in fp16.
- Retained the stronger Muon crossover schedule, including its warmup and warmdown settings.
- Achieved a new best validation score for this submission family under the 16 MB cap.