PR #160 (open)

Record: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623)

by ChaseWNorton
val_bpb: 1.1623
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,910,904 bytes (~15.2 MiB)

Training Techniques

Architecture
MLP3x
Increased feedforward capacity from 2x to 3x while keeping the baseline Transformer backbone.
parameters: {"mlp_mult":3}
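The widened feedforward block amounts to scaling the hidden dimension by `mlp_mult`. A minimal sketch, assuming a hypothetical `d_model` of 768 and a plain ReLU activation (the record does not specify the baseline's activation):

```python
import numpy as np

def mlp_block(x, mlp_mult=3, rng=np.random.default_rng(0)):
    """Feedforward block with hidden width mlp_mult * d_model.

    mlp_mult=3 is the record's change (up from the baseline's 2);
    d_model, the ReLU, and the random init are illustrative.
    """
    d_model = x.shape[-1]
    d_hidden = mlp_mult * d_model
    w_in = rng.standard_normal((d_model, d_hidden)) * d_model ** -0.5
    w_out = rng.standard_normal((d_hidden, d_model)) * d_hidden ** -0.5
    h = np.maximum(x @ w_in, 0.0)   # ReLU; actual baseline activation unspecified
    return h @ w_out

x = np.ones((4, 768))               # (tokens, d_model)
y = mlp_block(x)
print(y.shape)                      # (4, 768)
```

The extra capacity costs roughly 1.5x the baseline MLP parameters, which the int6 quantization below helps absorb in the artifact-size budget.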
tied embeddings
Uses tied input/output embeddings.
parameters: {"tie_embeddings":1}
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
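With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads, halving KV parameters and cache. A minimal numpy sketch under assumed dimensions (d_model=64, so head_dim=8; weights are random placeholders):

```python
import numpy as np

def grouped_query_attention(x, num_heads=8, num_kv_heads=4):
    """Causal grouped-query attention: num_heads query heads share
    num_kv_heads KV heads (here a 2:1 grouping, as in the record)."""
    T, d_model = x.shape
    head_dim = d_model // num_heads
    rng = np.random.default_rng(0)
    wq = rng.standard_normal((d_model, num_heads * head_dim)) * d_model ** -0.5
    wkv = rng.standard_normal((d_model, 2 * num_kv_heads * head_dim)) * d_model ** -0.5
    q = (x @ wq).reshape(T, num_heads, head_dim)
    kv = (x @ wkv).reshape(T, 2, num_kv_heads, head_dim)
    k, v = kv[:, 0], kv[:, 1]
    group = num_heads // num_kv_heads
    # Broadcast each KV head to the query heads in its group.
    k = np.repeat(k, group, axis=1)          # (T, num_heads, head_dim)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / head_dim ** 0.5
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v).reshape(T, -1)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 64))             # (tokens, d_model), illustrative
out = grouped_query_attention(x)
print(out.shape)                             # (8, 64)
```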
RoPE
Uses rotary positional embeddings with RMSNorm and a U-Net-style skip structure inherited from the baseline.
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"tied_embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02,"warmup_steps":20,"warmdown_iters":3000}
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
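The warmup + warmdown schedule is trapezoidal: a linear ramp up over `warmup_steps`, a flat plateau, then a linear decay to zero over the final `warmdown_iters`. A sketch using the record's parameters (the total step count is not stated, so it is an input here):

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR multiplier: linear warmup, flat, linear warmdown to 0.

    warmup_steps and warmdown_iters match the record's optimizer params;
    total_steps is hypothetical since the record omits it.
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps        # linear ramp up
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return steps_left / warmdown_iters      # linear decay to 0
    return 1.0                                  # flat plateau

# e.g. with a hypothetical 10,000-step run:
print(lr_scale(0, 10_000), lr_scale(5_000, 10_000), lr_scale(9_999, 10_000))
```

Each parameter group (tied embeddings, matrices, scalars) applies this multiplier to its own base LR (0.03, 0.02, 0.02 respectively).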
Quantization
mixed int6/int8
bits: 6
scope: most tensors, with int8 token embedding
QAT
bits: null
scope: supported for the timed run / submission artifact, but not activated before the run stopped
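The mixed int6/int8 scheme above can be sketched as symmetric absmax quantization, with a wider 8-bit grid reserved for the token embedding. The exact scaling and grouping in QGv3 is not specified in the record, so per-tensor absmax here is an assumption (and the int6 values are merely clipped to [-32, 31]; true 6-bit packing would happen at serialization):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor absmax quantization to signed `bits`-bit ints.

    Sketch only: the record's QGv3 grouping/scaling details are unknown.
    """
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
emb = rng.standard_normal((4096, 64))           # stand-in tensor
q8, s8 = quantize_symmetric(emb, bits=8)        # token embedding stays int8
q6, s6 = quantize_symmetric(emb, bits=6)        # most other tensors: int6
```

Keeping the embedding at int8 trades a small amount of artifact size for lower quantization error on the table that both input and output layers share (via tying).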
Compression
lzma
level: null
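After serialization, the artifact is LZMA-compressed. A sketch with Python's standard `lzma` module; the record's compression level is unspecified, so `preset=9` is an assumption, and the clipped-Gaussian int8 buffer stands in for the real quantized weights:

```python
import lzma

import numpy as np

# Stand-in for the serialized int-quantized weight buffer (assumption:
# clipped-Gaussian int8 data, which has well under 8 bits of entropy).
rng = np.random.default_rng(0)
fake_weights = np.clip(np.round(rng.standard_normal(100_000) * 4),
                       -32, 31).astype(np.int8)
raw = fake_weights.tobytes()

packed = lzma.compress(raw, preset=9)   # preset=9 is an assumption
restored = lzma.decompress(packed)      # lossless roundtrip
print(len(raw), '->', len(packed))
```

Because low-bit quantization concentrates the byte distribution, LZMA recovers a meaningful fraction of the entropy gap, which is what makes int6 storage pay off under the size cap.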
Evaluation
sliding window eval
parameters: {"seq_len":2048,"stride":256}
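Sliding-window evaluation with seq_len=2048 and stride=256 scores each window's final 256 tokens only, so every scored token after the first window sees at least 1792 tokens of context. A sketch of the window bookkeeping (the function name is illustrative):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=256):
    """Return (ctx_start, end, score_from) per window: the window covers
    [ctx_start, end) but only tokens in [score_from, end) add to the loss,
    so no token is scored twice and late tokens get long context."""
    spans = []
    end = min(seq_len, n_tokens)
    spans.append((0, end, 0))               # first window scores everything
    while end < n_tokens:
        prev_end = end
        end = min(end + stride, n_tokens)
        spans.append((end - seq_len, end, prev_end))
    return spans

spans = sliding_window_spans(3000)
print(spans[:2])   # [(0, 2048, 0), (256, 2304, 2048)]
```

The extra context per scored token is what lowers val_bpb relative to chunked non-overlapping evaluation, at the cost of ~seq_len/stride times more forward passes.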
Other
Grouped QGv3 serialization was used to reduce artifact overhead before compression.
parameters: null

Novel Contributions

  • Increased feedforward capacity from 2x to 3x
  • Trained and evaluated at sequence length 2048
  • Used grouped QGv3 serialization to reduce artifact overhead
  • Kept token embeddings at int8 while quantizing most other tensors to int6
  • Applied sliding-window evaluation to improve the final under-cap score
  • Repacked the timed checkpoint into a submission-valid LZMA-compressed artifact